Sunday, October 21, 2012

Data Analysis Case Study


RJ Corporate has initiated a program to activate/inactivate irrelevant data in four source systems: XXX, YYY, ZZZ, and DDD. The data in these systems will then be profiled and cleansed to maintain a single source of truth for all supplier data received from these source applications.

The current source applications have limitations in providing the platform necessary to maintain a clean, single place for true supplier data. In addition, the existing data quality issues in these applications require that a strong DQ solution be built to fulfil the program objectives.

This requirement specification provides the data quality rules that need to be created and maintained for clean supplier data.

Process:-

We started with source system ZZZ, which is, I believe, the most up-to-date and important source system, because activation and deactivation of records in the other systems depends on ZZZ.

I am the sole owner of all data analyst activity on this project; there are other people working on it as well, but they handle routine activities.

First I decided the target table structure for ZZZ, because I was getting all 13 lakh records from a report generated by GE. I created a stored procedure to load the target table SI from the table where we dumped all the records from the SI-generated reports, applying some validation rules, cleansing rules, and rejection criteria, so that after loading the main DQ_SI table we could take a good look at the SI_EXCEPTION_TABLE and decide what to do with the rejected records. We then reviewed the SI_EXCEPTION table, worked out further strategy, and moved some of the exception records into the main table after additional cleansing and transformation. The rest we left in the exception table, because they could not pass any validation criteria.
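As a rough illustration of that load, here is a minimal sketch, not the actual procedure: the staging table SI_STAGE, the column list, and the specific rules are assumptions, and the real procedure applied far more validation and cleansing than this.

     -- Minimal sketch (assumed staging table SI_STAGE and assumed columns/rules):
     -- rows that pass validation are cleansed and loaded into DQ_SI,
     -- rows that fail are routed to SI_EXCEPTION_TABLE with a reject reason.
     CREATE OR REPLACE PROCEDURE load_dq_si AS
     BEGIN
       INSERT INTO dq_si (supplier_id, supplier_name, city, country)
       SELECT supplier_id,
              TRIM(UPPER(supplier_name)),      -- example cleansing rule
              TRIM(city),
              TRIM(country)
         FROM si_stage
        WHERE supplier_id IS NOT NULL
          AND supplier_name IS NOT NULL;

       INSERT INTO si_exception_table (supplier_id, supplier_name, city, country, reject_reason)
       SELECT supplier_id, supplier_name, city, country,
              CASE
                WHEN supplier_id IS NULL THEN 'MISSING KEY'
                ELSE 'MISSING SUPPLIER NAME'
              END
         FROM si_stage
        WHERE supplier_id IS NULL
           OR supplier_name IS NULL;

       COMMIT;
     END;
     /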

We then did the same thing for XXX, YYY, and DDD.

Now we are working on the cross-comparison between these four systems for activation and inactivation of records across all the source systems.

Tool :- Oracle 11g.

Analysis Process :- The final objective of this analysis is to find those records in XXX which can be deactivated based on the SI status flag. But there was no single standard column on which we could base this decision, so we started a series of analyses using different combinations of fields between YYY and XXX: their likely key combinations, their supplier names, and their address, city, and country combination.

Below is the process step by step.

  1. First we added an extra column named MATCH to the XXX system, simply to find out how many records match between YYY and XXX based on different key column comparisons (there were 10 such scenarios in total). We added this extra column because if we had simply joined the YYY and XXX tables, we would not have gotten the exact number of records. A sketch of this flagging step is shown after this list.

  2. After doing this, we extracted all the records for each scenario, based on the join condition, into an Excel sheet.

  3. Then we did some more comparisons to find the exact matches, which was purely a functional requirement; for some records the KEY matched but the supplier name did not. So we did extra comparisons in the sheet to find more matching records.

  4. After this step, we took the KEY column values from all the different scenario sheets into a single sheet, which gave us the final set of KEY values that we are going to deactivate in the XXX system.
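
Below is a minimal sketch of the flagging approach for one of the ten key scenarios. The table names xxx_supplier and yyy_supplier and the key columns (ven_id, org_id) are assumptions for illustration; the actual key combinations differed per scenario.

     -- Sketch of one matching scenario (table and key column names are assumed):
     -- flag XXX records whose key combination also exists in YYY, rather than joining,
     -- so that a one-to-many join cannot inflate the record count.
     UPDATE xxx_supplier x
        SET x.match = 'Y'
      WHERE EXISTS (SELECT 1
                      FROM yyy_supplier y
                     WHERE y.ven_id = x.ven_id
                       AND y.org_id = x.org_id);

     -- The count of matching records per scenario then comes straight off the flag
     SELECT COUNT(*) FROM xxx_supplier WHERE match = 'Y';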
Some scenarios:-

1)     When dealing with millions of records in Oracle, it is really tough to update them in a small amount of time, because an UPDATE operation is always time consuming. For our third system, ZZZ, we needed to add 6 extra columns to the main table, which were coming from 6 different tables. Since this requirement came at a later point, there seemed to be no alternative to using UPDATE.

Solution:- Instead of updating all 6 columns from 6 different tables into the main table, we first combined all 6 tables into a single table using their key. That combination was a UNION ALL of all six tables, including the key columns and the extra required columns. We did a GROUP BY over the union and inserted the result into the new table. And it took only 2 seconds... bingo! :) Check out the query below.


     -- Combine the per-attribute tables into one row per key with a single INSERT,
     -- instead of running separate UPDATEs against the main table.
     INSERT INTO load_attr (ven_id, ven_si_id, org_id, amount, amount_date, payment)
     SELECT ven_id, ven_si_id, org_id,
            MAX(amount), MAX(amount_date), MAX(payment)
       FROM (
             SELECT ven_id, ven_si_id, org_id,
                    amount,
                    NULL AS amount_date,
                    NULL AS payment
               FROM invoice_1
             UNION ALL
             SELECT ven_id, ven_si_id, org_id,
                    NULL AS amount,
                    amount_date,
                    NULL AS payment
               FROM invoice_2
             UNION ALL
             SELECT ven_id, ven_si_id, org_id,
                    NULL AS amount,
                    NULL AS amount_date,
                    payment
               FROM invoice_3
            )
      GROUP BY ven_id, ven_si_id, org_id;


The next problem was to update the main master table from the newly created table. Since we had millions of records in the master table, we again faced the same problem with UPDATE. So instead we did a left outer join between the master table and the newly created table, and inserted the resulting records into a new table with the same structure as the master table. Again it worked well, and the whole load completed in just a few seconds... :)
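
The pattern was roughly the sketch below. The names master_supplier and master_supplier_new are assumptions for illustration; load_attr is the combined attribute table built by the query above.

     -- Sketch of the join-and-insert that replaced the slow UPDATE (assumed table names).
     -- Rows with no matching attribute record simply carry NULLs for the new columns.
     INSERT INTO master_supplier_new (ven_id, ven_si_id, org_id, supplier_name,
                                      amount, amount_date, payment)
     SELECT m.ven_id, m.ven_si_id, m.org_id, m.supplier_name,
            a.amount, a.amount_date, a.payment
       FROM master_supplier m
       LEFT OUTER JOIN load_attr a
         ON  a.ven_id    = m.ven_id
         AND a.ven_si_id = m.ven_si_id
         AND a.org_id    = m.org_id;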

So the bottom line is: never do an UPDATE on millions of records in a relational database such as Oracle or SQL Server. Instead, try to figure out how to perform that operation with an INSERT statement.

Create cursors or stored procedures only if the operation cannot be performed with plain SQL statements. This way you can optimize performance and reduce the overall time to a great extent.
 
 
Please let me know about any new scenarios or views.

Happy Analysis.......
Rohit...