Wednesday, October 31, 2012

Lookup Performance Tuning

  1. If the number of records coming from the source is much smaller than the number of records in the lookup table, consider using an uncached lookup, because fewer source records mean fewer database calls.
  2. If the lookup table is small (fewer than 2 million rows), it is generally a good idea to cache the lookup table.
Effect of suppressing default ORDER BY clause in Lookup
When Informatica fires the lookup query to the database, it appends an ORDER BY clause at the end of the query. However, you can suppress this default behaviour by appending a comment ("--") at the end of the override query. You should consider the following factors regarding the default Informatica lookup ORDER BY:

1. The ORDER BY clause may create additional processing overhead on the database. When the Informatica lookup fires its SELECT query to the database, it includes all the lookup ports in both the SELECT list and the ORDER BY clause. Hence, ensure there is no unnecessary port in the lookup; otherwise those ports will be included in the Informatica-generated SELECT statement and in the ORDER BY clause, creating additional overhead on the database.

2. Also, do not completely suppress the ORDER BY with '--' (see the note below for why).

3. If you completely suppress the ORDER BY from the lookup query, Informatica generates a warning in the session log.

This warning requires a little explanation. If you suppress the ORDER BY, the data fetched from the database will not be sorted, so Informatica has to sort this data again in its own cache space in order to build the index cache (yes, the index cache cannot be built on unsorted data). This means the data has to be sorted in any case, either in the database or inside Informatica. In our experience, it is better to get the data sorted at the database level than at the Informatica level.

What you can do instead is suppress the default ORDER BY with '--' and write your own ORDER BY on only those columns (lookup ports) that are used in the join. Let's say you are performing a lookup on the "Employee" table based on emp_id and returning the name, salary, department and date_of_joining columns. If you don't do anything, the default lookup query that Informatica sends to the database will be:
SELECT emp_id, name, salary, department, date_of_joining
FROM Employee
ORDER BY emp_id, name, salary, department, date_of_joining

But you know that you do not need to perform the ORDER BY on all those columns, since the joining column is emp_id and the index cache will be built on that column only. Hence you can put your own ORDER BY in the SQL override and suppress the default ORDER BY, like below:

SELECT emp_id, name, salary, department, date_of_joining
FROM Employee
ORDER BY emp_id --

I recommend you always perform the ORDER BY at least on the columns used in the lookup join condition (i.e. the lookup ports used in the join).

Remember, Informatica creates its own index on the joining columns; therefore, if the joining columns are already ordered, Informatica requires less space (and time) to build that index.

Use of Persistent Lookup Cache

A persistent cache lookup can be the Holy Grail of cached lookups, if you know how to use it. Generally speaking, if the data in the underlying lookup table does not change between consecutive session runs, you may use a persistent lookup cache.

In short, to use a persistent cache, you have to check the following options in Lookup Transformation properties tab:
§ Lookup caching enabled
§ Lookup cache persistent

Once you do that, the cache file created by the Informatica session will *NOT* be deleted from the cache directory, and the same cache file will be used in all consecutive runs. The advantage is that you need not spend time building the same cache every time the session executes. However, if the source data for the lookup changes in the meantime, you must refresh the cache by either of the following two options:
§ Delete the cache file manually from the Cache directory Or,
§ Check the “Re-cache from lookup source” option in the Properties tab of the lookup

If the persistent cache lookup is reusable, you can share the same lookup across multiple mappings without rebuilding the cache in each one of them. You can also have one additional mapping with the re-cache option enabled for this lookup, which you can run whenever you need to refresh the cache file.

Disadvantages of Persistent Cache Lookup
Although a persistent cache can give you considerable performance and other advantages, it comes with some hazards.

§ Recovering a session that failed midway may not be possible. Consider this: you have a typical update-else-insert logic on a target table, over which you also have a persistent cache lookup. This lookup on the target is used to determine whether a record coming from the source is already present in the target (update) or not (insert). Suppose the session fails after inserting a few records into the target. If this were a normal lookup, you could simply restart the session after fixing the cause of the error. But with a persistent cache lookup you cannot restart the job directly, because the lookup cache will not contain the records inserted in the first run; the lookup will fail to detect that those records already exist in the target, and your mapping will try to insert them again.

§ A persistent cache gives you a performance boost by saving the time needed to build the lookup during the session run, but it still takes the same time as a normal lookup to use the lookup data at session runtime. It is often observed that a persistent cache shared across many sessions creates heavy disk-level I/O contention when the lookup is actually being used. Monitor disk I/O performance using "iostat" or "vmstat" (UNIX) if you see a large transformation bottleneck, without any apparent reason, in sessions using a persistent cache lookup.

Effect of Choosing Connected or Unconnected Lookup


If all other factors remain the same, the choice between a connected and an unconnected lookup can impact performance in one particular case. Since a connected lookup sits in the data-flow pipeline, it is called for every record coming from the source, irrespective of whether the data returned by the lookup is used later in the mapping at all. With an unconnected lookup, you can control this by calling the lookup only when it is really needed.

To illustrate the difference, consider the following example. Suppose you are loading sales data and, if the sale was made in one specific geography (say North America), you want to populate one additional column in your target table with the GST amount (which depends on the American state in which the sale was made). Suppose one record in the source looks like this:

SalesID = 100067345
SalesLocation = CA
Geography = North America
SalesAmount = 100.54
ProductPartNo = PX937
With a connected lookup, the lookup will be called for all records, irrespective of whether the sales geography is North America or not. With an unconnected lookup, you can write something like this inside an Expression transformation, which ensures the lookup is called only when the sales geography is North America:

IIF (SALES_GEOGRAPHY = 'North America',
:LKP.LKP_GET_GST_PCT_PER_STATE(SALES_LOCATION), 0)

Suppose that out of 1,000,000 records, the above condition matches only 100,000 records (10%). Then you avoid calling the lookup unnecessarily in 90% of the cases.

When to Set the Pre-Build Lookup Cache Option (and When Not To)


Informatica may build the lookup cache even before the first row reaches the Lookup transformation (the pre-build scenario), or it may build the lookup cache only when the first row reaches the Lookup transformation (the on-demand scenario). You can let Informatica decide what it wants to do (pre-build or on demand), or you may explicitly tell Informatica what to do. All you need to do is set the "Pre-Build Lookup Cache" option to "Always allowed" under the lookup transformation's Properties tab to force Informatica to build the lookup cache beforehand.

So what's the difference, and how does it impact performance? Generally speaking, pre-building all the lookup caches may give you an occasional performance boost: once the source qualifier starts pulling data, the processing of that data no longer has to wait for the lookup caches to finish building.

The performance boost from pre-building lookup caches is most apparent when the source query itself takes a considerable amount of time to return rows, so that time can be used simultaneously to pre-build all the lookup caches instead of waiting for the source to return results and then building the other lookup caches on demand.

I said the performance benefit is occasional because there are a number of other factors to consider before setting Pre-Build Lookup Cache to "Always allowed".

· When Informatica pre-builds the lookup caches, how many caches it can build concurrently depends on various factors, including the "Additional Concurrent Pipelines for Lookup Cache Creation" property at the session level under the "Config Object" tab. I discuss this in more detail below.

· One should also check whether the database against which Informatica fires the lookup queries has enough capacity to process those queries simultaneously without hampering individual query performance. This depends on various database-level properties, especially TEMP space availability, session-level connection restrictions and the degree of parallelism (DOP); a couple of quick checks are sketched after this list.

· There can be scenarios where not all of the lookups even need to be called (especially if a lookup is unconnected and called based on a condition; see the section "Effect of Choosing Connected or Unconnected Lookup" above). There is really no point in pre-building those lookups; they should be built on demand only. This is also why the "Pre-Build Lookup Cache" option is available at the individual lookup level and not as a generic session-level option.
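If the lookup database happens to be Oracle (as it is elsewhere on this blog), a couple of simple queries can help you or your DBA verify that capacity before forcing every cache to pre-build. This is only a hedged sketch; the dictionary views need the appropriate privileges and your environment may expose different ones:

SELECT tablespace_name,
       free_space / 1024 / 1024 AS free_mb    -- free TEMP space in MB
  FROM dba_temp_free_space;

SELECT username,
       COUNT(*) AS session_count               -- sessions per user, to spot connection limits
  FROM v$session
 WHERE username IS NOT NULL
 GROUP BY username;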

Use of the Additional Concurrent Pipelines for Lookup Cache Creation Option


To me, this option, available at the session level under the "Config Object" tab, is more of a limiting option than a performance-augmenting one. It determines how many additional concurrent pipelines the Integration Service can use while building lookup caches simultaneously. I prefer setting it to "Auto".

If you set it to Auto, Informatica decides the value at run time while building the caches concurrently. Below I describe how Informatica typically behaves when you set this to "Auto". Suppose one mapping has 4 lookup transformations that Informatica decides to build concurrently. It will first give a message like the one below in the session log:

Enabled using [2 (auto)] additional concurrent pipelines to build lookup caches. 
(Session likely will build or refresh [4] lookup caches; [1] on-demand only)

The above message means that Informatica has already started 2 additional pipelines to handle 2 of the lookups, so 2 more lookups are yet to be built. When Informatica starts building those remaining lookups, depending on whether the pipelines already allocated for the first 2 lookups are free by that time, it will generate one of the two messages below:

Starting additional concurrent pipeline to build the lookup cache 
needed by Lookup transformation [LKP_XXX_XXX_XXX]
Or

Lookup cache of [LKP_XXX_XXX_XXX], previously started by an additional 
concurrent pipeline, is now needed on demand.

The first message appears if:
§ Informatica decides that it can "afford" one more additional pipeline, and
§ no additional pipeline allocated previously for other lookups is free at the time this lookup needs to be built.

Now that we understand how the additional pipelines behave, let's see how we can use this property to improve session performance. As I described before, the best thing to do is set it to "Auto", which will use the maximum possible number of pipelines whenever lookups need to be built concurrently. However, if you want to limit the concurrency, you can specify a number; no additional pipelines will be created once that number is reached. If another pipeline is required, Informatica will wait for a current pipeline to free up so it can be reused. If you set it to 0 (zero), all lookups will be built serially.

 

Workflow Scheduling

Schedule a workflow to run continuously:-

You can schedule a workflow to run continuously. A continuous workflow starts as soon as the Integration Service initializes. If you schedule a real-time session to run as a continuous workflow, the Integration Service starts the next run of the workflow as soon as it finishes the first. When the workflow stops, it restarts immediately.

Alternatively, for a normal scenario you can create a conditional-continuous workflow as described below.
Suppose wf_Bus contains the business session that we want to run continuously until a certain condition is met before it stops, such as the presence of a file or a particular value of a workflow variable.

So modify the workflow as a Start task followed by a Decision task that evaluates a condition to TRUE or FALSE; based on this condition the workflow will run or stop. Next, use a link condition to connect the business session for $Decision.Condition = TRUE. For the other branch, use a Command task for $Decision.Condition = FALSE. In the Command task, create a command to call a dummy workflow using pmcmd,
e.g. "C:\Informatica\PowerCenter8.6.0\server\bin\pmcmd.exe" startworkflow -sv IS_info_repo8x -d Domain_hp -u info_repo8x -p info_repo8x -f WorkFolder wf_dummy

Next, create another workflow, name it wf_dummy, and again place a Command task after the Start task. Within the Command task, put the pmcmd command as:
"C:\Informatica\PowerCenter8.6.0\server\bin\pmcmd.exe" startworkflow -sv IS_info_repo8x -d Domain_sauravhp -u info_repo8x -p info_repo8x -f WorkFolder wf_Bus

In this manner you can run workflows continuously; the basic concept is to use two workflows and make them call each other.

A second way to do the above task
You can also try this approach.
Create a shell script; the content would be:
-----------------------------
#!/bin/sh

# start with both flags set to TRUE so the loop runs at least once
variable_V1=TRUE
variable_V2=TRUE

while [ "$variable_V1" = "TRUE" ] && [ "$variable_V2" = "TRUE" ]
do
    : # 1) pmcmd command to trigger the WF
      # 2) condition which will decide the value of "variable_V1"
      # 3) *position* here you will set the value of "variable_V2" based on the
      #    status of the execution of your WF which ran previously
done
exit 0
------------------------------
Every time the condition is true, it will go inside the loop and trigger the WF; when the condition is no longer met, it will come out of the loop.
This is just an example; you can modify the script according to your needs. If you follow this approach, you don't have to schedule the WF using the Informatica scheduler. Execute this script and it will keep executing the WF until the condition is no longer met.

Additional:
Suppose you don't want to trigger the WF if the previous run failed for some reason. For this you can use the UNIX exit status "$?", or you can use "gettaskdetails" with "pmcmd", and include that status in the condition of the while loop.
NOTE: this part of the code goes at the position marked *position* above.

Sunday, October 21, 2012

MDM - Siperian Info


Master Your Data, Master Your Business

Informatica MDM is designed to make your master data management (MDM) initiative successful. Its flexible multidomain MDM technology and unique trust framework enable rapid deployments and extensibility across the enterprise. With a proven track record of success in tackling any data-driven business problem in any industry, Informatica MDM maximizes business value by delivering comprehensive, unified, open, and economical MDM on a single platform.

  1. Maximize account penetration in MDM by improving cross-sell and up-sell offers with a single or an extended customer view.

  2. Easily comply with regulatory initiatives in MDM by improving the tracking, transparency, and auditing of financial data.

  3. Make decisions around MDM more confidently across the global enterprise through increased accuracy, reliability, and timeliness of business-critical data.
The new version of Business Data Director significantly improves the overall productivity of business users by allowing them to customize the look and feel of the application to suit their preferences. Other new enhancements include Business Data Director's ability to integrate with external applications and provide "custom action" capabilities, which allow users to extend the data to other applications. Using advanced data federation features, line of business users can see critical transactional data within Business Data Director that enables them to perform more complex, real-time business processes such as order to cash and cross-sell and up-sell of additional products to existing customers.

The different components of the Informatica MDM Siperian Hub are listed below.

  1. MRM
  2. BDD
  3. HR
  4. Console Manager
  5. SIF
Start-up of the Informatica Siperian Hub:-

  1. Make sure the Oracle services are started on your virtual/hosted machine.
  2. Open a command window and enter the following commands to start the JBoss application server:
  3. cd C:\Jboss*\bin
  4. run.bat -b 0.0.0.0
  5. Open the Informatica MDM Multidomain Edition console by opening Internet Explorer, going to http://infa-server:8080/cmx and clicking the Launch button on the page.
  6. Log in with your user name and password.
  7. Select the target database [TR_LAB].
Define Data Model

  1. Create the target base object table. [E.g., the ADDRESS table]
  2. Enable history while creating the staging table.
  3. Define the relationship of the base object with other base objects. [It's like defining the schema for the base object.]
  4. There can be multiple source systems for the same base object [like SALES and CRM].
  5. Define the target data model.
  6. The Metadata Manager option helps in importing a data model from outside the Informatica Hub. It can come from a 3rd-party vendor tool.
  7. The naming convention should be correct for all the landing, staging and target tables to maintain the metadata relationship.
  8. The "Contains Full Dataset" property relates to delta detection while loading the landing table.
WORKSHOP MENU [Configuration -> Metadata manager]

There are 2 types of schema in the MDM Hub:

  1. System schema: we do not create these; they get created at the time of installation of Siperian.
  2. ORS schema: these are the schemas we use.
Metadata Manager here is different from the metadata in PowerCenter.

To import any table from the Metadata Manager, we first need to validate the ORS schema [database]; after that you will be able to import tables from a different ORS.

Some Random Info about Siperian Process :-

  1. Define the trust level.
  2. Dynamic cell-level survivorship.
  3. IDD is how we consume the clean records from the MDM Hub or create message queues.
  4. We can use stored procedures or ETL to populate the landing table.
  5. The ETL must truncate the landing table before loading it.
  6. Duplicate records in the landing table will be rejected during the staging process (REJECT/RAW tables).
  7. There should be provision for auditing data lineage.
  8. Base table = normal columns + system-generated columns.
  9. Data model elements.
  10. Provides functionality for matching and merging data. There are different tables created in the Hub which help in merging and matching records, e.g. the OPL, MTCH, MGE, HREF and XREF tables.
  11. Built-in lineage (cross-reference tables) and history.
  12. Supports trust and validation rules.
  13. Supports matching and merging.
  14. ROWID_OBJECT is automatically generated and managed. [Port of base table]
  15. Each base object has an underlying cross-reference (XREF) table, created and managed by the system. Its purpose is to cross-reference with other base tables.
  16. The XREF is the link between the source key and the base object key.
  17. All the system tables start with C_REPO*.
About Staging Table

  1. An intermediate work table.
  2. Belongs to a specific source system.
  3. Staging table columns are a selected subset of the user-defined columns in the base object.
  4. A staging table automatically has the following columns:
    • Last_update_Date
    • Source key
    • SRC_ROWID
  5. A staging table has only one source and one target.
  6. A staging table has a reject table. It can also have a history table.
Relationship type: you can create only virtual relationships. They can be one-to-many or many-to-many. (For many-to-many relationships, you need to create an intersection point.)

Lookup: when loading happens from staging to the target base object table, the lookup process also starts in parallel. In other words, a lookup in Siperian is the translation of a source's primary or foreign key value into the corresponding base object value, and it specifies the relationship between two different base objects. A Siperian lookup is different from an Informatica PowerCenter lookup. The Hub automatically handles lookups for the primary key on base object records by using the base object's XREF; if no value is found, a new ROWID_OBJECT value is generated.

ROWID_OBJECT: the system generates its own row ID, connects it to the primary key coming from the source system and stores this relationship in the XREF table (which can hold multiple records from different source systems with the same ROWID_OBJECT; logically those records represent the same entity).
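As a rough illustration of that key translation, the query below shows the idea in plain SQL. The names used here (C_PARTY_XREF, ROWID_SYSTEM, PKEY_SRC_OBJECT) are only indicative of a typical Hub XREF structure, not taken from this lab; check your own ORS for the actual table and column names.

SELECT x.rowid_object
  FROM c_party_xref x                          -- XREF table of a hypothetical PARTY base object
 WHERE x.rowid_system    = 'SALES'             -- the source system
   AND x.pkey_src_object = :src_primary_key;   -- the source system's primary key
-- If no row comes back, the Hub generates a new ROWID_OBJECT during the load.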

About Siperian:- Siperian acquired  DELLOF, and then Informatica acquired Siperian.
 
 
Rohit..........

SQL Loader Steps

Oracle's SQL*Loader utility is a way to insert/append external data, such as CSV records, into an Oracle table. The utility is invoked from a DOS window.
 
Below are the steps to perform the insert into Oracle tables using SQL*Loader.
 
1) First create the CSV file using an Excel sheet; save the Excel file as a CSV file.
2) Keep that CSV file in the following path: C:\DATA
3) Create the control file for this load, like the sample below:
LOAD DATA
INFILE 'c:\data\Distinct_KEY.csv'
BADFILE 'Distinct_KEY.bad'
APPEND
INTO TABLE DISTINCT_KEY
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' TRAILING NULLCOLS
(
KEYNUMBER,
FLAG
)
4) Go to the command prompt at the following path: C:\DATA
5) Run the command below and then wait and watch.
 
sqlldr USER_ID/PASSWORD@INSTANCE control=c:\data\Distinct_KEY.ctl errors=300000
 
 
Rohit...

Data Analysis Case Study


RJ Corporate has initiated a program to drive a standard process to activate/inactivate irrelevant data in four source systems: XXX, YYY, ZZZ and DDD. The data in these systems would further be profiled and cleaned to maintain a single source of truth for all the supplier data received from these source applications.

The current source applications have limitations in providing the necessary platform to maintain a clean, single place for true supplier data. In addition, the existing data quality issues in the current applications require that a strong DQ solution be built to fulfil the program objectives.

This requirement specification provides the data quality rules that need to be created for maintaining clean supplier data.

Process:-

We started with source system ZZZ, which is one of the most up-to-date and important source systems, because activation and deactivation of records from the other systems depends on ZZZ.

I am the sole owner of all data analyst activity for this project; there are others working on it as well, but on more routine activities.

First I decided the target table structure for ZZZ, because I was getting all 13 lakh records from a report generated by GE. I created a stored procedure to load the target table DQ_SI from the table where we dumped all the records from the SI-generated reports, and applied some validation rules, cleansing rules and rejection criteria, so that after loading the main DQ_SI table we could have a good look at the SI_EXCEPTION_TABLE and decide what to do with those records. We looked into the SI_EXCEPTION table, decided some further strategy and moved some of the exception records into the main table after some cleansing and transformation. We left the remaining records in the exception table, because they could not pass any validation criteria.
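The split between the main table and the exception table is simple in spirit. Here is a hedged sketch of the idea; the raw dump table name (SI_RAW) and the rule columns (SUPPLIER_NAME, SUPPLIER_KEY) are made up for illustration, since the actual rules and structures are not listed in this post:

-- Assumes all three tables share the same column structure.
-- Route rows failing a basic rule to the exception table...
INSERT INTO si_exception_table
SELECT r.*
  FROM si_raw r
 WHERE r.supplier_name IS NULL
    OR r.supplier_key IS NULL;

-- ...and load the rest into the main DQ table.
INSERT INTO dq_si
SELECT r.*
  FROM si_raw r
 WHERE r.supplier_name IS NOT NULL
   AND r.supplier_key IS NOT NULL;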

We did the same thing for XXX, YYY and DDD.

Now we are working on the inter-comparison between these four systems for activation and inactivation of records across all the source systems.

Tool :- Oracle 11g.

Analysis process: the final objective of this analysis is to find those records in XXX which can be deactivated based on the SI status flag. But there was no single standard column on which we could base this activity, so we started a series of analyses based on different combinations of fields between YYY and XXX: their likely key combinations, then their supplier names, and their address, city and country combinations.

Below is the process step by step.

  1. First we added an extra column named MATCH in the XXX system just to find out how many records match between YYY and XXX based on different key column comparisons. [There were 10 scenarios like that in total.] We added this extra column because, if we had simply joined the tables between YYY and XXX, we would not have got the exact number of records (a hedged sketch of this flagging is shown after this list).

  2. After doing this, we extracted all the records for each scenario, based on the join condition, into an Excel sheet.

  3. Then we did some more comparisons to find the exact matches, which was purely a functional requirement; for example, for some records the KEY matched but the supplier name did not. So we did extra comparisons in the sheet to find more matching records.

  4. After this step, we took all the KEY column values from all the different scenario sheets into a single sheet, which gave us the final set of KEY values that we were going to deactivate in the XXX system.
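As a rough idea of how that MATCH flag could be populated for one scenario, here is a hedged sketch; the table and column names (DQ_XXX, DQ_YYY, SUPPLIER_KEY, MATCH) are illustrative, not the real ones:

-- Flag XXX records whose key also exists in YYY (scenario 1).
-- EXISTS avoids the row inflation a plain join would give
-- when YYY holds duplicate keys.
UPDATE dq_xxx x
   SET x.match = 'SCENARIO_1'
 WHERE x.match IS NULL
   AND EXISTS (SELECT 1
                 FROM dq_yyy y
                WHERE y.supplier_key = x.supplier_key);

-- Exact number of matching XXX records for this scenario
SELECT COUNT(*) FROM dq_xxx WHERE match = 'SCENARIO_1';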
Some scenarios:-

1) If we are dealing with millions of records in Oracle, it is really tough to update them in a small amount of time, because an UPDATE operation is always time-consuming. For our 3rd system, ZZZ, we needed to include 6 extra columns in the main table, which came from 6 different tables. Since this requirement came at a later point, there seemed to be no alternative other than UPDATE.

Solution: instead of updating all 6 columns from the 6 different tables into the main table, we first combined the 6 tables into a single table using their key. That combination was a UNION of all six tables, including the key columns and the extra required columns. We did a GROUP BY over the union of the six tables and inserted the result into the new table, and it took only 2 seconds. Check out the query below:


Insert into load_attr (ven_id, ven_si_id, org_id, amount, amount_date, payment)
Select ven_id, ven_si_id, org_id, max(amount), max(amount_date), max(payment)
From (
      Select ven_id, ven_si_id, org_id,
             amount,
             '' as amount_date,
             '' as payment
        From Invoice_1
      UNION ALL
      Select ven_id, ven_si_id, org_id,
             '' as amount,
             amount_date,
             '' as payment
        From Invoice_2
      UNION ALL
      Select ven_id, ven_si_id, org_id,
             '' as amount,
             '' as amount_date,
             payment
        From Invoice_3
     )
Group by ven_id, ven_si_id, org_id


Then the problem was to update the main master table from this newly created table. Since we had millions of records in the master table, we had the same problem with UPDATE again. So instead we did a left outer join between the master table and the newly created table and inserted the resulting records into a new table with the same structure as the master table, and again it worked well; the whole load completed in just a few seconds.
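A minimal sketch of that left-outer-join load is below, reusing load_attr from the query above; the master table name (master_supplier) and its rebuilt copy (master_supplier_new) are hypothetical, and the real tables have many more columns:

-- master_supplier is the existing master table,
-- master_supplier_new is an empty copy of its structure.
-- Rows with no match in load_attr keep NULLs for the new attributes,
-- which is exactly what the left outer join guarantees.
INSERT INTO master_supplier_new (ven_id, ven_si_id, org_id, amount, amount_date, payment)
SELECT m.ven_id,
       m.ven_si_id,
       m.org_id,
       a.amount,
       a.amount_date,
       a.payment
  FROM master_supplier m
  LEFT OUTER JOIN load_attr a
    ON  a.ven_id    = m.ven_id
    AND a.ven_si_id = m.ven_si_id
    AND a.org_id    = m.org_id;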

So the bottom line is: never run an UPDATE over millions of records in a relational database such as Oracle or SQL Server. Instead, try to figure out how to perform the operation with an INSERT statement.

Create cursors or stored procedures only if the operation cannot be performed with normal SQL queries. This way you can optimize performance and reduce the overall time to a great extent.
 
 
Please let me know of any new scenarios and views.
 
 
Happy Analysis.......
Rohit...