Monday, August 20, 2012

Way of Profiling Analysis

Basically, there are three kinds of analysis:
1. Business Rule
2. Functional Dependency
3. Column Set Analysis

The sections below describe how a typical profiling tool presents the results of this analysis, tab by tab.


Aggregation
The Aggregation tab displays all the essential measures for each column, such as minimum and maximum values, the number of distinct values, and the number of null values. Some measures are available only for specific data types; these include the average, median, and standard deviation measures. Information can be viewed from either the Tabular or Graphical subtab. A sample SQL query for these measures is sketched after the table.


Measurement: Description

Column: Name of the column, within the profiled data object, for which data profiling determined the aggregation measures
Minimum: Minimum value, based on the inherent database ordering, for the column
Maximum: Maximum value, based on the inherent database ordering, for the column
# Distinct: Total number of distinct values for the column
% Distinct: Percentage of distinct values in the column over the entire row set
Not Null: Indicates whether a NOT NULL constraint is defined in the database for the column
Recommended NOT NULL: Indicates whether the data profiling results recommend that the column disallow null values. A value of Yes represents a recommendation that this column should not allow null values
# Nulls: Total number of null values for the column
% Nulls: Percentage of null values, for the column, over the entire row set
Six Sigma: Six Sigma value based on the number of null values (defects) relative to the total number of rows in the table (opportunities)
Average: Average value, for the column, over the entire row set
Median: Median value, for the column, over the entire row set
Std Dev: Standard deviation for the column
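Most of these measures can be reproduced without a profiling tool. A minimal SQL sketch, assuming a hypothetical CUSTOMER table and a PHONE column (the average, median and standard deviation functions apply only to numeric columns and vary a little by database):

-- Aggregation-style profile for a single column (table and column names are examples)
SELECT
    COUNT(*)                                                           AS total_rows,
    MIN(phone)                                                         AS minimum_value,
    MAX(phone)                                                         AS maximum_value,
    COUNT(DISTINCT phone)                                              AS distinct_values,
    100.0 * COUNT(DISTINCT phone) / COUNT(*)                           AS pct_distinct,
    SUM(CASE WHEN phone IS NULL THEN 1 ELSE 0 END)                     AS null_values,
    100.0 * SUM(CASE WHEN phone IS NULL THEN 1 ELSE 0 END) / COUNT(*)  AS pct_nulls
FROM customer;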



Data Type

The Data Type tab provides profiling results about data types. This includes metrics such as length for character data types and the precision and scale for numeric data types. For each data type that is discovered, it is compared to the dominant data type found for the entire attribute, and the percentage of rows that follow the dominant measure is listed. One example of data type profiling is finding a column defined as VARCHAR that stores only numeric values; changing the data type of the column to NUMBER would make storage and processing more efficient. A sample check for exactly this case is sketched after the table.
  
Measurement: Description

Column: Name of the column, within the data object, for which data type analysis was performed
Documented Data Type: Data type of the column in the source object
Dominant Data Type: The dominant (most frequent) data type determined by analyzing the column values
% Dominant Data Type: Percentage of the total number of rows where the column value has the dominant data type
Documented Length: Length of the data type in the source object
Minimum Length: Minimum length of the data stored in the column
Maximum Length: Maximum length of the data stored in the column
Dominant Length: The dominant (most frequent) length determined by analyzing the column values
% Dominant Length: Percentage of the total number of rows where the column value has the dominant length
Documented Precision: Precision of the data type in the source object
Minimum Precision: Minimum precision for the column in the source object
Maximum Precision: Maximum precision for the column in the source object
Dominant Precision: The dominant (most frequent) precision determined by analyzing the column values
% Dominant Precision: Percentage of the total number of rows where the column value has the dominant precision
Documented Scale: Scale specified for the data type in the source object
Minimum Scale: Minimum scale of the data type in the source object
Maximum Scale: Maximum scale of the data type in the source object
Dominant Scale: The dominant (most frequent) scale determined by analyzing the column values
% Dominant Scale: Percentage of the total number of rows where the column value has the dominant scale
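The VARCHAR-storing-only-numbers situation is easy to confirm by hand. A minimal sketch, assuming a hypothetical CUSTOMER table with a ZIP_CODE column declared as VARCHAR (REGEXP_LIKE is Oracle-style; other databases have an equivalent regular-expression predicate):

-- How many ZIP_CODE values are purely numeric, and what lengths occur?
SELECT
    COUNT(*)                                                               AS non_null_rows,
    SUM(CASE WHEN REGEXP_LIKE(zip_code, '^[0-9]+$') THEN 1 ELSE 0 END)     AS numeric_rows,
    MIN(LENGTH(zip_code))                                                  AS min_length,
    MAX(LENGTH(zip_code))                                                  AS max_length
FROM customer
WHERE zip_code IS NOT NULL;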

Domain
The Domain tab displays results about the possible set of values that exist in a certain attribute. Information can be viewed from either the Tabular or Graphical subtabs.

The process of discovering a domain on a column involves two phases. First, the distinct values in the column are used to determine whether the column might be defined by a domain; typically, a domain has only a few distinct values. Then, if a potential domain is identified, the count of each distinct value is used to determine whether that value is compliant with the domain. The properties that control the threshold for both phases of domain discovery can be set in the Property Inspector. If you find a result you want to know more about, drill down and use the Data Drill panel to view details about the cause of the result. For example, a domain of four values was found for the column REGION_ID: 3, 2, 4, and 1. To see which records contributed to this finding, select the REGION_ID row and view the details in the Data Drill panel. A sample value-frequency query is sketched after the results table.

Domain Results:-

Measurement: Description

Column: Name of the column for which domain discovery was performed
Found Domain: The discovered domain values
% Compliant: The percentage of all column values that are compliant with the discovered domain values
Six Sigma: The Six Sigma value for the domain results
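Domain discovery is, at heart, a value-frequency count. A minimal sketch, assuming a hypothetical ORDERS table with the REGION_ID column from the example above:

-- List candidate domain values and how much of the data each one covers
SELECT
    region_id,
    COUNT(*)                                  AS occurrences,
    100.0 * COUNT(*) / SUM(COUNT(*)) OVER ()  AS pct_of_rows
FROM orders
GROUP BY region_id
ORDER BY occurrences DESC;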

Pattern
The Pattern tab displays information discovered about patterns within the attribute. Pattern discovery is the profiler's attempt to generate regular expressions for the data it discovered for a specific attribute. Non-English characters are not supported in the pattern discovery process. A sample compliance query is sketched after the results table.

Pattern Results:-

Measurement: Description

Column: Name of the column for which pattern results were discovered
Dominant Character Pattern: The most frequently discovered character pattern, or consensus pattern
% Compliant: The percentage of rows whose data pattern agrees with the dominant character pattern
Dominant Word Pattern: The most frequently discovered word pattern, or consensus pattern
% Compliant: The percentage of rows whose data pattern agrees with the dominant word pattern
Common Format: Name, Address, Date, Boolean, Social Security Number, E-mail, URL. This is the profiler's attempt to add semantic understanding to the data it sees; based on patterns and some other techniques, it tries to determine which domain bucket a certain attribute's data belongs to
% Compliant: The percentage of rows whose data pattern agrees with the consensus common format pattern
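Once a dominant pattern is known, compliance can be measured with a regular expression. A minimal sketch, assuming a hypothetical CUSTOMER table with a PHONE column whose dominant character pattern is 999-999-9999 (Oracle-style REGEXP_LIKE):

-- Percentage of non-null phone values that match the dominant pattern 999-999-9999
SELECT
    100.0 * SUM(CASE WHEN REGEXP_LIKE(phone, '^[0-9]{3}-[0-9]{3}-[0-9]{4}$')
                     THEN 1 ELSE 0 END) / COUNT(*) AS pct_compliant
FROM customer
WHERE phone IS NOT NULL;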

Unique Key
The Unique Key tab provides information about the existing unique keys that were documented in the data dictionary, and possible unique keys or key combinations that were detected by the data profiling operation. The uniqueness % is shown for each. The unique keys that have No in the Documented column are the ones that are discovered by data profiling.

For example, suppose a phone number is unique in 98% of the records. It can then be a unique key candidate, and you can cleanse the non-compliant records. You can also use the drill-down feature to view the cause of the duplicate phone numbers in the Data Drill panel. The following table describes the various measurement results available in the Unique Key tab, and a sample uniqueness query is sketched after it.

Measurement: Description

Unique Key: The discovered unique key
Documented: Indicates whether a unique key on the column exists in the data dictionary. A value of Yes indicates that a unique key exists in the data dictionary; a value of No indicates that the unique key was discovered by data profiling
Discovered: From analyzing the column values, data profiling determines whether a unique key should be created on the column listed in Local Attribute(s)
Local Attribute(s): The name of the column in the data object that was profiled
# Unique: The number of rows, in the source object, in which the attribute represented by Local Attribute is unique
% Unique: The percentage of rows, in the source object, for which the attribute represented by Local Attribute is unique
Six Sigma: Six Sigma value based on the number of null values (defects) relative to the total number of rows in the table (opportunities)
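The 98%-unique phone number example can be checked directly. A minimal sketch, assuming a hypothetical CUSTOMER table with a PHONE column:

-- Percentage of rows whose phone value is not shared with any other row
SELECT 100.0 * SUM(CASE WHEN cnt = 1 THEN 1 ELSE 0 END) / SUM(cnt) AS pct_unique
FROM (
    SELECT phone, COUNT(*) AS cnt
    FROM customer
    WHERE phone IS NOT NULL
    GROUP BY phone
) phone_counts;

-- Drill down: which phone numbers are duplicated, and how often?
SELECT phone, COUNT(*) AS occurrences
FROM customer
WHERE phone IS NOT NULL
GROUP BY phone
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;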



Functional Dependency
The Functional Dependency tab displays information about the attribute or attributes that seem to depend on or determine other attributes. Information can be viewed from either the Tabular or Graphical subtabs. You can use the Show list to change the focus of the report. Unique keys defined in the database are not discovered as functional dependencies during data profiling.

Functional Dependency Results:-

Measurement: Description

Determinant: Name of the attribute that is found to determine the attribute listed under Dependent
Dependent: Name of the attribute that is found to be determined by the value of another attribute
# Defects: Number of values in the Determinant attribute that were not determined by the Dependent attribute
% Compliant: Percentage of values that are compliant with the discovered dependency
Six Sigma: Six Sigma value for the dependency
Type: Type of functional dependency. Possible values are unidirectional and bidirectional
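A candidate dependency, say ZIP code determines city, can be tested by counting determinant values that map to more than one dependent value. A minimal sketch, assuming a hypothetical CUSTOMER table with ZIP_CODE and CITY columns:

-- ZIP codes that violate the candidate dependency ZIP_CODE -> CITY
SELECT zip_code, COUNT(DISTINCT city) AS distinct_cities
FROM customer
WHERE zip_code IS NOT NULL
GROUP BY zip_code
HAVING COUNT(DISTINCT city) > 1
ORDER BY distinct_cities DESC;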


Referential
The Referential tab displays information about foreign keys that were documented in the data dictionary, as well as relationships discovered during profiling. For each relationship, you can see the level of compliance. Information can be viewed from both the Tabular and Graphical subtabs. In addition, two other subtabs are available only on the Referential tab: Joins and Redundant Columns. A sample orphan-detection query is sketched after the note on the Joins subtab.

Referential Results:-

Measurement: Description

Relationship: Name of the relationship
Type: Type of relationship. The possible values are Row Relationship and Foreign Key
Documented: Indicates whether a foreign key on the column exists in the data dictionary. Yes indicates that a foreign key exists in the data dictionary; No indicates that the foreign key was discovered by data profiling
Discovered: From analyzing the column values, data profiling determines whether a foreign key should be created on the column represented by Local Attribute(s)
Local Attribute(s): Name of the attribute in the source object
Remote Key: Name of the key in the referenced object to which the local attribute refers
Remote Attribute(s): Name of the attributes in the referenced object
Remote Relation: Name of the object that the source object references
Remote Module: Name of the module that contains the referenced object
Cardinality Range: Range of the cardinality between the two attributes. For example, suppose the EMP table contains five rows of employee data, with two employees each in departments 10 and 20 and one employee in department 30, while the DEPT table contains three rows of department data with deptno values 10, 20, and 30. Data profiling finds a row relationship between EMP and DEPT with a cardinality range of 1-2:1-1, because in EMP the number of rows per distinct deptno value ranges from 1 (deptno 30) to 2 (deptno 10 and 20), while in DEPT there is exactly one row for each distinct value
# Orphans: Number of orphan rows in the source object
% Compliant: Percentage of values that are compliant with the discovered relationship
Six Sigma: Six Sigma value based on the number of null values (defects) relative to the total number of rows in the table (opportunities)


Note:- The Joins subtab displays a join analysis on the reference selected in the Reference list.
The results show the relative size and exact counts of the three possible outcomes for referential relationships: joins, orphans, and childless objects.
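Orphan detection for a discovered relationship is a simple anti-join. A minimal sketch, using the EMP/DEPT example above and assuming EMP.DEPTNO refers to DEPT.DEPTNO:

-- EMP rows whose DEPTNO has no matching row in DEPT (orphans)
SELECT e.*
FROM emp e
LEFT JOIN dept d ON d.deptno = e.deptno
WHERE e.deptno IS NOT NULL
  AND d.deptno IS NULL;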

Saturday, August 18, 2012

Basic Data Cleansing Issue


Today I thought I would write something about the major issues we face while cleansing data. Profiling is just the starting point: it gives an overview of all the data discrepancies, but for cleansing we need to take a close look at the records and try to find the patterns. And to find the patterns, you need to think about real-life scenarios and what the problem could be. These records are gathered from different sources, and god only knows how many years it has taken to build up this much data.

So in today's article, I have gathered some info from different websites. The article below is not about how to cleanse the data. It is about how to identify bad records so that you can gather more meaningful stats on bad data and categorize it into different sections.

One more point I want to highlight: these records could belong to any domain, like customer, vendor, product, etc., and they come from different countries, and every country has its own way of formatting fields such as name, address, state, and ZIP code. The main fight is to bring everything onto the same page, in one consistent format.

Sometimes you will think that a field containing special symbols must be wrong data, but it is not. When finalizing the validation rules for each field, it ultimately depends on the client and how they want to see the records.

The article below is based on the customer domain.

Customer Data Challenges

Customer data quality has an additional set of challenges over and above the typical data quality issues.

Names, when spoken, written and especially when entered into a computer system, are subject to considerable variation and error. Although this error and variation can be reduced, it can't be entirely eliminated. Data such as dates of birth, ages, phone numbers and identity numbers are all used and all subject to unavoidable error and variation. Please refer to the reasons for data quality challenges along with the following points. Along with these challenges, TIPs are given to avoid some of these issues. In real life, you can only reduce these causes; you cannot eliminate them.

Life in general can go on with bad customer data

In fact, people can cope quite well with errors in their name. In the case of address variations, postal workers still manage to deliver the mail despite some horrendous errors. But how do computer systems cope with such error and variation, and specifically, how do they competently search and match this type of data? The answer is: if not given the appropriate attention, not all that well.

Impossible to define all rules for correct names and customer addresses

This is fundamentally because data integrity in computer systems relies on rules, constraints and reference data to enforce data accuracy. It is easy to see how error and variation creep in, because:
• Names and addresses easily disobey rules and constraints, without the person entering the data even being aware.

• There are no complete dictionaries for names and addresses for computer systems to use.

The vocabulary in use for people’s first names includes in excess of 2,500,000 words in the USA alone. Yet as much as 80% of the population may have names from as few as 500 words. Family names are just as unevenly distributed.

TIP- Many systems have an online search capability based on search-and-match algorithms: they apply customer data search and matching logic and identify possible matches for the name. This lets you check the list of possible matches and see whether the customer already exists.
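Where no packaged search-and-match capability exists, even a crude phonetic lookup helps the data entry person spot an existing customer before keying in a duplicate. A minimal sketch, assuming a hypothetical CUSTOMER table with FIRST_NAME, LAST_NAME and ZIP_CODE columns (SOUNDEX is available in Oracle and SQL Server; other databases offer similar phonetic functions):

-- Possible existing matches for a newly keyed-in customer 'Jon Smyth' in ZIP 10001
SELECT customer_id, first_name, last_name, zip_code
FROM customer
WHERE SOUNDEX(last_name)  = SOUNDEX('Smyth')
  AND SOUNDEX(first_name) = SOUNDEX('Jon')
  AND zip_code = '10001';

A real matching tool adds edit-distance and nickname handling on top of this, but even the simple version catches a surprising number of near-duplicates.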

Data Entry, Data Model and Processing Issues

Lack of granularity: Example- Because the data model did not account for separate fields, the address, telephone and ZIP are all clustered in a single field.

Lack of realism: Example- The entry form makes it mandatory to enter a middle name OR telephone OR fax number, which in the real world may not exist, leading to a wrong entry.

Data domain leniency: Example- City is not selected from a drop-down but is allowed in free form, leading to NY/New York/York.

Simple Data Entry Mistake

Example- Entering 'Fenmark' instead of 'Denmark'.

Right field, but wrong purpose

Example- Entering the street name in the district field, or the city in the state field.

Spill over data

Example- The name is too long and spills over into the address field, OR the address field spills over into the state field.

Variations in Abbreviations

Example- The data entry person OR the customer goes by his/her own flow of abbreviations:

Avenue/AV/AVN, or WDC/WASH DC.
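One common fix for abbreviation drift is a small reference table of known variants applied during cleansing. A minimal sketch, assuming hypothetical ADDR_ABBREV (VARIANT, STANDARD_FORM) and CUSTOMER_ADDRESS tables, with variants stored in upper case:

-- Replace known street-type variants (AV, AVN, AVE, ...) with one standard form
UPDATE customer_address ca
SET street_type = (SELECT a.standard_form
                   FROM addr_abbrev a
                   WHERE a.variant = UPPER(ca.street_type))
WHERE UPPER(ca.street_type) IN (SELECT variant FROM addr_abbrev);

The mapping table grows as new variants turn up in profiling, which keeps the cleansing rules in data rather than buried in code.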

Misspellings due to phonetic similarities

Example- This happens primarily while details are being recorded over a telephone conversation.

TIP- Most of the above can be handled with input controls. Controls like drop-downs for city, state, etc. can make the page heavy, as the data behind these drop-downs is part of the page. However, given today's network and broadband capacities, this does not have a noticeable impact on response time.
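The same controls can be enforced on the database side, so that bad values are rejected even when the front end misses them. A minimal sketch, assuming hypothetical CUSTOMER_ADDRESS and REF_STATE tables (DDL and the regular-expression function vary slightly by database, and the ZIP rule below only suits countries with 5-digit codes):

-- Only allow state codes that exist in the reference table
ALTER TABLE customer_address
    ADD CONSTRAINT fk_state FOREIGN KEY (state_code) REFERENCES ref_state (state_code);

-- Keep ZIP purely numeric, 5 digits (a country-specific rule, shown only as an example)
ALTER TABLE customer_address
    ADD CONSTRAINT chk_zip CHECK (REGEXP_LIKE(zip_code, '^[0-9]{5}$'));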

Total control and training is not possible

When talking about an organization’s internal systems, attempts are made to control this problem at its source through the management and training of the people who enter OR use the data. The other way to control it is through the design of the forms, screens and processes that capture the data. There is a lot of discussion, theory and good practical advice available on this subject.

Unfortunately, the person entering the data OR the data user may be totally unaware that they are entering data that contains some error and variation. Also, not all situations allow this level of control. In web-based systems, the person entering the data is usually outside the normal controls of the organization. If we hope to attract international customers OR interest, the web forms themselves need to allow entry of varying data formats without rigid constraints. Similarly, the quality of data files sourced from a third party is usually out of the control of the recipient organization.

Lack of Data Standards and Formats

This has been covered at length in Data Quality Assurance. When different formats are used by different systems, OR even by different modules within the same system, it leads to problems in matching, as the example formats below show; a simple normalization sketch follows them.

Field format: Field data

First Middle Last Title: Harrison R Ford Professor
Last First Middle, Title: Ford Harrison R., Professor
Title, First Last Middle: Professor, Harrison Ford R.
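A lightweight normalization pass, upper-casing the value and stripping punctuation and titles, catches many of these format clashes before a proper matching tool is brought in. A minimal sketch, assuming a hypothetical CUSTOMER table with a FULL_NAME column (Oracle-style REGEXP_REPLACE; the title list is illustrative only):

-- Normalize names: uppercase, drop a few known titles, turn punctuation into spaces, squeeze spaces
SELECT full_name,
       TRIM(REGEXP_REPLACE(
              REGEXP_REPLACE(
                REGEXP_REPLACE(UPPER(full_name), 'PROFESSOR|PROF\.|DR\.|MR\.|MRS\.', ''),
                '[,.]', ' '),
              ' {2,}', ' ')) AS normalized_name
FROM customer;

Note that this does not fix token order (Last-First versus First-Last); re-sequencing names reliably needs a parsing rule or a matching tool.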

Deliberate wrong information

Securing privacy:-

As a customer, I want to use your service, but you want me to register. I will provide wrong information as long as it does not impact me. To avoid unsolicited communication, and to avoid my telephone OR address being shared for any ulterior purpose, I will fudge all the details. This generally happens with loyalty cards, web-based registrations and 'get a prize' campaigns. The customer turns out to be smarter than the database creators.
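Some of these fudged entries can be caught with simple screens for junk values. A minimal sketch, assuming a hypothetical CUSTOMER table; the junk values listed are purely illustrative and should come from whatever profiling shows as suspiciously frequent:

-- Flag records with obviously fudged contact details (values shown are examples only)
SELECT customer_id, customer_name, phone, email
FROM customer
WHERE phone IN ('0000000000', '9999999999', '1234567890')
   OR UPPER(customer_name) LIKE 'TEST%'
   OR LOWER(email) LIKE '%@test.com'
   OR LOWER(email) LIKE '%@example.com';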

Trying to secure a service OR financial benefit

If one wants to get one more credit card from the same company, OR to apply for any scheme where only one application is allowed, one can fudge the records. This is normally done in cases where verification happens at the time of final allotment and issuance, and not at the time of application.

Avoiding scrutiny OR a past record

A blacklisted supplier can fill up his proposal specifying his name and address in a different format. Or he can use a different name (it does not take long to float a new business entity) with everything else remaining the same.

Customer convenience vs. data quality

You would like the data entry forms to have clearly marked boxes, and you can ask customers to fill them in black OR blue ink in capital letters. However, if the customer is under no obligation, chances are he/she will not fill up the form OR will not follow the rules. In a non-obligatory situation, the customer will fill up the form fast and in running handwriting, a good part of which could be illegible.

Lack of central ownership of the customer data

In any large enterprise, different business owners of geographies, channels and campaigns use different sources and databases to acquire new clients. There is no central ownership of the customer data, OR the data quality effort is sometimes sliced at the channel OR geography level. There are many examples where a customer (OR an employee OR a supplier), after rejection at one location, can get accepted at another location by making a small change in the data.

TIP: Please refer to the data quality organization. One needs to assign a data custodian to hold business ownership of a certain data group (like customer), and a data steward to ensure the quality of the customer data. Data quality needs to be owned by the business and supported by IT.

Time Decay

Customer marital status, number of dependents, income, address, etc. change over time. There is no way to ensure the currency of the data without a sustained customer contact program, supplemented by 'experiential' indicators like returned mail. You can also make 'intelligent' guesses, like a person might have 20% higher income after five years, OR a person who has crossed the age of 30 might have got married (some of these examples may not be applicable across all cultures and locations). You can use these kinds of guesses selectively to update customer data.

Technically correct vs. real life

The name on your social security card is seldom used in real life, and that gets reflected in quality issues. Some instances are as follows:

• Pronunciation – 'Sean' written as 'Shon' by the telephone customer service executive.

• Nick name – 'William' written as 'Bill', OR 'Robert' written as 'Bob' OR 'Rob'

• Conventions – 'Barrett Junior' written as 'Barrett Jr.'

• Abbreviations – 'Ann Lemont Suez' written as 'A Lemont Suez' OR 'A L Suez'

• Shuffling middle, first and last name – 'Ann Lemont Suez' written as 'Lemont A Suez' OR 'Ann L Suez'


I hope the above details give you some insight into how to categorize the different kinds of data issues.
Feel free to share your opinion.

Happy Cleansing....... :):)
Rohit...

Saturday, August 11, 2012

Data Profiling -- Industry view


Industry View of Data Profiling...

Well... today I want to talk about data profiling in my own words... :) not just copy-paste from some other site. Let's see what the standard definition of data profiling is...


"Data profiling is the process of analyzing and exploring you data, to gain better insight
and to understand if there are inconsistencies or otherwise troublesome entries in your data
."



In short... it's just analysis of data/records, but smart analysis. Why smart analysis? Because there are many ways to analyze the data: counts, counts based on some key, max and min of all the columns, their precision and width, slicing and dicing the data across different combinations of columns, and many more... more than I can list right now... ;-)



Profiling is just a concept. There is no hard and fast rule for doing it, but in the industry there are some standards. And I guess these profiling standards in turn set the standard for the deeper-down analysis.



Industry point of view


I have worked on two or three profiling projects. First of all, profiling itself is not the real challenge.
The real challenge is to set expectations with the client: what they want, in what way they want to see the data, and what insights they are looking for. And believe me, this is the real challenge, because often even the clients don't know what they want.... :):)

Let me make a list of how to set the initial standards.



1. In what form are they giving the data dump [Excel, relational tables, or just CSV files]?

2. What are the different relations between the tables? Whatever analysis you have to do is somehow related to the business [parent tables and child tables]; the client doesn't want to see just a heap of data or digits that doesn't make any sense. Whatever deliverables you give should be some kind of eye opener for them.

3. Set the standards for the different attributes across the entire table: their names, widths and precision should be consistent. Take a sign-off from the client on this, otherwise there will be lots of trouble, and a cat & mouse game will be waiting around the corner... :)

4. Narrow down the approach with the client: how they want to see the profiling results and what their objective is. Just put yourself in their shoes, and all the data/records will start making sense to you automatically.

5. Confirm whether you have to do this profiling on one fixed set of data, or whether you are going to receive more data at some periodic interval.

6. Last but not least, what is the volume of data you are going to deal with? Otherwise the profiling will become too hectic once you start.

I guess the above list is enough for starting the profiling mission... yessssss... it's a friendly war against data... LOLz

Now my next focus point is how to achieve them.

How & what

This is also a very important aspect of profiling; let me tell you why. Different companies use different ways of profiling.


PJ :- I have seen that big companies directly opt for some profiling tool like IDQ, Talend open source or BackOffice Associates, while small companies won't bother about a tool; they simply ask their developers to do this using SQL queries. But one fact: those small-company developers often know much more about this analysis than a big firm's analyst, because they see the data more closely.

So, the question is: which one is more efficient, some smart tool or our good old friendly SQL queries?

I guess I don't have a perfect answer for this either... geezzzzzz

In my opinion, it depends on the same old time-and-energy trade-off [same old concept].

But I have noticed one thing in my profiling experience: you have to use SQL queries sooner or later, because there are lots of untold secrets in the data that can't be revealed by any of the smart tools. Those untold facts are totally project-specific.
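For what it is worth, the kind of query I keep reaching for first looks like this. A minimal sketch, assuming a hypothetical CUSTOMER table with an EMAIL column, it gives the quick 'shape' of a column before any tool gets involved (FETCH FIRST is ANSI/Oracle 12c syntax; older databases use ROWNUM or LIMIT instead):

-- Quick shape of one column: volume, fill rate, cardinality, length extremes
SELECT
    COUNT(*)               AS total_rows,
    COUNT(email)           AS non_null_rows,
    COUNT(DISTINCT email)  AS distinct_values,
    MIN(LENGTH(email))     AS min_length,
    MAX(LENGTH(email))     AS max_length
FROM customer;

-- Most frequent values often reveal defaults and junk entries ('NA', 'test@test.com', ...)
SELECT email, COUNT(*) AS occurrences
FROM customer
GROUP BY email
ORDER BY occurrences DESC
FETCH FIRST 10 ROWS ONLY;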



------ Now my battery is going down. I will come back with some more stuff on this in the future. If anyone has any opinion for or against, they are welcome... :) ------


Thursday, August 2, 2012

Informatica 8 and 9 & Features of Informatica-9

Difference between Informatica 8 and 9
Architecture-wise there are no differences between 8 and 9, but here are some new features added in PowerCenter 9:

Lookup Transformation:  Cache updates. We can update the lookup cache based on the results of an expression. When an expression is true, we can add to or update the lookup cache. We can update the dynamic lookup cache with the results of an expression.

Multiple rows return:  We can configure the Lookup transformation to return all rows that match a lookup condition. A Lookup transformation is an active transformation when it can return more than one row for any given input row.

SQL overrides for uncached lookups: In previous versions, we could create a SQL override for cached lookups only. Now we can create a SQL override for an uncached lookup and include lookup ports in the SQL query.

Database deadlock resilience: In previous releases, when the Integration Service encountered a database deadlock during a lookup, the session failed. Effective in 9.0, the session will not fail. When a deadlock occurs, the Integration Service attempts to run the last statement in a lookup. We can configure the number of retry attempts and time period between attempts.

SQL transformation: Auto-commit for connections. We can enable auto-commit for each database connection. Each SQL statement in a query defines a transaction. A commit occurs when the SQL statement completes or the next statement is executed, whichever comes first.

Session Log files rollover: We can limit the size of session logs for real-time sessions. We can limit the size by time or by file size. We can also limit the number of log files for a session.

Passive transformation: We can configure the SQL transformation to run in passive mode instead of active mode. When the SQL transformation runs in passive mode, the SQL transformation returns one output row for each input row.


XML transformation: XML Parser buffer validation. The XML Parser transformation can validate an XML document against a schema. The XML Parser transformation routes invalid XML to an error port. When the XML is not valid, the XML Parser transformation routes the XML and the error messages to a separate output group that we can connect to a target.

Model Repository Service: Application service that manages the Model repository. The Model repository is a relational database that stores the metadata for projects created in Informatica Analyst and Informatica Developer. The Model repository also stores run-time and configuration information for applications deployed to a Data Integration Service. Create and enable a Model Repository Service on the Domain tab of Informatica Administrator.

Connection management: Database connections are centralized in the domain. We can create and view database connections in Informatica Administrator, Informatica Developer, or Informatica Analyst. Create, view, edit, and grant permissions on database connections in Informatica Administrator.

Deployment: We can deploy, enable, and configure deployment units in Informatica Administrator. Deploy deployment units to one or more Data Integration Services. Create deployment units in Informatica Developer.

Monitoring: We can monitor profile jobs, scorecard jobs, preview jobs, mapping jobs, and SQL Data Services for each Data Integration Service. View the status of each monitored object on the Monitoring tab of Informatica Administrator.

New Features of Informatica-9

 

1. Informatica 9 supports data integration for the cloud as well as on premise. You can integrate the data in cloud applications, as well as run Informatica 9 on cloud infrastructure.

2. Informatica analyst is a new tool available in Informatica 9.

3. There are architectural differences in Informatica 9 compared to the previous version.

4. A browser-based tool for business analysts is a new feature.

5. The data steward role is a new feature.

6. It allows unified administration with a new admin console that enables you to manage PowerCenter and PowerExchange from the same console.

7. Powerful new capabilities for data quality.

8. A single admin console for data quality, PowerCenter, PowerExchange and data services.

9. In Informatica 9, Informatica data quality (IDQ) has been further integrated with the Informatica Platform and performance, manageability and reusability have all been significantly enhanced.

10. Mapping rules are shared between the browser-based tool for analysts and the Eclipse-based development tool, leveraging unified metadata underneath.

11. The data services capabilities in Informatica 9, both over SQL and web services, can be used for real-time dashboarding.

12. Informatica Data Quality provides worldwide address validation support with integrated geocoding.

13. The ability to define rules and to view and run profiles is available in both Informatica Developer (thick client) and Informatica Analyst (browser-based thin client). These tools sit on a unified metadata infrastructure, and both incorporate security features like authentication and authorization.

14. The developer tool is now Eclipse based and supports both data integration and data quality for enhanced productivity. A browser-based tool is provided for analysts to support the types of tasks they engage in, such as profiling data, specifying and validating rules, and monitoring data quality.

15. There will be a Velocity methodology; it is going to be introduced on I9 soon.

16. Informatica has the capability to pull data from IMS, DB2 on zSeries and iSeries, and from several other legacy (mainframe) environments like VSAM, Datacom, and IDMS.

17. There are separate tools available for different roles. The Mapping Architect for Visio tool is designed for architects and developers to create templates for common data integration patterns, saving developers a tremendous amount of time.

18. Informatica 9 does not include ESB infrastructure.

19. Informatica supports open interfaces such as web services and can integrate with other tools that support these as well, including BPM tools.

20. Informatica 9 complements existing BI architectures by providing immediate access to data through data virtualization, which can supplement the data in existing data warehouses and operational data stores.

21. Informatica 9 supports profiling of mainframe data, leveraging the Informatica platform's connectivity to mainframe sources.

22. Informatica 9 will continue to support the feature of running the same workflow simultaneously.

23. An Eclipse-based environment is built for developers.

24. The browser-based tool is a fully functional interface for business analysts.

25. Dashboards are designed for business executives.

26. There are three interfaces through which these capabilities can be accessed: the Analyst tool is a browser-based tool for analysts and stewards, developers can use the Eclipse-based Developer tool, and line-of-business managers can view data quality scorecards.