LostInTechnicality: Way of Profiling Analysis

Basically 3 kind of analysis.

1. Business Rule

2. Functional Depedency

3. Column set analysis

Aggregation

The Aggregation tab displays all the essential measures for each column, such as minimum and maximum values, number of distinct values,

and null values. Some measures are available only for specific data types. These include the average, median, and standard deviation measures.

Information can be viewed from either the Tabular or Graphical subtabs.

Measurement Description

Column Name of the column, within the profiled data object, for which data profiling determined the aggregation measures

Minimum Minimum value for the inherent database ordering for the column

Maximum Maximum value for the inherent database ordering of the column

#Distinct Total number of distinct values for the column

% Distinct Percentage of distinct values in the column over the entire row set

Not Null Indicates if a NOT NULL constraint is defined in the database for the column

RecommendedNOT NULL Indicates if data profiling results recommend that the column enable null values. A value of Yes represents a recommendation that this column should not enable null values

#Nulls Total number of null values for the column

%Nulls Percentage of null values, for the column, over the entire row set

Six Sigma Number of null values (defects) to the total number of rows in the table (opportunities)

Average Average value, for the column, in the entire row set

Median Median value, for the column, in the entire row set

Std Dev Standard deviation for the column

Data Type

The Data Type tab provides profiling results about data types. This includes metrics such as length for character data types and the precision and scale for numeric data types. For each data type that is discovered, the data type is compared to the dominant data type found in the entire attribute and the percentage of rows that follows the dominant measure is listed. One example of data type profiling would be finding a column defined as VARCHAR that stores only numeric values. Changing the data type of the column to NUMBER would make storage and processing more efficient.

Measurement Description

Columns Name of the column, within the data object, for which data type analysis was performed

Documented Data Type Data type of the column in the source object

Dominant Data Type From analyzing the column values, data profiling determines that this is the dominant (most frequent) data type.

%Dominant Data Type Percentage of total number of rows where column value has the dominant data type

Documented Length Length of the data type in the source object

Minimum Length Minimum length of the data stored in the column

Maximum Length Maximum length of the data stored in the column

Dominant Length From analyzing the column values, data profiling determines that this is the dominant (most frequent) length.

%Dominant Length Percentage of total number of rows where column value has the dominant length

Documented Precision Precision of the data type in the source object

Minimum Precision Minimum precision for the column in the source object

Maximum Precision Maximum precision for the column in the source object

Dominant Precision From analyzing the column values, data profiling determines that this is the dominant (most frequent) precision

%Dominant Precision Percentage of total number of rows where column value has the dominant precision

Documented Scale Scale specified for the data type in the source object

Minimum Scale Minimum scale of the data type in the source object

Maximum Scale Maximum scale of the data type in the source object

Dominant Scale From analyzing the column values, data profiling determines that this is the dominant (most frequent) scale.

%Dominant Scale Percentage of total number of rows where column value has the dominant scale.

Domain

The Domain tab displays results about the possible set of values that exist in a certain attribute. Information can be viewed from either the Tabular or Graphical subtabs.

The process of discovering a domain on a column involves two phases. First, the distinct values in the column are used to determine whether that column might be defined by a domain. Typically, there are few distinct values in a domain. Then, if a potential domain is identified, the count of distinct values is used to determine whether that distinct value is compliant with the domain. The properties that control the threshold for both phases of domain discovery can be set in the Property Inspector. If you find a result to know more about, drill down and use the Data Drill panel to view details about the cause of the result. For example, a domain of four values was found for the column REGION_ID: 3,2,4, and 1. To see which records contributed to this finding, select the REGION_ID row and view the details in the Data Drill panel.

Domain Results:-

Measurement Description

Column Name of the column for which domain discovery was performed

Found Domain The discovered domain values

% Compliant The percentage of all column values that are compliant with the discovered domain values

Six Sigma The Six Sigma value for the domain results

Pattern

The Pattern tab displays information discovered about patterns within the attribute. Pattern discovery is the profiler's attempt at generating regular expressions for data that it discovered for a specific attribute. Non-English characters are not supported in the pattern discovery process.

Pattern Results:-

Measurement Description

Columns Name of the column for which pattern results were discovered

Dominant Character Pattern The most frequently discovered character pattern or consensus pattern

% Compliant The percentage of rows whose data pattern agrees with the dominant character pattern

Dominant Word Pattern The most frequently discovered word character pattern or consensus pattern

% Compliant The percentage of rows whose data pattern agrees with the dominant word pattern

Common Format Name, Address, Date, Boolean, Social Security Number, E-mail, URL. This is the profiler's attempt to add semantic understanding to the data that it sees. Based on patterns and some other techniques, it tries to determine which domain bucket a certain attribute's data belongs to.

% Compliant The percentage of rows whose data pattern agrees with the consensus common format pattern

Unique Key

The Unique Key tab provides information about the existing unique keys that were documented in the data dictionary, and possible unique keys or key combinations that were detected by the data profiling operation. The uniqueness % is shown for each. The unique keys that have No in the Documented column are the ones that are discovered by data profiling.

For example, a phone number is unique in 98% of the records. It can be a unique key candidate, and you can then cleanse the noncompliant records. You can also use the drill-down feature to view the cause of the duplicate phone numbers in the Data Drill panel. Table 18-7 describes the various measurement results available in the Unique Key tab.

Measurement Description

Unique Key The discovered unique key

Documented Indicates if a unique key on the column exists in the data dictionary.

A value of Yes indicates that a unique key exists in the data dictionary.

A value of No indicates that the unique key was discovered because of data profiling.

Discovered From analyzing the column values, data profiling determines whether a unique key should be created on the column listed in the Local Attribute(s) column.

Local Attribute(s) The name of the column in the data object that was profiled.

# Unique The number of rows, in the source object, in which the attribute represented by Local Attribute is unique

% Unique The percentage of rows, in the source object, for which the attribute represented by Local Attribute is unique

Six Sigma Number of null values (defects) to the total number of rows in the table (opportunities)

Functional Dependency

The Functional Dependency tab displays information about the attribute or attributes that seem to depend on or determine other attributes. Information can be viewed from either the Tabular or Graphical subtabs. You can use the Show list to change the focus of the report. Unique keys defined in the database are not discovered as functional dependencies during data profiling.

Functional Dependency Results:-

Measurement Description

Determinant Name of the attribute that is found to determine the attribute listed under Dependent

Dependent Name of the attribute that is found to be determined by the value of another attribute

# Defects Number of values in the Determinant attribute that were not determined by the Dependent attribute

% Compliant Percentage of values that are compliant with the discovered dependency

Six Sigma Six Sigma value

Type Type of functional dependency. Possible values are unidirectional or bidirectional

Referential

The Referential tab displays information about foreign keys that were documented in the data dictionary, and relationships discovered during profiling.

For each relationship, you can see the level of compliance. Information can be viewed from both the Tabular and Graphical subtabs. In addition, two other subtabs are available only in the Referential tab: Joins and Redundant Columns.

Referential Results:-

Measurement Description

Relationship Name of the relationship

Type Type of relationship. The possible values are Row Relationship and Foreign Key.

Documented Indicates if a foreign key exists on the column in the data dictionary, Yes indicates that a foreign key on the column exists in the data dictionary. No indicates that the foreign key was discovered because of data profiling.

Discovered From analyzing the column values, data profiling determines whether a foreign key should be created on the column represented by Local Attribute(s).

Local Attribute(s) Name of the attribute in the source object

Remote Key Name of the key in the referenced object to which the local attribute refers

Remote Attribute(s) Name of the attributes in the referenced object

Remote Relation Name of the object that the source object references

Remote Module Name of the module that contains the referenced object

Cardinality Range Range of the cardinality between two attributes.For example, the EMP table contains 5 rows of employee data.

There are two employees each in department 10 and 20 and one employee in department 30.

The DEPT table contains three rows of department data with deptno value 10, 20, and 30.

Data profiling finds a row relationship between the EMP and DEPT tables. The cardinality range is 1-2:1-1.

This is because in EMP, the number of rows per distinct value ranges from 1 (for deptno 30) to 2 (deptno 10 and 20).

In DEPT, there is only one row for each distinct value (10, 20, and 30).

# Orphans Number of orphan rows in the source object

% Compliant Percentage of values that are compliant with the discovered dependency

Six Sigma Number of null values (defects) to the total number of rows in the table (opportunities)

Note:- The Joins subtab displays a join analysis on the reference selected in the Reference list.

The results show the relative size and exact counts of the three possible outcomes for referential relationships: joins, orphans, and childless objects.

LostInTechnicality

Monday, August 20, 2012

Way of Profiling Analysis

0 Comments:

Post a Comment

About Me

Previous Posts