Monday, August 20, 2012

Way of Profiling Analysis

Basically 3 kind of analysis.
1.       Business Rule
2.       Functional Depedency
3.       Column set analysis


Aggregation
The Aggregation tab displays all the essential measures for each column, such as minimum and maximum values, number of distinct values,
and null values. Some measures are available only for specific data types. These include the average, median, and standard deviation measures.
Information can be viewed from either the Tabular or Graphical subtabs.


Measurement                   Description
               
Column                                Name of the column, within the profiled data object, for which  data profiling determined the aggregation measures
Minimum                            Minimum value for the inherent database ordering for the column
Maximum                           Maximum value for the inherent database ordering of the column
#Distinct                              Total number of distinct values for the column
% Distinct                            Percentage of distinct values in the column over the entire row set
Not Null                               Indicates if a NOT NULL constraint is defined in the database for the column
RecommendedNOT NULL Indicates if data profiling results recommend that the column enable null values. A value of Yes represents a recommendation that this column should not enable null values
#Nulls                                   Total number of null values for the column
%Nulls                                  Percentage of null values, for the column, over the entire row set
Six Sigma                             Number of null values (defects) to the total number of rows in the table (opportunities)
Average                               Average value, for the column, in the entire row set
Median                                Median value, for the column, in the entire row set
Std Dev                                Standard deviation for the column



Data Type

The Data Type tab provides profiling results about data types. This includes metrics such as length for character data types and the precision and scale for numeric data types. For each data type that is discovered, the data type is compared to the dominant data type found in the entire attribute and the percentage of rows that follows the dominant measure is listed. One example of data type profiling would be finding a column defined as VARCHAR that stores only numeric values.  Changing the data type of the column to NUMBER would make storage and processing more efficient.
  
Measurement                   Description
Columns                              Name of the column, within the data object, for which data type   analysis was performed
Documented Data Type                Data type of the column in the source object
Dominant Data Type       From analyzing the column values, data profiling determines that this is the dominant (most frequent) data type.
%Dominant Data Type   Percentage of total number of rows where column value has the dominant data type
Documented Length      Length of the data type in the source object
Minimum Length             Minimum length of the data stored in the column
Maximum Length            Maximum length of the data stored in the column
Dominant Length             From analyzing the column values, data profiling determines that this is the dominant (most frequent) length.
%Dominant Length         Percentage of total number of rows where column value has the dominant length
Documented Precision  Precision of the data type in the source object
Minimum Precision         Minimum precision for the column in the source object
Maximum Precision        Maximum precision for the column in the source object
Dominant Precision         From analyzing the column values, data profiling determines that this is the dominant (most frequent) precision
%Dominant Precision     Percentage of total number of rows where column value has the dominant                                                        precision
Documented Scale          Scale specified for the data type in the source object
Minimum Scale                 Minimum scale of the data type in the source object
Maximum Scale                                Maximum scale of the data type in the source object
Dominant Scale                 From analyzing the column values, data profiling determines that this is the                                                        dominant (most frequent) scale.
%Dominant Scale             Percentage of total number of rows where column value has the dominant scale.

Domain
The Domain tab displays results about the possible set of values that exist in a certain attribute. Information can be viewed from either the Tabular or Graphical subtabs.

The process of discovering a domain on a column involves two phases. First, the distinct values in the column are used to determine whether that column might be defined by a domain. Typically, there are few distinct values in a domain. Then, if a potential domain is identified, the count of distinct values is used to determine whether that distinct value is compliant with the domain. The properties that control the threshold for both phases of domain discovery can be set in the Property Inspector. If you find a result to know more about, drill down and use the Data Drill panel to view details about the cause of the result. For example, a domain of four values was found for the column REGION_ID: 3,2,4, and 1. To see which records contributed to this finding, select the REGION_ID row and view the details in the Data Drill panel.

Domain Results:-

Measurement                   Description

Column                              Name of the column for which domain discovery was performed
Found Domain                  The discovered domain values
% Compliant                       The percentage of all column values that are compliant with the  discovered domain values
Six Sigma                             The Six Sigma value for the domain results

Pattern
The Pattern tab displays information discovered about patterns within the attribute. Pattern discovery is the profiler's attempt at generating regular expressions for data that it discovered for a specific attribute. Non-English characters are not supported in the pattern discovery process.

Pattern Results:-

Measurement                                   Description
Columns                                              Name of the column for which pattern results were discovered
Dominant Character Pattern       The most frequently discovered character pattern or   consensus pattern
% Compliant                                       The percentage of rows whose data pattern agrees with the dominant character pattern
Dominant Word Pattern               The most frequently discovered word character pattern or consensus pattern
% Compliant                                       The percentage of rows whose data pattern agrees with the dominant word pattern
Common Format                              Name, Address, Date, Boolean, Social Security Number, E-mail, URL. This is the profiler's attempt to add semantic understanding to the data that it sees. Based on patterns and some other techniques, it tries to determine which domain bucket a certain attribute's data belongs to.
% Compliant                                       The percentage of rows whose data pattern agrees with the consensus common format pattern

Unique Key
The Unique Key tab provides information about the existing unique keys that were documented in the data dictionary, and possible unique keys or key combinations that were detected by the data profiling operation. The uniqueness % is shown for each. The unique keys that have No in the Documented column are the ones that are discovered by data profiling.

For example, a phone number is unique in 98% of the records. It can be a unique key candidate, and you can then cleanse the noncompliant records. You can also use the drill-down feature to view the cause of the duplicate phone numbers in the Data Drill panel. Table 18-7 describes the various measurement results available in the Unique Key tab.

Measurement                                   Description
Unique Key                                        The discovered unique key
Documented                                     Indicates if a unique key on the column exists in the data dictionary.
                                                                A value of Yes indicates that a unique key exists in the    data dictionary.
A value of No indicates that the unique key was discovered because of data profiling.
Discovered                                         From analyzing the column values, data profiling determines whether a unique key should be created on the column listed in the Local Attribute(s) column.
Local Attribute(s)                             The name of the column in the data object that was profiled.
# Unique                                             The number of rows, in the source object, in which the attribute represented by Local Attribute is unique
% Unique                                            The percentage of rows, in the source object, for which the attribute represented by Local Attribute is unique
Six Sigma                                             Number of null values (defects) to the total number of rows in the table (opportunities)



Functional Dependency
The Functional Dependency tab displays information about the attribute or attributes that seem to depend on or determine other attributes. Information can be viewed from either the Tabular or Graphical subtabs. You can use the Show list to change the focus of the report. Unique keys defined in the database are not discovered as functional dependencies during data profiling.

Functional Dependency Results:-

Measurement                                   Description
Determinant                                      Name of the attribute that is found to determine the attribute listed under Dependent
Dependent                                         Name of the attribute that is found to be determined by the value of another attribute
# Defects                                             Number of values in the Determinant attribute that were not determined by the Dependent attribute
% Compliant                                       Percentage of values that are compliant with the discovered dependency
Six Sigma                                            Six Sigma value
Type                                                      Type of functional dependency. Possible values are unidirectional or bidirectional           


Referential
The Referential tab displays information about foreign keys that were documented in the data dictionary, and relationships discovered during profiling.
For each relationship, you can see the level of compliance. Information can be viewed from both the Tabular and Graphical subtabs.  In addition, two other subtabs are available only in the Referential tab: Joins and Redundant Columns.

Referential Results:-

Measurement                                   Description
Relationship                                       Name of the relationship
Type                                                      Type of relationship. The possible values are Row Relationship and Foreign Key.
Documented                                     Indicates if a foreign key exists on the column in the data dictionary, Yes indicates that a foreign key on the column exists in the data dictionary. No indicates that the foreign key was discovered because of data profiling.
Discovered                                         From analyzing the column values, data profiling determines whether a foreign key should be created on the column represented by Local Attribute(s).
Local Attribute(s)                             Name of the attribute in the source object
Remote Key                                       Name of the key in the referenced object to which the local attribute refers
Remote Attribute(s)                       Name of the attributes in the referenced object
Remote Relation                              Name of the object that the source object references
Remote Module                               Name of the module that contains the referenced object
Cardinality Range                             Range of the cardinality between two attributes.For example, the EMP table contains 5 rows of employee data.
There are two employees each in department 10 and 20 and one employee in department 30.
The DEPT table contains three rows of department data with deptno value 10, 20, and 30.

Data profiling finds a row relationship between the EMP and DEPT tables. The cardinality range is 1-2:1-1.
This is because in EMP, the number of rows per distinct value ranges from 1 (for deptno 30) to 2 (deptno 10 and 20).
                                                                In DEPT, there is only one row for each distinct value (10, 20, and 30).

# Orphans                                           Number of orphan rows in the source object
% Compliant                                       Percentage of values that are compliant with the discovered dependency
Six Sigma                                             Number of null values (defects) to the total number of rows in the table (opportunities)


Note:- The Joins subtab displays a join analysis on the reference selected in the Reference list.
The results show the relative size and exact counts of the three possible outcomes for referential relationships: joins, orphans, and childless objects.

0 Comments:

Post a Comment

Subscribe to Post Comments [Atom]

<< Home