Major Data Cleansing Challenge
Today I thought to write something about the major issues we face while data cleansing. Profiling is just the starting point. It gives the overview of all data discrepancies, but for cleansing purpose, we need to take a close look at the records, try to find out the patterns. And to find out the pattern, you need to think from the real life scenarios, what could be the problem. These records are gathered from different sources and god knows only, how many years it has taken to build this much of data.
So in today article, I gathered some info from different websites. Below article is not about how to cleanse the data. Its about, how to identify the bad records and you can gather more meaningful stats of bad data and categorize into different sections,
One more point, I want to highlight, that these records could be of any domain like customer, vendor, product etc. and also these all records belongs to different countries and all the countries has their own way of formatting the fields like, name, address, state, Zip Code and many different fields. And the main fight is about to bring everything on the same page, in one consistent format.
There will be sometime, you will think like, some field containing special symbols might be wrong data, but its not. But while finalizing the validation rules for each fields, it depends on the client, how he want to see the records.
Below article is based on Customer domain.
Customer Data Challenges
Customer Data Quality has additional set of challenges over and above the typical data quality issues
Names, when spoken, written and especially when entered into a computer system are subject to considerable variation and error. Although this error and variation can be reduced, it can't be entirely eliminated. Data such as dates of birth, ages, phone numbers and identity numbers are all used and all subject to unavoidable error and variation. Please refer to reasons for data quality challenges along with the following points. Along with these challenges, we have given the TIPs to avoid some of such issues. In real-life, you can only reduce but cannot eliminate these causes.
General Life can go on with bad customer data
In fact, people can cope quite well with errors in their name. In the case of addressing variations, postal workers also still manage to deliver the mail despite some horrendous errors., but how do computer systems cope with such error and variation specifically, how do they competently search and match this type of data? The answer is- if not given the appropriate attention, not all that well.
Impossible to define all rules for correct names and customer addresses
This is fundamentally because data integrity in computer systems relies on rules, constraints and reference data to enforce data accuracy. It is easy to see how error and
variation creeps in because:
• Names and addresses easily disobey rules and constraints, without the person entering the data even being aware.
• There are no complete dictionaries for names and addresses for computer systems to use.
The vocabulary in use for people’s first names includes in excess of 2,500,000 words in the USA alone. Yet as much as 80% of the population may have names from as few as 500 words. Family names are just as unevenly distributed.
TIP- Many systems have online search capability based upon the search algorithms, whereby they apply the algorithms of customer data search and matching and identify the possible matches to the name. It will allow you to check on the possible list, and see if the customer already exists.
Data Entry, Data Model and Processing Issues
Lack of Granularity: example- As Data model did not account for it, the address, telephone, ZIP are all clustered in a single field.
Lack of realism: example- The entry makes it mandatory to enter the middle name OR telephone OR Fax number, which in real world may not exist, leading to wrong entry.
Data Domain Leniency: Example- City is not selected out of a drop down, but is allowed free form leading to NY/New York/York .
Simple Data Entry Mistake
Example- Enter Fenmark instead of Denmark
Right field, but wrong purpose
Example- Enter street name in district, city in state field
Spill over data
Example- The name is too long and it spills over into the address field OR address field spills over into the state field.
Variations in Abbreviations
Example- The Data entry person OR the person go by his /her flow of abbreviations.
Avenue-AV-AVN/ WDC , WASH DC.
Mis-spellings due to Phonetics similarities
Example- This happens primarily during recordings of Telephone conversation.
TIP- Most of the above can be included in the input controls. These kind of controls like drop down of city, state etc, can make the page heavy as the data related to these drop downs is part of the page. However, given todays network and broad-band capacities, it does not have an impact on the response time.
Total control and training is not possible
When talking about an organization’s internal systems, attempts are made to control this problem at its source through the management and training of the people who enter OR use the data. The other way to control is through the design of the forms, screens and processes that capture it. There is a lot of discussion, theory and good practical advice available on this subject.
Unfortunately, the person entering the data OR the data user may be totally unaware that he is entering data that contains some error and variation. Also, not all situations allow this level of control. In Web based systems the person entering the data is usually outside the normal controls of the organization. If we hope to entice international customers OR interest, the web forms themselves need to allow entry of varying data formats without rigid constraints. Similarly, the quality of data files sourced from a third party is usually out of the control of the recipient organization.
Lack of Data Standards and Formats
This has been covered at length in Data Quality Assurance. When you have different formats used by different systems OR even by different modules in the same system, it leads to problem in matching.
Field format Field data
First-middle-last-title Harrison R Ford Professor
Last-First -Middle comma title Ford Harrison R., Professor
Title- Comma- First- Last Middle Professor, Harrison Ford R.
Deliberate wrong information
Securing privacy:-
As a customer, I want to use your service, but you want me to register. I will provide all the wrong information as long as it does not impact me. So to avoid unsolicited communication and not to share my telephone OR address for any ulterior purpose, I would fudge all the details. This generally happens for loyalty cards, web based registrations, 'get a prize' campaigns. Customer turns out to be smarter than the Database creators.
Trying to secure a service OR financial benefit
If one wants to get one more credit card from the same company OR apply for any scheme where only one application is allowed, you can fudge your records. This is done normally in those cases where the verification is done at the time of final allotment and issuance and not on the verification.
Avoid scrutiny OR past record
A black listed supplier can fill-up his proposal specifying his name and address in a different format. Or he can place a different name (it does not take long to float a new business entity) with the other things remaining the same.
Customer convenience vs. the data quality
You would like to have the data entry forms with en-marked boxes, and you can ask the customers to fill in black OR blue ink in capital letters. However, if customer has no obligation, chances are the he/she will not fill-up OR not follow the rules. In a non-obligatory situation, customer will fill-up the forms fast and in running hand-writing, good part of which could be illegible.
Lack of central ownership of the customer data
In any large enterprise, different business owners of geographies, channels and campaigns use different sources and database to acquire new clients. There is no central ownership of the customer data, OR even the data quality effort is sometime sliced at channel OR geography level. There are many examples where the customer (OR an employee OR a supplier) post-rejection at one location, can get acceptance at other location by making little change in the data.
TIP: Please refer data quality organization. One needs to assign a data custodian for having the business ownership of certain data group (like customer), and data steward to ensure the data quality of customer data. Data quality needs to be owned by business and to be supported by IT.
Time Decay
Customer marital status, number of dependents, income, address etc. get changed over time. There is no way that you can ensure the currency of the data without a sustained customer contact program and also through 'experiential ' indicators like return mail. You can also have 'intelligent' guesses like person might be having 20% higher income over five years OR if a person has crossed age of 30 he might have become married. (some of the examples may not be applicable across all cultures and locations). You can use these kind of guesses selectively to update customer data.
Technically correct vs. real life
A name in your social security card is seldom used in real-life and that gets reflected in the quality issues. The instances are as follows:
• Pronunciation – 'Sean' written as 'Shon' by the tele customer service executive.
• Nick Name – 'Willian' written as 'Bill' OR 'Robert' written as 'Bob' OR 'rob'
• Conventions – 'Barrett Junior' written as 'Barrett Jr.'
• Abbreviations – 'Ann Lemont Suez' written as 'A Lemont Suez' OR 'A L Suez'
• Shuffling middle, first and last name- 'Ann Lemont Suez' written as 'Lemont A Suez' OR 'Ann L Suez'