Basic Data Cleansing Issue
Today I thought to write something about the major issues we
face while data cleansing. Profiling is just the starting point. It gives the
overview of all data discrepancies, but for cleansing purpose, we need to take
a close look at the records, try to find out the patterns. And to find out the
pattern, you need to think from the real life scenarios, what could be the
problem. These records are gathered from different sources and god knows only,
how many years it has taken to build this much of data.
So in today article, I gathered some info from different
websites. Below article is not about how to cleanse the data. Its about, how to
identify the bad records and you can gather more meaningful stats of bad data
and categorize into different sections,
One more point, I want to highlight, that these records
could be of any domain like customer, vendor, product etc. and also these all
records belongs to different countries and all the countries has their own way
of formatting the fields like, name, address, state, Zip Code and many
different fields. And the main fight is about to bring everything on the same
page, in one consistent format.
There will be sometime, you will think like, some field
containing special symbols might be wrong data, but its not. But while
finalizing the validation rules for each fields, it depends on the client, how
he want to see the records.
Below article is based on Customer domain.
Customer Data Challenges
Customer Data Quality has additional set of challenges over
and above the typical data quality issues
Names, when spoken, written and especially when entered into
a computer system are subject to considerable variation and error. Although
this error and variation can be reduced, it can't be entirely eliminated. Data
such as dates of birth, ages, phone numbers and identity numbers are all used
and all subject to unavoidable error and variation. Please refer to reasons for
data quality challenges along with the following points. Along with these
challenges, we have given the TIPs to avoid some of such issues. In real-life,
you can only reduce but cannot eliminate these causes.
General Life can go on with bad customer data
In fact, people can cope quite well with errors in their
name. In the case of addressing variations, postal workers also still manage to
deliver the mail despite some horrendous errors., but how do computer systems
cope with such error and variation specifically, how do they competently search
and match this type of data? The answer is- if not given the appropriate
attention, not all that well.
Impossible to define all rules for correct names and
customer addresses
This is fundamentally because data integrity in computer
systems relies on rules, constraints and reference data to enforce data
accuracy. It is easy to see how error and
variation creeps in because:
• Names and addresses easily disobey rules and constraints,
without the person entering the data even being aware.
• There are no complete dictionaries for names and addresses
for computer systems to use.
The vocabulary in use for people’s first names includes in
excess of 2,500,000 words in the USA
alone. Yet as much as 80% of the population may have names from as few as 500
words. Family names are just as unevenly distributed.
TIP- Many systems have online search capability based upon
the search algorithms, whereby they apply the algorithms of customer data
search and matching and identify the possible matches to the name. It will
allow you to check on the possible list, and see if the customer already
exists.
Data Entry, Data Model and Processing Issues
Lack of Granularity: example- As Data model did not
account for it, the address, telephone, ZIP are all clustered in a single
field.
Lack of realism: example- The entry makes it
mandatory to enter the middle name OR telephone OR Fax number, which in real
world may not exist, leading to wrong entry.
Data Domain Leniency: Example-
City is not selected out of a drop
down, but is allowed free form leading to NY/New York/York .
Simple Data Entry Mistake
Example- Enter Fenmark instead of Denmark
Right field, but wrong purpose
Example- Enter street name in district, city in state field
Spill over data
Example- The name is too long and it spills over into the
address field OR address field spills over into the state field.
Variations in Abbreviations
Example- The Data entry person OR the person go by his /her
flow of abbreviations.
Avenue-AV-AVN/ WDC , WASH
DC.
Mis-spellings due to Phonetics similarities
Example- This happens primarily during recordings of
Telephone conversation.
TIP- Most of the above can be included in the input
controls. These kind of controls like drop down of city, state etc, can make
the page heavy as the data related to these drop downs is part of the page.
However, given todays network and broad-band capacities, it does not have an
impact on the response time.
Total control and training is not possible
When talking about an organization’s internal systems,
attempts are made to control this problem at its source through the management
and training of the people who enter OR use the data. The other way to control
is through the design of the forms, screens and processes that capture it.
There is a lot of discussion, theory and good practical advice available on
this subject.
Unfortunately, the person entering the data OR the data user may be totally unaware that he is entering data that contains some error and variation. Also, not all situations allow this level of control. In Web based systems the person entering the data is usually outside the normal controls of the organization. If we hope to entice international customers OR interest, the web forms themselves need to allow entry of varying data formats without rigid constraints. Similarly, the quality of data files sourced from a third party is usually out of the control of the recipient organization.
Lack of Data Standards and Formats
This has been covered at length in Data Quality Assurance.
When you have different formats used by different systems OR even by different
modules in the same system, it leads to problem in matching.
Field format
Field
data
First-middle-last-title
Harrison R Ford
Professor
Last-First -Middle comma title Ford Harrison R., Professor
Title- Comma- First- Last Middle Professor,
Harrison Ford R.
Deliberate wrong information
Securing privacy:-
As a customer, I want to use your service, but you want me
to register. I will provide all the wrong information as long as it does not
impact me. So to avoid unsolicited communication and not to share my telephone
OR address for any ulterior purpose, I would fudge all the details. This
generally happens for loyalty cards, web based registrations, 'get a prize'
campaigns. Customer turns out to be smarter than the Database creators.
Trying to secure a service OR financial benefit
If one wants to get one more credit card from the same
company OR apply for any scheme where only one application is allowed, you can
fudge your records. This is done normally in those cases where the verification
is done at the time of final allotment and issuance and not on the
verification.
Avoid scrutiny OR past record
A black listed supplier can fill-up his proposal specifying
his name and address in a different format. Or he can place a different name
(it does not take long to float a new business entity) with the other things
remaining the same.
Customer convenience vs. the data quality
You would like to have the data entry forms with en-marked
boxes, and you can ask the customers to fill in black OR blue ink in capital
letters. However, if customer has no obligation, chances are the he/she will
not fill-up OR not follow the rules. In a non-obligatory situation, customer
will fill-up the forms fast and in running hand-writing, good part of which
could be illegible.
Lack of central ownership of the customer data
In any large enterprise, different business owners of
geographies, channels and campaigns use different sources and database to
acquire new clients. There is no central ownership of the customer data, OR
even the data quality effort is sometime sliced at channel OR geography level.
There are many examples where the customer (OR an employee OR a supplier)
post-rejection at one location, can get acceptance at other location by making
little change in the data.
TIP: Please refer data quality organization. One needs to assign a data custodian for having the business ownership of certain data group (like customer), and data steward to ensure the data quality of customer data. Data quality needs to be owned by business and to be supported by IT.
Time Decay
Customer marital status, number of dependents, income,
address etc. get changed over time. There is no way that you can ensure the
currency of the data without a sustained customer contact program and also
through 'experiential ' indicators like return mail. You can also have 'intelligent'
guesses like person might be having 20% higher income over five years OR if a
person has crossed age of 30 he might have become married. (some of the
examples may not be applicable across all cultures and locations). You can use
these kind of guesses selectively to update customer data.
Technically correct vs. real life
A name in your social security card is seldom used in
real-life and that gets reflected in the quality issues. The instances are as
follows:
• Pronunciation – 'Sean' written as 'Shon' by the tele
customer service executive.
• Nick Name – 'Willian' written as 'Bill' OR 'Robert'
written as 'Bob' OR 'rob'
• Conventions – 'Barrett Junior' written as 'Barrett Jr.'
• Abbreviations – 'Ann Lemont Suez' written as 'A Lemont
Suez' OR 'A L Suez'
• Shuffling middle, first and last name- 'Ann Lemont Suez'
written as 'Lemont A Suez' OR 'Ann
L Suez'
I hope..above details will give you some insight on how to categorize the different kind of data issue.
Feel free to share you opinion.
Happy Cleansing....... :):)
Rohit...
0 Comments:
Post a Comment
Subscribe to Post Comments [Atom]
<< Home