Data Profiling -- Industry view
Industry View of Data Profiling
Well, today I want to talk a bit about data profiling, in my own words :) and not just copy-paste from some other site. Let's first see the standard definition of data profiling:
"Data profiling is the process of analyzing and exploring your data, to gain better insight
and to understand if there are inconsistencies or otherwise troublesome entries in your data."
In short, it is just analysis of data/records, but smart analysis. Why smart? Because there are many ways to analyze data: total counts, counts grouped by some key, the max and min of every column, their precision and width, slicing and dicing the data across different combinations of columns, and many more. More than I can list right now ;-)
Profiling is just a concept. There is no hard and fast rule for how to do it, but the industry does have some standards. I guess those standards in turn set the bar for how deep the analysis or profiling goes.
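To make a few of those checks concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `customers` table, its columns and the helper names `profile_column` and `count_by_key` are all made up for illustration, and the width is measured on the text form of each value, which is only one of several possible definitions.

```python
import sqlite3

def quote_ident(name):
    """Quote a table or column name for SQLite (names cannot be bound as parameters)."""
    return '"' + name.replace('"', '""') + '"'

def profile_column(conn, table, column):
    """Return counts, min/max and max text width for one column."""
    col, tbl = quote_ident(column), quote_ident(table)
    sql = f"""
        SELECT COUNT(*),
               COUNT({col}),
               COUNT(DISTINCT {col}),
               MIN({col}),
               MAX({col}),
               MAX(LENGTH(CAST({col} AS TEXT)))
        FROM {tbl}
    """
    keys = ["rows", "non_null", "distinct", "min", "max", "max_width"]
    return dict(zip(keys, conn.execute(sql).fetchone()))

def count_by_key(conn, table, key):
    """Return {key value: row count}, i.e. a count grouped by some key."""
    col, tbl = quote_ident(key), quote_ident(table)
    sql = f"SELECT {col}, COUNT(*) FROM {tbl} GROUP BY {col}"
    return dict(conn.execute(sql).fetchall())

# Hypothetical sample data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, city TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Pune", 120.5), (2, "Mumbai", 80.25), (3, "Pune", None), (4, None, 15.0)],
)

city_profile = profile_column(conn, "customers", "city")
balance_profile = profile_column(conn, "customers", "balance")
city_counts = count_by_key(conn, "customers", "city")
```

The gap between `rows` and `non_null` is usually the first thing worth showing a client, since missing values tend to be the first "troublesome entries" they ask about.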
Industry point of view
I have worked on two or three profiling projects. First of all, profiling itself is not the real challenge.
The real challenge is setting expectations with the client: what they want, how they want to see the data, and what insights they are looking for. And believe me, this is the real challenge, because even clients don't know what they want :)
Let me make a list of what to settle at the start.
1. In what form are they giving you the data dump [Excel, relational tables or just CSV files]?
2. What are the relationships between the tables [parent table or child table]? Whatever analysis you do has to relate to the business somehow. The client doesn't want to see a pile of data or digits that makes no sense. Whatever deliverables you give them should be some kind of eye-opener.
3. Set the standards for attributes across all the tables: their names, width and precision should be consistent. Get the client's sign-off on this, otherwise there will be lots of trouble, and a cat-and-mouse game will be waiting around the corner :)
4. Narrow down the approach with the client: how they want to see the profiling results and what their objective is. Put yourself in their shoes, and the data/records will start making sense to you automatically.
5. Confirm whether you have to profile one fixed set of data, or whether you are going to receive more data at some periodic interval.
6. Last but not least, find out the volume of data you are going to deal with, otherwise the profiling will become too hectic once you start.
I guess the above list is enough to start the profiling mission. Yes, it's a friendly war against data... LOL
Now my next focus point is how to achieve them.
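Two points from the list, the parent/child relationships and the consistency of attribute definitions, can be checked mechanically. The sketch below is one possible way to do it with Python's sqlite3 module. The `customers`, `orders` and `invoices` tables and both helper functions are hypothetical, and comparing declared column types as plain text is a simplification that a real project would probably refine.

```python
import sqlite3

def inconsistent_columns(conn):
    """Map each column name declared with more than one type to its set of types."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    declared = {}
    for table in tables:
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        for _cid, name, col_type, *_rest in conn.execute(f"PRAGMA table_info({quoted})"):
            declared.setdefault(name, set()).add(col_type.upper())
    return {name: types for name, types in declared.items() if len(types) > 1}

def orphan_orders(conn):
    """Return ids of child rows (orders) that have no matching parent row (customers)."""
    sql = """
        SELECT o.order_id
        FROM orders AS o
        LEFT JOIN customers AS c ON c.customer_id = o.customer_id
        WHERE c.customer_id IS NULL
        ORDER BY o.order_id
    """
    return [row[0] for row in conn.execute(sql)]

# Hypothetical schema and data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name VARCHAR(50));
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount DECIMAL(10,2));
    CREATE TABLE invoices (invoice_id INTEGER, order_id INTEGER, amount DECIMAL(12,4));
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES (10, 1, 99.5), (11, 3, 20.0), (12, 2, 5.25);
""")

mismatches = inconsistent_columns(conn)  # "amount" is declared two different ways
orphans = orphan_orders(conn)            # order 11 points at a missing customer
```

Both results are lists of exceptions rather than raw numbers, which is usually what a client can act on when you ask for sign-off.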
How & what
This is also a very important aspect of profiling. Let me tell you why: different companies profile in different ways.
PJ: I have seen that big companies go straight for a profiling tool like IDQ, Talend open source or BackOffice Associates. Small companies don't bother with a tool; they simply ask their developers to do it with SQL queries. But one fact is that those small-company developers often know much more about this analysis than a big company's analysts, because they see the data more clearly.
So the question is: which one is more efficient, some smart tool or our good old friendly SQL queries?
I guess I don't have a perfect answer for this either... geez.
In my opinion, it depends on how much time and energy you have [the same old trade-off].
But I have noticed one thing in my profiling experience: you have to use SQL queries sooner or later, because the data holds lots of untold secrets that none of the smart tools can reveal. Those untold facts depend entirely on the project.
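As an example of the kind of project-specific question that tends to end up as a hand-written query, here is a small sketch, again with Python's sqlite3 module. The `orders` table, its date columns and the rule itself (an order cannot ship before it was placed) are invented for illustration; every project will have its own rules.

```python
import sqlite3

def shipped_before_ordered(conn):
    """Return ids of orders whose ship date is earlier than their order date."""
    sql = """
        SELECT order_id
        FROM orders
        WHERE ship_date < order_date  -- ISO dates stored as text compare correctly
        ORDER BY order_id
    """
    return [row[0] for row in conn.execute(sql)]

# Hypothetical data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, ship_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2024-01-10", "2024-01-12"),  # fine
        (2, "2024-02-05", "2024-02-01"),  # shipped before it was ordered
        (3, "2024-03-01", None),          # not shipped yet, so not flagged
    ],
)

suspicious = shipped_before_ordered(conn)
```

Rows with a missing ship date are skipped because a comparison with NULL is never true in SQL. Whether those rows also count as a problem is exactly the sort of thing to agree with the client.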
Now my battery is running low. I will come back with more on this in the future. If anyone has an opinion for or against, you are welcome to share it :)