These two simple rules turn 40,000 rows of phone numbers into a short list of different
phone number formats, and we can see which ones seem valid and which seem to have problems.
By adding one more rule, eliminating a few characters that do not directly affect the phone
number, we can reduce the list even further.
The last rule ignores the space, open and close bracket, and dash characters, and that
simplifies the different formats in the analysis to a mere five:
1. 10 digits
2. 11 digits
3. 9 digits
4. A 7-character string (drilling down by clicking on this bar reveals the word
   "unknown", not so useful.)
5. Missing.
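For readers who want to try the idea outside a profiling tool, here is a minimal Python sketch of the same kind of format profile. It is not how Datamartist implements its rules; the function names, the label wording, and the sample values are assumptions made up for illustration. It drops the ignored characters (space, brackets, dash), then labels each value by its length and by whether it is all digits.

```python
from collections import Counter

# Characters the last rule ignores: space, open and close brackets, dash.
IGNORED = str.maketrans("", "", " ()-")


def format_of(value):
    """Return a coarse format label for one phone number value (hypothetical helper)."""
    if value is None or value.strip() == "":
        return "Missing"
    cleaned = value.translate(IGNORED)
    if cleaned == "":
        # Assumption: a value made only of ignored characters counts as missing.
        return "Missing"
    if cleaned.isascii() and cleaned.isdigit():
        return f"{len(cleaned)} digits"
    return f"{len(cleaned)} character string"


def profile(values):
    """Count how many values fall into each format label."""
    return Counter(format_of(v) for v in values)


# Made-up sample values, one or two per format.
sample = ["(613) 555-0142", "1 613 555 0142", "613555014", "unknown", "", None]
counts = profile(sample)
for label, n in counts.most_common():
    print(label, n)
```

Run over a full column, the resulting counts play the role of the bars in the chart: a handful of labels, each with the number of rows behind it.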
Of course, data format profiling does not always give a final answer regarding data quality. In
this case, just because a phone number has the right number of digits does not mean that it
is a valid phone number, and even if it's a valid phone number, it may not be the right phone
number for that customer.
Using data format patterns to examine the contents of a string column is a very useful way to
start to understand what's in the column. More than just a Yes-No result, it actually gives you
a visual look at what types of issues exist.
The analysis here was done using the Datamartist tool, an easy to use data profiler and data
transformation tool. To try making some data format rules for your own data, give the free
trial a try.
Up next in our data profiling series of blog posts: an even more powerful (although often
more complex) technique called "Regular Expressions" or "Regex". This specification
language can define complex rules that analyze strings and determine if they belong to a
given set (say, "Valid product codes" as in this example) or not.
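As a small preview of what such a rule can look like, here is a hedged Python sketch. The product code format used here (two uppercase letters, a dash, four digits) is purely hypothetical and is not taken from any real catalogue; the point is only that a single pattern decides whether a string belongs to the set or not.

```python
import re

# Hypothetical format: two uppercase letters, a dash, then four digits (e.g. "AB-1234").
PRODUCT_CODE = re.compile(r"[A-Z]{2}-[0-9]{4}")


def is_valid_product_code(code):
    """Return True if the whole string matches the hypothetical product code format."""
    return PRODUCT_CODE.fullmatch(code) is not None


for code in ["AB-1234", "ab-1234", "AB-12345", "unknown"]:
    print(code, is_valid_product_code(code))
```

Using `fullmatch` rather than `search` matters here: the whole string has to fit the pattern, so a code with an extra trailing digit is rejected.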