Dirty Data
January 26, 2011 Leave a comment
In the business world, we get data in all kinds of formats and ways. Often times this data is what I call dirty and in need of cleaning. The most common culprits can be easily fixed. Below are some of them and my most oft-used remedies.
Weird space at the end (or beginning) of text in field
Sometimes I download data, copy and paste it from another source, or get it directly from a database. And, for some reason, there are fields or cells with spaces at the end. My experience tells me that this is usually one of two things. Either somehow (often data entry error) a person typed a space at the end or beginning on accident. Or non printable characters (such as carriage returns or the tab key) have been appended. The latter often happens when converting a tab delimited file.
To remedy, I combine two functions:
Clean: Removes all nonprintable characters from text
Trim: Removes all spaces from a text string except for single spaces between words
Combining the two: =Clean(Trim(textString))
Unwanted characters in fields
Adding data validation to fields is not hard to do. I’ll post on that later as it relates to Excel and Access. But, in the meantime, we’ll always get data that is not standardized and often dirty. (There’s that word again). Some examples: phone numbers formatted in several different ways (with dashes, periods, parenthesis, etc.); Social Security Numbers (some with dashes, some without); Birth Dates (using different formats). An easy way to remove unwanted characters, when you know what those characters are, is to use the Substitute() function. For example, if cell A1 contains “888-555-1212″, you could remove the dashes with SUBSTITUTE(A1,”-“,””). If, instead, A1 contains “(888) 555-1212″, you could remove all non numeric values with nested (a function within a function) Substitute functions:
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A1,”-“,””),”(“,””),”)”,””),” “,””)
YUCK! Notice how you have to do a separate substitute function for every unwanted character. Very cumbersome. There is another way, if you are able to get over the fear of using VBA (Visual Basic for Applications; See an earlier post for a quick note on overcoming that fear).
Using VBA to include only the values you want.
This one is pretty simple. The code can look confusing, but you can find examples and help like this easily on the web. The world is eager to help you if you just ask. By writing this custom function, you’ll be able to easily clean any data so that it includes only the values you want. The example provided filters to include only alphanumeric characters, but you could add or delete any that you wish. Once done, you can simply use this new CleanString function [CleanString(stringToBeCleaned)] just like you would any of the built-in ones Excel provides.