Dirty Data Again

It happened again.  I was working on a project for which I needed a quick list of all of the 50 states and their abbreviations.  I probably had it somewhere— I’m sure I’ve needed this information before—but I was too lazy to look for it.  If I was in 7th grade, I could probably have just typed them in order from memory.  But I’m not and I can’t.  So I did what any half-way smart person would do.  I googled “State Abbreviations”.  The result returned a lot of sites, as you’d expect.  I decided on using the one for our good old United States Postal Service: http://www.usps.com/ncsc/lookups/abbr_state.txt.

Just copy and paste into Excel and I’m good to go, right?  Wrong.  As is often the case, the data did not come wrapped in a nice little bow ready to use.  It is what I call “Dirty Data”.  Luckily, there is ALWAYS a way to easily— and usually quickly— clean it up.

When you copy and paste the data into Excel, you find that the name of the state and its abbreviation fall into the same cell.  And to make things worse, there is no real uniformity to it.  If they all had the same number of spaces between the name and abbreviation that might be helpful.  If there was a comma or other common character, I could work with that, too.  But I wasn’t so lucky.

But here’s how I cleaned it up.

First, I removed all the extra spaces— any space other than single spaces between words— using the Trim() function. This solved part of the problem.

Normally I’d then find the space in the cell, take the left part as the name of the state and the right as the abbreviation. For “California CA”, “Nevada NV”, and “Mississippi MS”; that would work. But how about “New Mexico NM” or “New York NY”? That won’t work.

So, like I do a lot, I did a little research online and found some possible solutions. Then, so I fully understood how to accomplish the task (in case I had a similar problem in the future) I worked my way into the problem in small pieces:

  1. First, I need to figure out if it’s a state with one or two spaces. I need to count how many spaces are in the string (the trimmed string). I do this by first removing all of the spaces from the string [substitute(string,” “,””)] then taking the length of that new string without spaces [len(substitute(string,” “,””))].
  2. I then subtract the length of that new string without spaces from the length of the original string ([len(string)]
  3. The result of this subtraction gives me the number of spaces in the string— state and state abbreviation; this total also equals the last occurrence of a space in the string.

Here’s where it gets a little tricky…

  1. Now that we know where the last space is, we replace that with any non alpha character other than a space. We do this to differentiate it from the other space if there is one. For Example, if we used the dollar sign as our charcter,”New York NY” would become “New York$NY”. We do this by using the Substitute() function again. But this time we take advantage of the optional fourth parameter—instance number. After we know how many spaces there are, we replace the last one with a dollar sign ($):SUBSTITUTE(String,” “,”$”,LEN(String)-LEN(SUBSTITUTE(String,” “,””)))
  2. Now, if we just find where the dollar sign ($) is, we can determine the space that position that separates the full name from the state abbreviation: FIND(“$”,SUBSTITUTE(String,” “,”$”,LEN(String)-LEN(SUBSTITUTE(String,” “,””)))) Let’s call this formula DividePosition for simplicity.
  3. And once we know where the line of separation is, we can take everything to the left of it, [left(string,DividePosition-1)], as the full name and everything to the right, [right(string, len(string)-DividePosition), as the abbreviation.
  4. Once we’ve done all of the dirty work, we can copy and paste values.
  5. ..then roll our sleeves back down J

Don’t worry, I’ve included a spreadsheet (Converting_Dirty_Data_StateAbbrev) to help you understand. And as always, feel free to drop me a line if you have questions.

Advertisement

About bbluford
I am an executive finance professional with a love for process and application development (MS Access, Excel, Quickbooks), mostly as it relates to Accounting and Business Functions. I also love to write and share ideas with other people in this world. I'm an admitted Gym Rat who works out excessively. The best summation of me is that I love to teach and to learn.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: