Geocoding, Data Quality and ETL

By Hal Reid

This month's Location Intelligence Magazine explores three inter-related areas: geocoding, data quality and ETL (extract, transform, load). Together they help us understand where things are, in terms of the enterprise systems for assets, customers and business development.

In data-driven systems, there is a perpetual quest for clean data. We want data that is correct and stays that way, data that we can reliably use to understand, to make decisions and even to feel confident using as a mailing list.

But data quality and usability don't happen by chance. It is interesting to look at an example of something we take for granted today, something that just a few years ago was pretty imprecise or almost non-existent: accurate geocoding. How about finding a route from your house to your friends' new house anyplace in the U.S. or Western Europe? It is pretty simple today, either via the Web or with your inexpensive desktop mapping program. But putting that infrastructure in place so it works first time, every time, was not trivial.

Several years ago, Matt Jaro, one of the world's experts on geocoding, was telling me about some of the issues in geocoding: things like the use of Soundex (it sounds like...) and reverse Soundex (it doesn't sound like...but really is...), non-linear address ranges and the problems of parsing an address. For example, 123 Main Street is pretty straightforward, and so is 123 Main St., but as strings they are not the same: St versus Street. How about Maine Street or Mane Street? Note that Main and Maine sound the same, and so does Mane. Hmmm, which one is correct, and how do we know which one is right? Then there is Sherlock Holmes's address, 221B Baker St. Is the B a phonetic or part of the address?
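To make the "it sounds like" idea concrete, here is a minimal sketch of the classic American Soundex code in Python. It is an illustration only, not the method any particular geocoder uses; the function name `soundex` and the table `CODES` are names chosen for this example.

```python
# Minimal sketch of American Soundex: first letter plus three digits.
# Names here (CODES, soundex) are illustrative, not from any library.
CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or "" if there are no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")  # the first letter's code also counts
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


if __name__ == "__main__":
    for word in ("Main", "Maine", "Mane", "St", "Street"):
        print(word, soundex(word))
```

Main, Maine and Mane all map to the same code, which is what lets a matcher treat them as candidates for one another. It does not say which spelling is right, and it does nothing for abbreviations: St and Street get different codes.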

In Japan, building numbers are often determined by when the building was built, not where it sits on the street. So #1 Honda St is not necessarily next to, or even across the street from, #2 Honda St. A common worldwide problem is the abbreviations used in addresses; another is simply the misspelling of the address. You can see that getting a good geocode is not trivial. With geocoding, the problem is both with the original data and the end user.
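The abbreviation and parsing problems can be sketched in a few lines of Python. This is a deliberately small illustration under stated assumptions: the `SUFFIXES` table, the regular expression and the function `parse_address` are all hypothetical names for this example, the table covers only three street types, and real address standardizers handle far more cases.

```python
import re

# Hypothetical, tiny lookup of street-type spellings; real tables are much larger.
SUFFIXES = {
    "st": "STREET", "street": "STREET",
    "ave": "AVENUE", "avenue": "AVENUE",
    "rd": "ROAD", "road": "ROAD",
}

# House number, an optional letter attached directly to it, then the rest.
ADDRESS_RE = re.compile(r"^\s*(\d+)([A-Za-z]?)\s+(.+?)\s*$")


def parse_address(raw: str):
    """Split a simple street address into parts; return None if it does not fit."""
    match = ADDRESS_RE.match(raw)
    if match is None:
        return None
    number, unit, rest = match.groups()
    words = rest.replace(".", "").replace(",", "").split()
    if not words:
        return None
    suffix = SUFFIXES.get(words[-1].lower(), "")
    name_words = words[:-1] if suffix else words
    return {
        "number": int(number),
        "unit": unit.upper(),  # assumption: an attached letter is part of the number
        "street": " ".join(w.upper() for w in name_words),
        "suffix": suffix,
    }


if __name__ == "__main__":
    for raw in ("123 Main Street", "123 Main St.", "221B Baker St"):
        print(raw, "->", parse_address(raw))
```

After normalization, "123 Main Street" and "123 Main St." produce the same parts, so they can be compared as equal. Treating a letter attached to the house number as part of the number is a design choice for this sketch; whether that is correct for a given data set is exactly the kind of question the parsing problem raises.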

The quest for good, clean, accurate data is not just for geocoding but for all of the other uses of clean data. The pursuit of clean data is not that different from other great searches and all the questions they raised. Columbus didn't have good data and really didn't find India, but believed he did. Cortez had the same problem and without local help would still be stuck on the beach. Bad data initiates the search for good data, but only once it is discovered.

The questions for geocoding, data quality and ultimately ETL are these: If we don't know the data is bad, is good simply relative? Does good data remain good when transformed? And is the quest for good data as important as the data itself?

For the answers to these and other questions, explore this issue of Location Intelligence Magazine.

Published Thursday, October 27th, 2005



© 2016 Directions Media. All Rights Reserved.