The Evolution of Geocoding: Moving Away from Conflation Confliction to Best Match

Over the past decade, the quality of street-level datasets has improved tremendously. Advancements include more complete street geometry - exact location and shape, and more complete and correct street-segment attribution - street names and aliases, house number ranges, etc. Now, real estate parcel data have arrived on the scene with a huge impact on applications that require the most accurate location possible. PXPoint from Proxix offers users some options they didn't have before, and the end result may be a "best match," based on industry and application. This article provides background about how geocoding has evolved, and why PXPoint is a good solution at this point in that evolution.

Geocoding Evolution
Street-level geocoding in the United States became available in the early '90's with the release of the Census Bureau's TIGER street network data files (and even earlier, with the Census Bureau's DIME files, though these were difficult to use). At that time, the founders of Proxix Solutions (the company was then called QMSoft) introduced key industry firsts including national geocoding from a single CD and a single dataset. Before that, users often had to geocode address sets one county at a time! Another Proxix first was the concept of conflating (combining) the street-level spatial data with the U.S. Postal Service (USPS) dataset - allowing Proxix to create the very first Coding Accuracy and Support System (CASS)-certified geocoder. (More information about the CASS system is available from the USPS.) Why was this conflation necessary at the time?

When the first TIGER files for Washington D.C. were produced (truly an impressive milestone), inspection showed that significant work remained. For example, attribution across segments of a street (one block to the next) might differ widely. Inconsistencies appeared in street names - suffixes (Ave., Dr., etc.) were dropped, directionals (S, NW, etc.) were incorrect or missing, and house number ranges might be missing for segments at the end of a road and sometimes even from the middle of a road!

In 1993 Proxix founders developed a method to improve the attribution quality of these street-level datasets by conflating the USPS data and street data. This preprocessing corrected missing or wrong attributions on many road segments, while introducing relatively few new errors. Such introduced errors - then a necessary evil - came from the statistical approximation used when comparing addresses from the two data sources. With monthly or quarterly dataset updates, each release held the likelihood of some error introduction. Today, the attribution completeness and quality in data from companies like NAVTEQ, and even the Census TIGER files themselves, has vastly improved. These improvements have eliminated the need for conflation - and with that, the damage caused by its introduction of new errors. Moreover, conflation may actually introduce more errors than it fixes.

(Click for larger image)

Conflation's Faults - Introduced Errors, Loss of Data
Figure 1 illustrates the problem with the conflation preprocessing. When comparing a street-level dataset of addresses to the USPS database, numerous addresses will match "closely" (shown in yellow in Figure 1). Here, a "match" is a statistical determination based on analysis of a) the actual components of any two addresses, and b) consideration of alternative addresses which are similar in some respect. Yet, numerous addresses (a differing set with every new data source update) will not match acceptably. The problem with conflation revolves around how these cases are handled.

With conflation, a bias must be applied toward one of the data sources - usually toward the USPS database. Thus, data in the street-level source will be replaced with data from the USPS source and will be, from that point forward, lost for all geocoding tasks. Every use of a conflation-model geocoder is then subject to this original bias in the data preprocessing step.

Of course, there is a need to work with the most "complete" data possible for superior geocoding and address standardization. In the past, conflation addressed dataset shortcomings by attempting to provide a "superset" of information from two sources. Today, a much better approach is to retain the original attribution for all data sources (no longer limited to just two – it can now include those parcel datasets), and derive best match answers through comparison of all of the sources simultaneously.

With this approach, the definition of what's "best" can be driven by the specific application domain rather than being determined a priori (ahead of time) from a one-size-fits-all conflation process without regard for industry and application requirements.

Why Is Complete Data So Important?
When an address is submitted for geocoding (or address matching in any way), the matching algorithms identify a set of "possible" matches for more comparison. Many times, submitted addresses are missing (or contain misspelled) components - street names, suffixes or directionals. Algorithms assume that these errors may be present, and thus will consider variations of the submitted address.

Consider the following example (assume city and state are matches).

(Click for larger image)

Using a USPS-biased conflated data source, most matching systems will accept "Hartland Rd" as a reasonable "correction" for "Harpland Rd" (missing from the USPS database). It's a good approximation because there is but a single character difference in the street name. However, by maintaining data sources in pure form through the geocoding step, you can provide a match to "Harpland Dr" and avoid an incorrect match to "Hartland Rd". Here, you can give more weight to the provided street name - Harpland - than to the provided street suffix – Rd (which at times is incorrect, or perhaps absent altogether). This important determination avoids a situation referred to as a false positive.

Why would Harpland Dr be missing from the USPS database? Probably because it's an address that does not receive mail service. The rule is: If the USPS does not go there, it is not in its database. It is not uncommon for residential addresses to receive mail only through a PO Box. Here's a second example: electric utility service might include both a billing address and a service address. The service delivery point may not receive mail - it could be a warehouse, a thoroughbred horse barn, a utility shed, pump station, even an oil well! If you need to route to such a service address, the USPS data will not help. It could even give the wrong location.

Address Standardization and the CASS Confusion
Often, the goal output from a list of addresses is both a list of x,y coordinates and a CASS-certified list of those addresses. PxPoint will do that. PxPoint uses all street name aliases provided by the USPS and spatial vendors for enhanced matching.

Some geocoding systems indicate that they are CASS-certified, or provide CASS-level responses. CASS stands for Coding Accuracy Support System and goes beyond address standardization - its primary goal is to secure postal discounts for mailers. To qualify for these discounts, a distinct list of addresses must be processed and a report included with the mailing to indicate the results for those addresses. CASS may be important for those sending mail via the USPS.

However, CASS has an important drawback for organizations geocoding for alternative carrier delivery, routing, mapping, assessing insurance risk, or determining applicable tax authorities. Why? As mentioned above, not all addresses are USPS-deliverable. By geocoding in CASS mode, non-deliverable addresses cannot be geocoded and will be rejected. Any warehouses, commercial properties, utility service boxes, oil wells and even those residences not visited by a USPS carrier (mail is delivered to a PO Box) will not be found in the geocoding step.

Do You Trust Your House Numbers?
One method utilized by address matching systems to increase speed is to use house numbers as a filter to reduce the possible matches being considered. After all, a numeric comparison is very quick, and if the house number is not in range, match candidates can be discarded.

However, similar to the issue of geocoding incomplete/incorrect addresses (missing or wrong directionals and suffixes), house number filtering can lead to false positives when the provided house number is incorrect - for example, if it has transposed digits.

Consider this example.

(Click for larger image)

Many geocoding systems will match the input address to "Forest Trl NW" because the house number fits and the street name "match" is not too bad. PxPoint, however, will identify that "Forest Dr" is an exact match, and the likely error is a simple transposition of the house number (1540 should possibly be 5140). PxPoint will return a "street centroid" geocode - it will not assume the transposition - but the user will have the information necessary to accept or correct the match and resulting geocode.

Each data source holds distinct content and provides its own strengths for various location requirements - this is not a one-size-fits-all industry. PxPoint's ability to deal individually and collectively with various geocoding data provides a great benefit. sources.