July 11, 2007
Over the past decade, the quality of street-level
datasets has improved tremendously. Advancements include more complete
street geometry - exact location and shape, and more complete and
correct street-segment attribution - street names and aliases, house
number ranges, etc. Now, real estate parcel data have arrived on the
scene with a huge impact on applications that require the most accurate
location possible. PXPoint from Proxix offers users some options they
didn't have before, and the end result may be a "best match," based on
industry and application. This article provides background about how
geocoding has evolved, and why PXPoint is a good solution at this point
in that evolution.
Geocoding Evolution
Street-level geocoding in the United States became available in the
early '90's with the release of the Census Bureau's TIGER street
network data files (and even earlier, with the Census Bureau's DIME
files, though these were difficult to use). At that time, the founders
of Proxix Solutions (the company was then called QMSoft) introduced key
industry firsts including national geocoding from a single CD and a
single dataset. Before that, users often had to geocode address sets
one county at a time! Another Proxix first was the concept of
conflating (combining) the street-level spatial data with the U.S.
Postal Service (USPS) dataset - allowing Proxix to create the very
first Coding Accuracy and Support System (CASS)-certified geocoder.
(More information about the CASS system is available
from the USPS.)
Why was this conflation necessary at the time?
When the first TIGER files for Washington D.C. were produced (truly an
impressive milestone), inspection showed that significant work
remained. For example, attribution across segments of a street (one
block to the next) might differ widely. Inconsistencies appeared in
street names - suffixes (Ave., Dr., etc.) were dropped, directionals
(S, NW, etc.) were incorrect or missing, and house number ranges might
be missing for segments at the end of a road and sometimes even from
the middle of a road!
In 1993 Proxix founders developed a method to improve the attribution
quality of these street-level datasets by conflating the USPS data and
street data. This preprocessing corrected missing or wrong attributions
on many road segments, while introducing relatively few new errors.
Such introduced errors - then a necessary evil - came from the
statistical approximation used when comparing addresses from the two
data sources. With monthly or quarterly dataset updates, each release
held the likelihood of some error introduction. Today, the attribution
completeness and quality in data from companies like NAVTEQ, and even
the Census TIGER files themselves, has vastly improved. These
improvements have eliminated the need for conflation - and with that,
the damage caused by its introduction of new errors. Moreover,
conflation may actually introduce more errors than it fixes.

Conflation's Faults - Introduced Errors, Loss of Data
Figure 1 illustrates the problem with the conflation preprocessing.
When comparing a street-level dataset of addresses to the USPS
database, numerous addresses will match "closely" (shown in yellow in
Figure 1). Here, a "match" is a statistical determination based on
analysis of a) the actual components of any two addresses, and b)
consideration of alternative addresses which are similar in some
respect. Yet, numerous addresses (a differing set with every new data
source update) will not match acceptably. The problem with conflation
revolves around how these cases are handled.
With conflation, a bias must be applied toward one of the data sources
- usually toward the USPS database. Thus, data in the street-level
source will be replaced with data from the USPS source and will be,
from that point forward, lost for all geocoding tasks. Every use of a
conflation-model geocoder is then subject to this original bias in the
data preprocessing step.
Of course, there is a need to work with the most "complete" data
possible for superior geocoding and address standardization. In the
past, conflation addressed dataset shortcomings by attempting to
provide a "superset" of information from two sources. Today, a much
better approach is to retain the original attribution for all data
sources (no longer limited to just two – it can now include those
parcel datasets), and derive best match answers through comparison of
all of the sources simultaneously.
With this approach, the definition of what's "best" can be driven by
the specific application domain rather than being determined a priori
(ahead of time) from a one-size-fits-all conflation process without
regard for industry and application requirements.
Why Is Complete Data So Important?
When an address is submitted for geocoding (or address matching in any
way), the matching algorithms identify a set of "possible" matches for
more comparison. Many times, submitted addresses are missing (or
contain misspelled) components - street names, suffixes or
directionals. Algorithms assume that these errors may be present, and
thus will consider variations of the submitted address.
Consider the following example (assume city and state are
matches).
![]()
Using a USPS-biased conflated data source, most matching systems will
accept "Hartland Rd" as a reasonable "correction" for "Harpland Rd"
(missing from the USPS database). It's a good approximation because
there is but a single character difference in the street name. However,
by maintaining data sources in pure form through the geocoding step,
you can provide a match to "Harpland Dr" and avoid an incorrect match
to "Hartland Rd". Here, you can give more weight to the provided street
name - Harpland - than to the provided street suffix – Rd (which at
times is incorrect, or perhaps absent altogether). This important
determination avoids a situation referred to as a false positive.
Why would Harpland Dr be missing from the USPS database? Probably
because it's an address that does not receive mail service. The rule
is: If the USPS does not go there, it is not in its database. It is not
uncommon for residential addresses to receive mail only through a PO
Box. Here's a second example: electric utility service might include
both a billing address and a service address. The service delivery
point may not receive mail - it could be a warehouse, a thoroughbred
horse barn, a utility shed, pump station, even an oil well! If you need
to route to such a service address, the USPS data will not help. It
could even give the wrong location.
Address Standardization and the CASS Confusion
Often, the goal output from a list of addresses is both a list of x,y
coordinates and a CASS-certified list of those addresses. PxPoint will
do that. PxPoint uses all street name aliases provided by the USPS and
spatial vendors for enhanced matching.
Some geocoding systems indicate that they are CASS-certified, or
provide CASS-level responses. CASS stands for Coding Accuracy Support
System and goes beyond address standardization - its primary goal is to
secure postal discounts for mailers. To qualify for these discounts, a
distinct list of addresses must be processed and a report included with
the mailing to indicate the results for those addresses. CASS may be
important for those sending mail via the USPS.
However, CASS has an important drawback for organizations geocoding for
alternative carrier delivery, routing, mapping, assessing insurance
risk, or determining applicable tax authorities. Why? As mentioned
above, not all addresses are USPS-deliverable. By geocoding in CASS
mode, non-deliverable addresses cannot be geocoded and will be
rejected. Any warehouses, commercial properties, utility service boxes,
oil wells and even those residences not visited by a USPS carrier (mail
is delivered to a PO Box) will not be found in the geocoding step.
Do You Trust Your House Numbers?
One method utilized by address matching systems to increase speed is to
use house numbers as a filter to reduce the possible matches being
considered. After all, a numeric comparison is very quick, and if the
house number is not in range, match candidates can be discarded.
However, similar to the issue of geocoding incomplete/incorrect
addresses (missing or wrong directionals and suffixes), house number
filtering can lead to false positives when the provided house number is
incorrect - for example, if it has transposed digits.
Consider this example.
![]()
Many geocoding systems will match the input address to "Forest Trl NW"
because the house number fits and the street name "match" is not too
bad. PxPoint, however, will identify that "Forest Dr" is an exact
match, and the likely error is a simple transposition of the house
number (1540 should possibly be 5140). PxPoint will return a "street
centroid" geocode - it will not assume the transposition - but the user
will have the information necessary to accept or correct the match and
resulting geocode.
Each data source holds distinct content and provides its own strengths
for various location requirements - this is not a one-size-fits-all
industry. PxPoint's ability to deal individually and collectively with
various geocoding data provides a great benefit. sources.
|
Your Comments Post a comment All comments provided in this section are those of the individual who has created the post. These are not the opinions of Directions Media, its editors, staff or owners unless otherwise noted. Directions Media retains the right to edit or delete any comments posted herein.
|
|
||||||
| Wow, I hope Directions got paid for this advertisement er... article. |
||||||








