ROI Case for Geospatial Data Quality Management - Accurate Data and Automated Decisions

By Steven Ramage

Most organisations spend many years collecting spatial data and integrating or conflating (merging together) their own data with third-party reference data. Examples of third-party data in use include counties in the USA combining TIGER data with their own asset data, and local authorities in Great Britain using Land-Line or OS MasterMap in conjunction with gazetteer information.

Many organisations invest significant sums in collecting geospatial data. The demand of the 2000s is for joined-up decision-making. In a geospatial context, geometric data quality deserves as much attention as alphanumeric data quality.

One organisation we interviewed recently had spent over £1 million on geospatial data over the last decade, yet currently all it can do with the data is basic thematic analysis; the data are not working to their full potential. This is because the true value is realised when data sets are successfully conflated and attribution takes place. (Here's an example - a building in the estates management application is known as "Town Hall" whilst the same building in the English Heritage application is a Grade 1 listed building.) Creating this attribution from internal and external registries is equivalent to mainstream IT data mining. In order to achieve this level of joined-up information, a geospatial data quality discipline is required. In particular, the metadata and the establishment of provenance are paramount.

It is not only spatial data that are affected by data quality issues. In an issue of Computing published in 2002, Forrester commented that firms seeking ROI on CRM faced: "...the stark reality of languishing applications, lousy data and dysfunctional organisations." More recently, Oracle Magazine (March/April 2005) cited a report by the Data Warehousing Institute, which claimed that US businesses were losing more than $600 billion each year due to data quality problems.

I think my data is accurate...but what if it isn't?
Let's take the example of organisations that use land parcel boundaries to determine agricultural subsidies. What is the impact of incorrect boundary or ownership information?
  • Financial - overpayment or underpayment of subsidies leading ultimately to fines
  • Legal - fraudulent claims against land parcels that are not rightfully owned by the claimant, or legal recourse for claimants who are underpaid
  • Administrative - overhead costs in terms of people and time that will have to be allocated to reconciling conflicting claims
  • Technical - why has this error arisen in the first place? Software problems?
  • Lower customer satisfaction - underpaid claimants will be dissatisfied
  • Poor decisions - may lead to political embarrassment
  • Long-term business strategy affected - decisions and forecasts continue to be based on poor quality data
More importantly, PIRA (Commercial Exploitation of Europe's Public Sector Information, 2000) estimated that, by 1999, it would cost the EC countries (as they were then) €36bn to replace their geographical information assets. This amount was estimated to be growing at €4.5bn per annum. Similar costs for the US were estimated at $375bn, with growth of $10bn per annum. These figures will almost certainly have accelerated after 9/11 with the focus on Homeland Security. In triage terms, guaranteeing the accuracy of the spatial data is paramount.

But my data is accurate!
Are you sure? Data accuracy has to be seen in context. Most business data has been captured against national mapping agency reference data and has been represented as a map, normally for visual interpretation. The human brain is excellent at adjusting information so that it makes sense, so that it is 'fit for purpose'. Encoding such rules and capabilities in computer systems, so that data can be automatically processed in an advanced decision-making system, is a different task. Data accuracy cannot be found wanting when the decision-maker is remote from the data creator.

Indeed, working together with customers and partners, Laser-Scan has reviewed thousands of geospatial data sets that have been collected over the last 15-20 years, and data quality is an issue with most. This comes as a shock to users, since they have invested heavily in collecting the data, as we have seen. Often, organisations are convinced that their data is clean and accurate, but analysis has revealed that as much as 75% of the data is riddled with errors. What has caused these errors? How do they remain undetected? How can organisations avoid them?

The root of the problem
There are many reasons why data may not be conforming to the standards expected.
  • The use of GPS has introduced a more rigorous accuracy for the reference data set
  • GIS installations are only just beginning to understand the importance of metadata and configuration management
  • Lack of understanding that data may be inaccurate
  • Data duplication - the same dataset is held by different departments but is rarely synchronised
  • Conflation is only just becoming widespread in response to joined-up government initiatives
But perhaps one of the most important reasons is that rules are defined implicitly rather than explicitly. A spatial representation of business rules might be as shown in the figures below.

Figure 1. "Airport" feature must be within 4km of its "Service Point" feature. Source: Laser-Scan

Figure 2. A series of features defined as "Motorway" with attribute "Slip Road" (exit ramp) must not be longer than 2km. Source: Laser-Scan

If these rules are not explicit in the database, then a breach of the rules may go undetected.
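
As an illustration of what making such rules explicit might look like, here is a minimal sketch of the two rules from Figures 1 and 2 written as automated checks. It is not Laser-Scan's implementation: the `Feature` record, the function names and the assumption that coordinates are planar and measured in metres are all hypothetical, chosen only for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Feature:
    """Hypothetical feature record; coordinates are assumed planar, in metres."""
    name: str
    kind: str
    coords: List[Point]


def distance(a: Point, b: Point) -> float:
    """Straight-line distance between two points."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


def line_length(coords: List[Point]) -> float:
    """Total length of a line, summed segment by segment."""
    return sum(distance(p, q) for p, q in zip(coords, coords[1:]))


def airport_near_service_point(airport: Feature, service_point: Feature,
                               max_m: float = 4000.0) -> bool:
    """Rule from Figure 1: an airport must be within 4km of its service point."""
    return distance(airport.coords[0], service_point.coords[0]) <= max_m


def slip_road_short_enough(slip_road: Feature, max_m: float = 2000.0) -> bool:
    """Rule from Figure 2: a slip road must not be longer than 2km."""
    return line_length(slip_road.coords) <= max_m


def audit(airport: Feature, service_point: Feature,
          slip_roads: List[Feature]) -> List[str]:
    """Return a message for every rule that is breached."""
    breaches = []
    if not airport_near_service_point(airport, service_point):
        breaches.append(f"{airport.name}: more than 4km from {service_point.name}")
    for road in slip_roads:
        if not slip_road_short_enough(road):
            breaches.append(f"{road.name}: slip road longer than 2km")
    return breaches
```

Once a rule is written down in this way, every feature can be tested against it on each update, rather than relying on someone noticing the breach on a map.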

Until recently, data quality software was mainly found in marketing departments, where it was used to tidy up mailing lists. Now organisations in the geospatial industry are tackling address geocoding issues head on and are starting to examine the impact of logical consistency and data errors on business processes.

So what do we mean by spatial data errors?
A whole range of errors can creep into a geospatial database over time - and many of them are very difficult to detect. In addition to the usual alphanumeric miscoding, geospatial databases contain geometric errors: overshoots; undershoots; slivers; kick-backs; ticks; spikes; small loops and short lines.

Figure 3. Radius Topology uses an angle and length parameter to automatically remove spikes. Source: Laser-Scan
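
To make the idea of an angle and a length parameter concrete, the sketch below shows one plausible way to remove spikes from a line. It is an assumption-laden illustration, not the Radius Topology algorithm: the function names, the default thresholds and the definition of a spike (a vertex where the line doubles back at a very narrow angle and both adjoining segments are short) are all hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def interior_angle(prev: Point, vertex: Point, nxt: Point) -> float:
    """Angle in degrees at `vertex` between the segments to `prev` and `nxt`."""
    ax, ay = prev[0] - vertex[0], prev[1] - vertex[1]
    bx, by = nxt[0] - vertex[0], nxt[1] - vertex[1]
    na, nb = math.hypot(ax, ay), math.hypot(bx, by)
    if na == 0.0 or nb == 0.0:
        return 0.0  # a repeated point is treated as a zero-angle turn
    cos_t = max(-1.0, min(1.0, (ax * bx + ay * by) / (na * nb)))
    return math.degrees(math.acos(cos_t))


def remove_spikes(coords: List[Point], max_angle_deg: float = 10.0,
                  max_length: float = 1.0) -> List[Point]:
    """Drop interior vertices that form a narrow, short spike.

    A vertex is removed when the angle at it is at most `max_angle_deg`
    and both adjoining segments are no longer than `max_length`.
    """
    if len(coords) < 3:
        return list(coords)
    cleaned = [coords[0]]
    for i in range(1, len(coords) - 1):
        prev, vertex, nxt = cleaned[-1], coords[i], coords[i + 1]
        angle = interior_angle(prev, vertex, nxt)
        longest_leg = max(
            math.hypot(vertex[0] - prev[0], vertex[1] - prev[1]),
            math.hypot(nxt[0] - vertex[0], nxt[1] - vertex[1]),
        )
        if angle <= max_angle_deg and longest_leg <= max_length:
            continue  # spike: skip this vertex
        cleaned.append(vertex)
    cleaned.append(coords[-1])
    return cleaned
```

The length threshold matters as much as the angle: a long road that genuinely doubles back at a sharp angle is a real feature, whereas a sub-metre excursion at the same angle is almost certainly a digitising error.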

How can I get rid of these data errors?
Before any programme of data cleaning starts, it is essential to know exactly what needs to be done - an audit, in other words. Once it is clear what the problems are, the cleaning process can begin.

Much of this task can be automated, with the majority of the geometric errors corrected automatically according to business rules set by the user. The rest of the errors are highlighted for manual correction. Topology plays a key role in this task.

How can I avoid these errors in the future?
Once data is clean, it is important to follow up the investment by maintaining its quality and preventing the occurrence of new geometric errors.

Case studies in Managing Geospatial Data Quality

Improving data quality at London Borough of Enfield, UK
London Borough of Enfield (LBE) wanted to centralise its business and spatial data as part of its preparations for meeting e-government targets, but the data held too many inconsistencies for a seamless merge. Following a clean-up process, the data now matches OS MasterMap's Topography layer and is ready to be shared across the enterprise (see Figure 4).

Figure 4. Data Cleansing in London Borough of Enfield. Source: Laser-Scan

Interoperability at Staffordshire County Council, UK

Staffordshire County Council (SCC) has spent significant sums on geospatial data in the last 10 years, but has found that the data is underutilised. A solution was needed to increase operational efficiency through data sharing. Laser-Scan implemented a single, central Oracle 9i database to store OS MasterMap, making use of Laser-Scan's Radius Topology to keep asset data clean and correctly aligned with the OS MasterMap data. Business and spatial data are now linked and available for sharing across the organisation via an intranet; this enables SCC to perform speedy business analysis, such as identifying asbestos sites in the region. SCC can now rely on its data to yield better decisions, while benefiting from time savings and improved productivity.

Cutting the cleaning bill by 75%
A large mapping agency recently estimated that it takes 12.5 man years to clean the geometric errors found in a single year's data collection effort from its external suppliers. It was keen to identify ways of reducing this cost and protecting the data from incurring errors in the future. Since the data were destined to be used for map products, the strictest accuracy was essential.

A feasibility study was undertaken involving the integration of Laser-Scan's Radius Topology with Intergraph's GeoMedia Pro, which was already in use by the organisation. Radius Topology works behind the scenes and is invisible to the user, so no extra training was required.

The organisation was able to use its existing workflow to process the data, with the added benefit of having the majority of geometric errors (often so small as to be virtually impossible to detect) automatically corrected by Radius Topology. Any errors that were not automatically removed were highlighted for manual correction. This new solution is expected to reduce the estimated workload by 75%, replacing a 12.5 man year effort with one of roughly three man years.

Cleaning data for the future in Amsterdam

The City of Amsterdam is bringing its maps up to date with an innovative modernisation programme. The City's original data contained many typical geometric problems: spikes; kickbacks; undershoots; overshoots; gaps and slivers. As well as removing these errors, it was also vital to address less obvious problems such as exactly duplicated features. The challenge was first to detect these data errors and then to correct the majority of them automatically.
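
Exactly duplicated features are among the easier problems to detect automatically. The sketch below is a hypothetical illustration, not the method used in Amsterdam: it assumes each feature is supplied as an identifier plus a list of coordinates, and it treats two features as duplicates only when their coordinate sequences are identical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def find_exact_duplicates(
    features: Sequence[Tuple[str, List[Point]]]
) -> List[Tuple[str, str]]:
    """Return (duplicate_id, original_id) pairs for identical geometries.

    The first feature seen with a given geometry is treated as the original;
    any later feature with exactly the same coordinates is reported.
    """
    first_seen: Dict[Tuple[Point, ...], str] = {}
    duplicates = []
    for feature_id, coords in features:
        key = tuple(coords)  # hashable form of the geometry
        if key in first_seen:
            duplicates.append((feature_id, first_seen[key]))
        else:
            first_seen[key] = feature_id
    return duplicates
```

A check of this kind only catches exact copies. Near-duplicates, such as the same line digitised twice with slightly different vertices, would need a tolerance-based comparison instead.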

It is possible to automate an estimated 90 to 95% of the initial data correction. The remaining problems are highlighted for manual correction. This keeps the City's data accurate and aligned to national standards, ensuring the delivery of high quality data to end customers, both online and offline.

Published Wednesday, April 13th, 2005


© 2017 Directions Media. All Rights Reserved.