Data Quality Challenges in 2007

"Data Quality is a problem we need to address." I couldn't believe I was reading these words in Dave Sonnen's article (Emerging Issue: Spatial Data Quality), published on January 4. At 1Spatial, we had been working hard to address matters for over a year and had reached the point where we felt there was a momentum behind this issue. So much so that we had put forward for vote a draft charter for an Open Geospatial Consortium (OGC) Working Group. But Sonnen is right; there is more work to be done.

This article is aimed at bringing together a series of strands so that progress can be made on this whole data quality issue. I'll provide some historical background about a past effort that wasn't particularly successful, then look at the factors that augur success, and describe what I think the industry needs to do to make meaningful progress this time around.

Some historical background
As they say in science fiction literature, "there was a time before" - these data quality issues have been around and have been discussed. Shortly after the Digital Chart of the World (DCW) was produced in 1992, there was an effort to work on "fit for purpose" issues ("fit for purpose" refers to the notion that the data may be adequate for one use, but may not be adequate for another).

The effort addressed interoperability and how it related to using data outside its original purpose. This work took place long before the term interoperability was ever used and on a scale that was unique and will probably never be seen again. The DCW, which was designed for use in operational navigation charts, was finding its way into other arenas. Why not? The DCW was free and in the public domain, so it was inevitable that people would push to use it in ways other than those originally intended. As people did this, new issues became apparent about this exciting new resource; it was not fit for purpose. Indeed, why should a dataset created for the purposes of medium altitude en route navigation by dead reckoning visual pilotage be of any value in assessing drainage basin characteristics?

These realizations led a committed group of people to create what has become ISO 19113/19114. These ISO standards embody principles and evaluation procedures for geographic information (for a neat summary of these principles and procedures, see Chapter 15 in Spatial Data Quality, Shi et al., 2002). The opening sentence of ISO 19113 sets the tone:

'Geographic datasets are increasingly being shared, interchanged and used for purposes other than their producers' intended ones.'

They developed the standards under the auspices of ISO/TC 211 and then applied them using the DCW project as a case study. We have to thank the Scandinavians for much of this work. But in 1995, the initiative ran out of steam.

It stalled because only qualitative assessments of the quality of geographic datasets were possible. Quantitative assessments were simply not feasible for large GIS datasets, given the lack of processing power available at the time. Yet it is quantitative assessment that is really valuable for assessing logical consistency and positional accuracy. And as Jakobsson said so eloquently in Spatial Data Quality, "Combining data sets that have no quality information can be very difficult or impossible."

Success factors in 2007
Since the mid '90s, many individuals working with the OGC have done a lot of hard work on standards, creating and overseeing the Web Feature Service (WFS), Web Map Service (WMS) and Geography Markup Language (GML). Thanks to those standards, other publicly available datasets can now be "seen" and accessed. All of that data availability only makes the data quality issue more urgent.

We have service-oriented architecture (SOA), the World Wide Web Consortium (W3C) Semantic Web framework and the Web Ontology Language (OWL), but the Semantic Web Rule Language doesn't yet support our space. We now have the grunt, in the form of processing power, to get the job done and provide quantitative assessments of spatial data quality. And Google has raised our industry's profile to the point where it will hurt us all if data quality is suspect.

Places to make progress in 2007
This is a content-mad world. Billions of dollars have been spent on spatial data that are not positionally accurate in a GPS context. These data are being catalogued into Web-based registry services that various organizations around the world are building, with the idea of making the data as available as possible. Industry practitioners have access to internal ISO 9001 procedures that define internal quality regimes, but those procedures don't define external fitness for purpose. Fit for what? This is where we start. The solution, I believe, is to "fix the supply chain." Here's what I propose.

Step 1. We have to define the domain requirements. The DCW themes or INSPIRE Annex I themes are a good place to start. We don't have to start fresh. What does the user need in order to aggregate data across administrative boundaries such that the combined data set is useful in helping to make decisions about routing, flooding, development? Let's define that.

Step 2. Measures need to be put in place that can describe "quality" and those measures need to be included in the metadata layer. These measures have to be manageable. They can't be the full GML application schema. Why? A long time ago I was amazed that a French water utility group managed its global business on 12 key performance indicators - the experience opened my eyes to the fact that simple schemes work better than complicated ones. The production end of the supply chain can't be overburdened if it's to produce. When I say "production end" I'm not only referring to statutory bodies, but also to all those people creating content through KML (Keyhole Markup Language). For knowledge to grow we have to be able to aggregate content meaningfully, and the mashers of this world - many of whom use KML - are creating content faster than any other group.

This step also needs to include defining the rules expression language for creating the measures. The Semantic Web Rule Language (SWRL) isn't mature enough, but I think we in the industry can create a standard. Once we have this, rather than having to accept caveat emptor, I can make my own assessment of whether the data are fit for purpose. This has to be an automated assessment, and unlike in 1995, the tools and computing power are now available to do this. If the data are not free, I may need a 14-day free trial to make my assessment! Once I've made the assessment, I can then decide whether I am going to pay, and how much, for those data. The free vs. licensing debate becomes irrelevant.
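To make the idea of an automated fit-for-purpose assessment concrete, here is a minimal sketch in Python. It is not a proposal for the rules expression language itself. The measure names, the routing thresholds and the sample values are all invented for illustration; a real scheme would take its measures from the agreed metadata layer and its thresholds from the domain requirements defined in Step 1.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class QualityMeasures:
    """Hypothetical handful of measures published in a dataset's metadata."""
    completeness_pct: float          # share of expected features present
    positional_rmse_m: float         # root-mean-square positional error, metres
    topology_errors_per_1000: float  # logical consistency failures per 1000 features


@dataclass(frozen=True)
class Rule:
    """One named requirement that a purpose places on the measures."""
    name: str
    check: Callable[[QualityMeasures], bool]


# Assumed requirements for one purpose (routing); the thresholds are made up.
ROUTING_RULES = [
    Rule("positional accuracy within 5 m", lambda m: m.positional_rmse_m <= 5.0),
    Rule("at least 95% complete", lambda m: m.completeness_pct >= 95.0),
    Rule("under 1 topology error per 1000 features",
         lambda m: m.topology_errors_per_1000 < 1.0),
]


def assess(measures: QualityMeasures, rules: List[Rule]) -> Tuple[bool, List[str]]:
    """Return (fit_for_purpose, names of the rules that failed)."""
    failed = [rule.name for rule in rules if not rule.check(measures)]
    return (not failed, failed)


# Invented values for a small-scale dataset being considered for routing.
sample = QualityMeasures(completeness_pct=88.0,
                         positional_rmse_m=500.0,
                         topology_errors_per_1000=0.4)
fit, failed = assess(sample, ROUTING_RULES)
print("fit for routing:", fit)
for name in failed:
    print("failed:", name)
```

The same measures could be checked against a different rule set for flood modelling or planning, which is the point: the producer publishes a few measures once, and each user applies the rules for their own purpose.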

If the rules expression language can't be created fast enough, we can move to the geographer's solution, the "pseudo-quantitative expression." Following the Amazon tradition of peer group review, the user community could assess the nominal value of a spatial data set for completeness, logical consistency, and positional, temporal and thematic accuracy.
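As a rough illustration of what such a peer-reviewed, pseudo-quantitative expression might look like, the sketch below averages user ratings for each of the five quality elements. The 1-to-5 star scale, the field names and the sample reviews are assumptions of mine, not part of any existing scheme.

```python
from statistics import mean

# The five quality elements named above, as hypothetical rating keys.
ELEMENTS = (
    "completeness",
    "logical_consistency",
    "positional_accuracy",
    "temporal_accuracy",
    "thematic_accuracy",
)


def summarize_reviews(reviews):
    """Average the 1-5 ratings per element, skipping elements nobody rated."""
    summary = {}
    for element in ELEMENTS:
        scores = [review[element] for review in reviews if element in review]
        if scores:
            summary[element] = round(mean(scores), 2)
    return summary


# Invented reviews; each reviewer rates only the elements they can judge.
reviews = [
    {"completeness": 4, "positional_accuracy": 2, "logical_consistency": 5},
    {"completeness": 3, "positional_accuracy": 2},
    {"completeness": 5, "positional_accuracy": 3, "temporal_accuracy": 1},
]
summary = summarize_reviews(reviews)
for element, score in summary.items():
    print(f"{element}: {score}")
```

An element that no one has rated is left out of the summary rather than scored as zero, so a missing opinion is not mistaken for a poor one.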

Step 3. We have to ask ourselves, can the existing data sets be used? The answer has to be yes, as there is over €100 billion worth of public sector geographic data in the EU27. They can't be thrown away - they have to be used, and if possible, fixed so they are fit for purpose. This might be restated as: Can the supply chain be extended? Using automated tools, the rules described in Step 2 could be applied not just to measure and create a metadata layer, but also to fix up the data. If the fix-up process is true value-add, then a charge could be applied.
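The sketch below shows, under simplifying assumptions, how one rule could both measure and fix. I have picked a single, deliberately simple logical consistency rule (a polygon ring must end where it starts) and a plain list-of-coordinates representation; both are my own choices for illustration, and real data would need far richer rules and tooling.

```python
def is_closed(ring):
    """A valid ring has at least four points and ends where it starts."""
    return len(ring) >= 4 and ring[0] == ring[-1]


def measure_and_fix(rings):
    """Check every ring, repair the ones that can be repaired, and report.

    Returns (fixed_rings, report). The report is the kind of summary that
    could be written into the metadata layer.
    """
    fixed, repaired, remaining = [], 0, 0
    for ring in rings:
        if is_closed(ring):
            fixed.append(ring)
        elif len(ring) >= 3 and ring[0] != ring[-1]:
            # Open ring with enough points: close it by repeating the start.
            fixed.append(ring + [ring[0]])
            repaired += 1
        else:
            # Too few points to form a polygon: keep as-is and flag it.
            fixed.append(ring)
            remaining += 1
    report = {
        "checked": len(rings),
        "failed": repaired + remaining,
        "repaired": repaired,
        "remaining": remaining,
    }
    return fixed, report


# Invented sample data: one good ring, one open ring, one hopeless fragment.
rings = [
    [(0, 0), (1, 0), (1, 1), (0, 0)],
    [(0, 0), (2, 0), (2, 2)],
    [(0, 0), (1, 1)],
]
fixed, report = measure_and_fix(rings)
print(report)
```

The report separates what was repaired automatically from what still needs attention, which is also where a value-add charge could reasonably be attached.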

In any event, there is a lot of work to do on the existing public sector geographic information to make it fit for use in situations like routing, planning and, above all, emergency response. The momentum is there. All you need to do is to contribute to the OGC initiative.