Report from OGC/W3C Linking Geospatial Data Workshop

60 Second Executive Summary

A lot of the discussion across the two days of the workshop can be summarized in the paper and presentation by Frans Knibbe of Geodan. By introducing the characters of Miss Globe and Mr Cube, he posed some key questions:

how should we encode geometry?
how and where should we implement topological functions?
additional metadata is required for spatial datasets – how do we do that?
where is the software support for spatial datatypes and functions?
geometries expressed as WKT literals are large objects, whereas the Linked Data world is used to handling simple literals
how do we help developers handle (or avoid) the steep learning curve to work with Linked Data?

Introduction

The Linking Geospatial Data Workshop was billed primarily as a joint exercise between the Open Geospatial Consortium (OGC) and W3C as part of its role within the SmartOpenData project. It was supported and encouraged by the UK Government's Department for Environment, Food and Rural Affairs, the Ordnance Survey and Google Maps who hosted the event. It came about through a desire to make better use of the Web as a platform for sharing and linking Geospatial Information (GI) alongside, but not instead of, existing GI systems.

The title of the event was carefully chosen, particularly the use of the word 'linking' rather than 'linked'. The chairs did not want this to be a specifically Linked Data event; rather, it was to be an event about how different GI datasets can be linked and accessed on the Web. As it turned out, many presentations came directly from the Linked Data community, and it's clear that this community has a lot to offer in the GI field, but this was not a given before the event.

In numbers: the workshop attracted 72 papers, 106 participants (with several more on the waiting list), 38 presentations, 16 panellists and 8 bar camp pitches. A selection of tweets was archived and minutes were taken throughout the event on IRC (day 1, day 2). Ed Parsons took many photographs of the event, all of which are available; some have been used to illustrate this report, with thanks to Ed.

The outcome is also numeric: the assembled participants expressed a desire for one working group to be established by the two standards bodies working together.

Recurring Themes

The first sessions highlighted a number of themes that recurred throughout the workshop: commonly agreed sets of objects (cadastral parcels, addresses etc.) are all important but how should these be accessed via HTTP? Use of HTTP URIs as identifiers would be an enormous step forward in interoperability between different GI datasets and the Web at large. However, different GI systems and different organizations are likely to generate duplicate URIs as identifiers. How should that be handled?

Some systems use lat/long pairs plus an address, others use NUTS codes to refer to administrative units – it's not easy or obvious to know which to use and therefore different systems are used inconsistently. The workshop heard about two contrasting Australian projects, both requiring data modeling based on OGC standards. One used WDTF, a precursor to WaterML2.0, and faithfully translated UML models into OWL, but the process was time-consuming and error-prone and the result was ultimately ugly. The other used a combination of the Semantic Sensor Network Ontology and the Data Cube Ontology and followed a more Semantic Web development style, transforming CSV data to RDF while still reflecting its OGC standard origins. This proved much easier and provided a better outcome. The takeaway message: if you're going to publish Linked Data, it's easier to do the modeling with Linked Data in mind than to try to bend it to fit a UML model.

A recurring theme was that complexity is not always required although sometimes it is. For example, in many applications on the Web and elsewhere a set of coordinates defining a polygon can be treated as a single object and metadata can be applied to the whole thing. In other, rarer, cases, it may be important to be able to access metadata about each individual coordinate pair. Defining boundaries in disputed areas is one such case.

Location is important in many systems that one may not immediately think of in this context. The workshop heard from the German National Library's Lars Svensson and the BBC's Chris Henden, both of whom talked about how their resources very often are about or refer to places (think of news stories about revolutions in country X or historical documents concerning region Y).

A session of lightning talks looked at historical and current approaches to persistent identifiers. The need for these is not new, but the use of the Web and the growing use of real time sensors mean that care needs to be taken to track versions of identifiers. In many cases, a simple identifier for 'the latest version' plus dated identifiers for specific versions is sufficient (this is how W3C manages its standards, for example). However, when individual terms in a controlled vocabulary are updated or new language labels are added, the number of versions can quickly become unwieldy and hard to use, adding complexity where it's not wanted.
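The 'latest version' plus dated identifiers pattern can be sketched in a few lines of Python. This is a minimal illustration only: the base URI, the version dates and the function names are all hypothetical and do not come from any project presented at the workshop.

```python
from datetime import date

# Hypothetical base URI and publication dates, for illustration only.
BASE = "http://example.org/id/vocab/flood-zone"
VERSIONS = [date(2013, 5, 1), date(2014, 1, 15), date(2014, 3, 6)]


def dated_uri(base: str, version: date) -> str:
    """Build the identifier for one specific, dated version."""
    return f"{base}/{version.isoformat()}"


def resolve_latest(base: str, versions: list) -> str:
    """Map the undated 'latest version' identifier to the newest dated one."""
    if not versions:
        raise ValueError("no versions published")
    return dated_uri(base, max(versions))


if __name__ == "__main__":
    # The undated identifier (BASE) always points at the newest dated version.
    print(resolve_latest(BASE, VERSIONS))
```

A server following this pattern would typically redirect requests for the undated identifier to the dated one. The sketch also hints at why the approach becomes unwieldy when every term in a vocabulary carries its own list of versions.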

Heterogeneity & Volume

Gathering geospatial data, and the need to make sense of it, is not new, and the volumes of data are never small. The World Meteorological Organization and the Woods Hole Oceanographic Institution have been doing this for a long time, for example. Geonames is the de facto world gazetteer but linking to it from legacy systems requires effort. Likewise, building services from raster satellite images requires a lot of work. It's all possible, just not trivial.

The Android App Geohazard uses a particularly large number of heterogeneous data sources and Isao Kojima of AIST described the challenges of integrating static data with data from 4,000 sensors every 10 minutes in monitoring the radiation from the Fukushima power station. In DEFRA's case, the heterogeneity between official reference data and social data emerging during a crisis such as localized flooding presents particular challenges. How can these disparate sources be made more interoperable and integrated into better decision making applications? Can small companies help bridge the gap between Earth Observation data and social data? What use can governments make of crowd sourced data alongside official registers – do we need data quality flags?

Running a conversion or any other process over large amounts of data is usually best done by moving the application to where the data is, i.e. running the application in the cloud and returning only the result. Web APIs are best suited to relatively small amounts of data, but maybe we need standard methods for returning subsets of large datasets?

Several projects were presented, including the host project SmartOpenData. A common theme is the need to make disparate datasets interoperable but there is also a question of longevity. Tools may be developed and published on GitHub – fine – but what about services? Too few projects have long term business models to enable their services to survive beyond the initial funding. At one end of the scale, the TELEIOS project and its follow up, LEO, developed the Strabon platform including spatial and temporal extensions to SPARQL to deal with data that varies over time (stSPARQL). That's a major service that will require ongoing funding and/or community support if it is to survive. At the other end of the scale, an ISA Programme pilot on Belgian addresses simply used OpenRefine, a freely available and widely used tool, to process address data into RDF that was published through a Virtuoso instance … which may or may not remain online long term. Can any of these tools be used in the long term?

Where tools do exist and look set to be available long term, they may not always offer the complete suite of functions defined in the relevant specifications, and the 'optional' components are often required. Large scale open source implementations of WFS 2.0 feel like linkable data already and the move from URNs to URIs in INSPIRE helps a lot, but not all the software is there yet. There was general agreement that relational joins are difficult with WFS and that GeoSPARQL handles them better, but the right choice depends very much on the context and what you're trying to do.

Establishing an infrastructure is rarely easy. The workshop heard from the Scottish and Welsh governments who know that the desired goal is a combination of geospatial and Linked Data tools but getting there from where they are – reports in PDFs, an infrastructure built of .NET and relational data – is hard. There's a need for more tools that allow for an evolutionary approach rather than a big bang.

The Ordnance Survey pioneered the use of Linked Data for geospatial but maintaining the service depends on showing that it is being used. Server logs only tell you so much (as an aside, this issue was raised in the Open Data on the Web workshop last year and led directly to the Data on the Web Best Practices Working Group being chartered to create a data usage vocabulary). Peter Parslow pointed out that a lot of effort went into including semantics in the INSPIRE model. It is for 'the LD community' to decide how best to represent those semantics for use over the web (SKOS etc.) but the actual semantics always need to be determined by the domain experts.

Different Projections

There was a lot of talk about GeoSPARQL during the workshop but several speakers suggested that in some circumstances other approaches are easier to use. An open GeoSPARQL endpoint raises expectations that expensive queries can be handled which may or may not be true. Specific RESTful APIs are often better. For example, GeoGratis, the Web portal for Natural Resources Canada, offers an API that uses a hypermedia approach to delivery of linked geospatial data. Each resource is offered in multiple formats either via standard content negotiation or via typed links to API query resources.
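As a rough illustration of standard content negotiation, the sketch below picks one representation of a resource from an HTTP Accept header. It is a simplified reading of the mechanism, not a description of the GeoGratis API: the list of available formats and the function names are assumptions, and media-type specificity rules are ignored.

```python
# Hypothetical representations offered for one resource (an assumption).
AVAILABLE = ["application/json", "text/html", "application/rdf+xml"]


def parse_accept(header: str) -> list:
    """Return (media_type, q) pairs sorted by descending preference."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(header: str, available: list):
    """Pick the first available format the client accepts, or None."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return available[0]
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # "text/*" becomes "text/"
            for candidate in available:
                if candidate.startswith(prefix):
                    return candidate
        elif media_type in available:
            return media_type
    return None
```

A client asking for `text/turtle;q=0.9, application/json;q=0.8` would receive JSON here, since Turtle is not among the assumed formats. Typed links to format-specific resources, the alternative mentioned above, avoid this header parsing altogether.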

The big issue with GeoSPARQL is that so much information is encoded in big WKT literals: the coordinate reference system, whether it's a point, line or polygon, and then the coordinates. NeoGeo, a vocabulary developed in parallel with GeoSPARQL, does things slightly differently with different predicates for different geometries and offers content negotiation on those geometries. Being able to split out the CRS has advantages in some situations and the Datalift project offers a REST-based coordinate conversion service (primarily for WGS84 to and from Lambert93 as used by IGN in France).
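To make the point about WKT literals concrete, the following sketch splits a GeoSPARQL-style WKT literal into the three pieces of information packed into it. It assumes the common form of an optional CRS URI in angle brackets followed by the geometry type and coordinates. The function name is hypothetical, and forms such as `POINT Z (...)` or `EMPTY` geometries are deliberately not handled.

```python
import re

# CRS that GeoSPARQL assumes when a WKT literal carries no explicit CRS URI.
DEFAULT_CRS = "http://www.opengis.net/def/crs/OGC/1.3/CRS84"

# Optional "<crs-uri>" prefix, then the geometry type, then the coordinate text.
WKT_LITERAL = re.compile(
    r"^\s*(?:<(?P<crs>[^>]+)>\s*)?(?P<type>[A-Za-z]+)\s*(?P<body>\(.*\))\s*$",
    re.DOTALL,
)


def split_wkt_literal(literal: str) -> dict:
    """Separate the CRS, geometry type and coordinates of a WKT literal."""
    match = WKT_LITERAL.match(literal)
    if match is None:
        raise ValueError(f"not a recognised WKT literal: {literal!r}")
    return {
        "crs": match.group("crs") or DEFAULT_CRS,
        "geometry_type": match.group("type").upper(),
        "coordinates": match.group("body"),
    }


if __name__ == "__main__":
    example = "<http://www.opengis.net/def/crs/EPSG/0/4326> POINT(51.5 -0.1)"
    print(split_wkt_literal(example))
```

Every consumer of such a literal has to do this kind of parsing before it can even tell a point from a polygon, which is the cost that separate predicates per geometry type try to avoid.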

The discussion around these topics again raised the issue that one size doesn't fit all. Manipulating the geometry in the graph is something that may be useful when considering territorial disputes but will be much more detail than is required in other circumstances.

Historical GI

The workshop heard about several projects working with historical geographic data. For example, the Muninn project is mapping WWI trenches using the good Belgian and German maps of the day and, less so, the French maps where each 'point' in reality represented a 20 yard circle of uncertainty. The Virtual Cilicia project is working with field boundaries derived from maps of the street system in 200 BC. It's often more useful for historical data to talk about events and relative time (e.g. in the Nth year of the reign of King Joe) rather than ISO 8601 dates and this session again highlighted the geographic information engineer's view of time as a coordinate reference system.

Front End

Several talks looked at geospatial data from a developer's point of view. Linked Data provides exactly the tools needed to integrate heterogeneous geospatial data but developers find SPARQL slow to execute and/or hard to learn. The RAGLD project is powered by Linked Data but offers geospatial tools and APIs for people who don't care about the technology; Map4RDF offers map-based visualization of statistics using the Data Cube Ontology but again is for people who know and care nothing about Linked Data.

The data format of choice for developers is GeoJSON which is widely implemented and supported by many tool chains. Bert Spaan described how the CitySDK project's API returns a unified JSON format for data originating from XLS, SHP, XML, CSV, RDF and KML files and pointed to the OGC Table Join Service for linking geospatial and tabular data. (Geo)SPARQL queries can be optimized if the system knows what the data looks like and APIs that return JSON are popular. The (Geo)SPARQL endpoint should still be accessible to developers who want it, however.
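The idea of returning one JSON format for data that started life in other formats can be illustrated with a small CSV to GeoJSON conversion. This is only a sketch under stated assumptions: the sample data and the column names `name`, `lat` and `lon` are invented, and it is not the CitySDK implementation.

```python
import csv
import io
import json

# Hypothetical CSV input; the data and column names are assumptions.
CSV_TEXT = """name,lat,lon
Sensor A,51.5,-0.1
Sensor B,48.85,2.35
"""


def row_to_feature(row: dict) -> dict:
    """Turn one tabular record into a GeoJSON Feature with a Point geometry."""
    lon, lat = float(row["lon"]), float(row["lat"])
    properties = {k: v for k, v in row.items() if k not in ("lat", "lon")}
    return {
        "type": "Feature",
        # GeoJSON positions are ordered longitude first, then latitude.
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }


def csv_to_feature_collection(text: str) -> dict:
    """Wrap every CSV row as a Feature inside one FeatureCollection."""
    rows = csv.DictReader(io.StringIO(text))
    return {
        "type": "FeatureCollection",
        "features": [row_to_feature(r) for r in rows],
    }


if __name__ == "__main__":
    print(json.dumps(csv_to_feature_collection(CSV_TEXT), indent=2))
```

The same `row_to_feature` shape could be fed from spreadsheet, shapefile or RDF readers, which is what makes a single output format attractive to developers.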

Crowdsourcing

As noted earlier, integrating social data with official reference data is particularly hard. Rich Boakes presented his work on crowd sourcing data concerning incidents where car drivers put cyclists in danger. Using a Raspberry Pi, GPS and an ultrasonic sensor mounted on his bike he's able to record the location of the incident and the registration plate of the car in question. The Open Data Institute has a project to crowd source data about open spaces in urban areas – how will any of this sit alongside data that is not fully open since many maps are only published as open data down to a certain resolution?

Conclusion

The final plenary session of the workshop allowed participants to reflect on what had been said and to draw conclusions. There are many relevant standards in existence. There are other 'standards' that, although massively implemented, are not formalized which presents a problem for some government bodies: GeoJSON is top of that list.

In short: there is work to do to tidy up some existing standards and to provide guidance on how developers and publishers should proceed. The lack of coherent advice to publishers is such that a lot of data isn't published (at least in a linkable form) and where it is published it is less attractive to developers than it could be. OGC and W3C represent different communities and therefore a joint working group is required to create or recommend standards that work across those communities. OGC and W3C committed to work together towards establishing such a group.

The chair of the final session, Stuart Williams, asked each panellist what was at the top of their shopping list for such a working group:

Steve Peters: Helping me as a user to navigate through the maze of standards, to show the best way to prepare, publish and encourage reuse of data. Best practice, support, peer networking.
Keith Jeffery: Standardized terminology of metadata, taking into account the different levels of metadata required by different applications.
Alex Coley: There are lots of parts of the picture that I'd like to see brought together into a coherent whole.
John Goodwin: +1 to what's been said… agreement on best practice and the standards, tools to implement them.
Kerry Taylor: A shared spatial vocabulary that's useful and easy to do.
Raphaël Troncy: +1 to Kerry.
Bill Roberts: I think we have enough standards, but I don't know how to use them. So: design patterns where people can document how they use standards.
Jeremy Tandy: We've identified 4 geospatial vocabs for identifying geometry. Could we get it down to 1? Agree it? NeoGeo? Core Location? GeoSPARQL? W3C Geo? Can we have a vote?

The workshop ended with a bar camp which included sessions on Fuzzy URIs, proposed by Bert Spaan based on his work in the CitySDK project, and the integration of non-authoritative data in decision-making processes, proposed by Bente Lilja Bye. Jeremy Tandy led a discussion that further developed the ideas in the final session.

Reprinted from W3C, under CC-BY-NC-ND. Images and asides were omitted due to my (Adena Schutzberg's) limited HTML skills.