How can I use crowdsourced data?

A week ago I published the presentation I had given at the ENERGIC Workshop at the University of Zurich. In it, I talked about crowdsourcing and VGI (volunteered geographic information) and about using crowdsourced data in applications.

I’d like to delve deeper into one of the examples I talked about in my workshop presentation: Some weeks ago, Strava, a provider of a fitness tracking app for cyclists and joggers, published a so-called heatmap of all recorded GPS locations of its users. Similar data has been published by comparable providers such as Runkeeper.

For what purposes can we use such crowdsourced data? Which questions might it help answer? As an example, I have coded up a small web map that overlays the bicycle infrastructure of the city of Zurich (thank you, Open Data team) onto the Strava heatmap. You can explore this map below (the full-screen view can be toggled using the button on the left, below the zoom controls):

What do you think? Is this comparison of bicycle infrastructure and recorded tracks a useful application of crowdsourced data or VGI?

– I don’t know.

From a data scientist’s and traffic engineer’s (and cyclist’s) viewpoint, I very much like the idea of highlighting gaps in the city’s bicycle infrastructure using crowdsourced data! However, there are a few questions and potential problems, such as:

Most of the Strava data consists of cycle routes (77,688,848 globally), but not all of it: globally, there are also 19,660,163 recorded jogging routes.
How sure can we be that users have switched their app to the correct tracking mode (cycling vs. jogging)?
Are there users who use their fitness tracking app to record their car journeys, their motorbike excursions or their Sunday walks with the dog? If so, does Strava take steps to remove such recordings from its user data, for example by filtering based on velocity/acceleration characteristics?
How many users have contributed their data in the Zurich city area, in a neighbourhood, on a much-used/little-used part of the network?
Are there much-used routes that – counter-intuitively – have been used and recorded by only a limited (but enthusiastic) bunch of cyclists?
On the other hand: are there much-used routes that have better “democratic legitimacy”, i.e. that have been used by a large number of different people on different occasions?
If we could detect and distinguish these two kinds of routes: what would this tell us?
What kind of recording errors (e.g. insufficient GPS coverage, multiple GPS reflections off buildings or trees) might be present in the data? How would such errors influence our intended analyses?
What does the temporal distribution (currency) of the data look like? Has the majority of the data been collected over the last three years, the last year, the last six months? And how might an uneven temporal distribution influence insights from our intended analyses? (Three years ago, one estimate said that 10% of all existing photographs had been taken during the preceding 12 months! The temporally skewed distribution of Flickr photographs is a well-known fact.)
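To make one of these questions more concrete: the velocity-based filtering mentioned above could look roughly like the following Python sketch. It is only an illustration, not a description of what Strava actually does. The track format (a list of `(timestamp_s, lat, lon)` tuples), the function names and the 45 km/h threshold are all assumptions of mine.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 coordinates."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def median_speed_kmh(track):
    """Median speed between consecutive fixes of a time-sorted track.

    `track` is a list of (timestamp_s, lat, lon) tuples (assumed format).
    Returns None if no speed can be computed.
    """
    speeds = []
    for (t1, lat1, lon1), (t2, lat2, lon2) in zip(track, track[1:]):
        dt = t2 - t1
        if dt > 0:  # skip duplicate or out-of-order timestamps
            speeds.append(haversine_m(lat1, lon1, lat2, lon2) / dt * 3.6)
    return median(speeds) if speeds else None


def looks_motorised(track, max_plausible_kmh=45.0):
    """Flag a track whose median speed exceeds an assumed cycling threshold."""
    speed = median_speed_kmh(track)
    return speed is not None and speed > max_plausible_kmh


# Two synthetic tracks heading north from Zurich, one GPS fix every 10 s.
bike = [(i * 10, 47.37 + i * 0.0005, 8.54) for i in range(20)]  # about 20 km/h
car = [(i * 10, 47.37 + i * 0.0020, 8.54) for i in range(20)]   # about 80 km/h
print(looks_motorised(bike), looks_motorised(car))
```

Using the median rather than the maximum makes the check less sensitive to single GPS jumps, but a simple threshold like this would still miss slow car trips in city traffic and would flag fast downhill rides, which is exactly why it matters to know what filtering a provider has applied.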

I’ll let you add your own questions to the list. All these questions and their answers can have big, some, or no implications whatsoever, depending on the intended analysis*. I have compiled some interesting excerpts of the data that illustrate some of the above questions:

Some interesting locations in the Zurich Strava data

top-left: plausible gaps in the data in a no-cycling zone near the tram stop Sternen Oerlikon (left) and the footprint of a bicycle race track including a small “warm-up round” (right).

top-right: plausible distribution of the GPS data around the main routes and highway near the university campus (left) and a conspicuous cluster of GPS data on the Büchner square of University of Zurich-Irchel: is this a meeting point for after-work mountain bikers for exploring the nearby hills? The starting point of a group of joggers? (right)

bottom-left: comparatively few data points in the Niederdorf neighbourhood, which has no through-traffic (left), and remarkable gaps in the network near the Zurich art museum, which might be caused by the visualisation choice (right).

bottom-right: linear structures (likely GPS artefacts) in the forested area near Dolder Hotel.

These peculiarities can pose problems for certain applications. However, Strava itself has put forward an interesting application that should be relatively robust to these possible problems: matching a coarsely defined track onto the underlying road network.
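I don’t know how Strava implements this matching. As a deliberately naive illustration of the general idea, the following Python sketch snaps each point of a coarse track to the closest point on the nearest road segment. The planar coordinates in metres, the segment format and the function names are assumptions for this example; more robust map-matching approaches also take the sequence of points and the connectivity of the network into account.

```python
def project_onto_segment(p, a, b):
    """Return the point on segment a-b that is closest to p.

    All points are (x, y) tuples in a planar coordinate system (metres).
    """
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: a and b coincide
        return a
    # Relative position of the projection along the segment, clamped to [0, 1]
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def snap_track(track, segments):
    """Snap every track point to the nearest point on any road segment."""
    snapped = []
    for p in track:
        candidates = [project_onto_segment(p, a, b) for a, b in segments]
        nearest = min(
            candidates,
            key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2,
        )
        snapped.append(nearest)
    return snapped


# Hypothetical L-shaped road: 100 m to the east, then 100 m to the north.
roads = [((0, 0), (100, 0)), ((100, 0), (100, 100))]
coarse_track = [(10, 5), (50, -4), (97, 40)]
print(snap_track(coarse_track, roads))
```

Because each point is treated independently, a track running between two parallel streets could jump back and forth between them, so this sketch only shows the principle.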

In general, the following applies: Be careful when using crowdsourced data and don’t be ensnared by great visualisations. Before using crowdsourced data you should always clarify the above and similar questions with dedicated experts!

Generally, it’s advisable to keep as much control as possible over the data production and collection process. If you rely on data provided by a third party, please inform yourself about the processing steps that the data has undergone (such as sampling, filtering, removal of ‘outliers’, etc.). Only with this knowledge can the fitness for use of a particular data set be adequately judged.

In my opinion, if you observe these points, you are well on the way to successfully applying crowdsourced data!

Suggested reading: Timo Grossenbacher discusses representativity of crowdsourced data extensively in his Master’s thesis as well as in this blog post.

* Strava mentions that you might need different, more detailed data for certain analyses, and it provides anonymised raw data under its Metro brand especially for such applications.

Reprinted from the Ernst Basler + Partner geoinformation blog under CC-BY-NC-SA.