Editor's Note: GeoGig was originally called GeoGit. Why the change? Boundless VP of Marketing Rolando Peñate explained: "The reason we renamed the project GeoGig is two-fold: one is that the 'git' component of the name was flagged as a trademark concern as part of the LocationTech IP review and the other is that it doesn't actually use git as the underlying technology but is merely inspired by the core concepts of that tool. The joke around the office is that 'gig' means 'GeoGig isn't git'."
Why would anyone want different versions of data? Don’t we all want data we can trust, that doesn’t vary from copy to copy? The answer is yes, sometimes. So, when do we want different versions? We often save different versions of data for historical purposes because data changes over time. Sometimes, a project requires that you compare different versions of data that happened concurrently or analyze different views of the same information to provide perspective. Some want unique versions of the data to prove a point or to foster discussions. So, it is clear that different versions of data can be extremely helpful, but to manage these versions, we need the right tools.
Traditionally, when working with large data sets, a group of people will work with and modify one version of a shared data repository while at the same time another group works with a different version of the same data. Both may be authoritative but the data has begun to fracture. Though versioning approaches have existed for some time, they are cumbersome and have provided a challenge in many workflows, especially those with multiple authors.
Following Typhoon Yolanda in the Philippines late last year, the OpenStreetMap community used open data to assist in the recovery effort. This effort, while highly rewarding and immensely helpful, typified the problem every project working with geospatial information eventually faces: managing change over time. At the center of the issue is data provenance: where the data originated, to whom it belongs, and which individual changes were made to a particular piece of information to reach its current state.
A preferred model is to use data in a peer-to-peer network where groups can exchange large sets of modifications atomically. This model eliminates the single point of failure and also challenges the notion that we need a single source of truth for geospatial information. This approach makes traditional geospatial practitioners wary, but authoritative data can still be enforced through process rather than through limits in the technology. The model makes it easier to collaborate on and share geospatial data while keeping track of versions. Boundless and other open source advocates brought this approach to life with GeoGig (formerly GeoGit).
GeoGig takes concepts and lessons learned from working with code in open source communities and applies them to managing geospatial information. GeoGig allows for decentralized management of versioned data and enables new and innovative workflows for collaboration. Users are able to track edits to geospatial information by importing raw data into repositories where they can view history, revert to older versions, branch into sandboxed areas, merge back in, and push to remote repositories.
Working with GeoGig
Once installed, a simple working session might look like this (data references are from the freely available Natural Earth collection):
1. Create a repository and import raw geospatial data (from Shapefiles and spatial databases such as PostGIS, Oracle Spatial or SQL Server):
geogig init
geogig shp import ne_110m_coastline.shp
2. Add the imported data to the staging area. This command signals that this is information to be versioned and tracked and prepares it for final insertion into the repository.
geogig add
3. Commit the information to the repository. Developers who know Git will recognize the commands and command line options. In this case, we are passing a commit message that will be associated with this change.
geogig commit -m "Add coastline"
4. In order to make changes and collaborate with others, a typical workflow involves creating branches to isolate changes from the master branch. Creating a branch in GeoGig is as easy as issuing the following command:
geogig branch branch1
Figure: Branching in GeoGig
This creates a new branch called branch1. Once that branch is checked out (geogig checkout branch1), all commits go to it until another branch is chosen. Branching is an important concept in GeoGig because it lets editors of geospatial content modify information without worrying about degrading the quality of the main version, usually stored in the master branch.
5. When changes are ready to be brought back into the main version, switch to the target branch (here, master) and merge the edited branch into it using the merge command:
geogig checkout master
geogig merge branch1
Upon merging, a merge conflict is returned if conflicts are detected (for example, two users independently modify the same geometry with different outcomes) and a commit cannot happen until the conflict is resolved. This is an important feature that prevents geospatial data corruption and enforces workflows that involve data quality assurance.
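To make the conflict rule concrete, here is a minimal sketch of a three-way merge over features keyed by ID. It is a toy model for illustration only, not GeoGig's actual merge implementation: the function name `merge_features`, the dictionary layout, and the use of WKT strings as geometries are all assumptions made for this example.

```python
def merge_features(base, ours, theirs):
    """Three-way merge of {feature_id: geometry} dictionaries.

    base   -- the common ancestor version
    ours   -- the version on the branch being merged into
    theirs -- the version on the branch being merged
    Returns (merged, conflicts). Conflicting features are left out of
    merged; a caller should refuse to commit while conflicts is non-empty.
    """
    merged = {}
    conflicts = []
    for fid in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(fid), ours.get(fid), theirs.get(fid)
        if o == t:
            result = o            # both sides agree (or neither changed it)
        elif o == b:
            result = t            # only their side changed the feature
        elif t == b:
            result = o            # only our side changed the feature
        else:
            conflicts.append(fid)  # both changed it, with different outcomes
            continue
        if result is not None:     # None means the feature was deleted
            merged[fid] = result
    return merged, conflicts


# Hypothetical data: two editors move the same coastline vertex differently.
base = {"coast-1": "LINESTRING (0 0, 1 1)", "coast-2": "LINESTRING (2 2, 3 3)"}
ours = {**base, "coast-1": "LINESTRING (0 0, 1 2)"}
theirs = {**base, "coast-1": "LINESTRING (0 0, 2 1)",
          "coast-3": "LINESTRING (4 4, 5 5)"}

merged, conflicts = merge_features(base, ours, theirs)
print(conflicts)        # features that need manual resolution
print(sorted(merged))   # features that merged cleanly
```

In this example the independently edited feature is reported as a conflict, while the untouched feature and the newly added one merge cleanly. A real system also has to compare attributes and handle geometry equality more carefully than string comparison does.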
Anyone familiar with tools like Git, which handles distributed version control for source code, will immediately see the advantages this approach brings.
GeoGig is an open source project based on the Java platform and is developed by committers across several organizations. It has recently been submitted as a project of the LocationTech working group within the Eclipse Foundation. GeoGig has also been designed to be extensible, and there is already a Python wrapper library that makes these operations easier and enables automation.
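As a rough idea of what automation can look like, the sketch below scripts the first steps of the working session by shelling out to the command line tool. It does not use the Python wrapper library mentioned above, whose API is not shown in this article. The helper names `geogig_commands` and `run_session` are hypothetical, and the `init` and `add` subcommands follow GeoGig's Git-style naming, so check them against your installed version.

```python
import shlex
import subprocess


def geogig_commands(shapefile, message, branch):
    """Return the walkthrough's CLI calls as argument lists (hypothetical helper)."""
    return [
        ["geogig", "init"],                     # create the repository
        ["geogig", "shp", "import", shapefile],  # import raw data
        ["geogig", "add"],                      # stage the imported data
        ["geogig", "commit", "-m", message],    # commit with a message
        ["geogig", "branch", branch],           # create a branch for edits
    ]


def run_session(commands, repo_dir=".", dry_run=True):
    """Run each command in repo_dir, or only format it when dry_run is True.

    Returns the shell-quoted command lines so they can be logged or reviewed.
    """
    lines = []
    for argv in commands:
        lines.append(shlex.join(argv))
        if not dry_run:
            # check=True raises CalledProcessError if a command fails
            subprocess.run(argv, cwd=repo_dir, check=True)
    return lines


if __name__ == "__main__":
    session = geogig_commands("ne_110m_coastline.shp", "Add coastline", "branch1")
    for line in run_session(session, dry_run=True):
        print(line)
```

The dry run is the default so that the script can be reviewed before it touches a repository. Passing `dry_run=False` requires the `geogig` executable to be on the PATH.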