Directions Magazine

Treat Data As Code

Monday, October 7th 2013

Summary:

Ben Balter argues that open data today is exactly where open source was some two decades ago, and wants to see if it's possible to fast forward the community a bit. Imagine if every time the government posted a dataset, rather than posting the data as a zip file or to a proprietary data portal, the agency treated the data as open source. 

It’s time we start treating our data with the same love and respect that geeks treat their code. It’s time that we begin treating data as open source, not simply as something to be published.

Putting process on a pedestal

Geeks learned some two decades ago that precision and transparency are everything. If so much as a single character is off, entire programs come crashing to a halt. It’s essential that developers can instantly discern exactly who made what change, and when. As a result, every change, whether proposed or realized, is tracked and indexed with the highest level of granularity imaginable, and all of this information is constantly exposed alongside the software itself. It’s what makes open source open source.
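To make "who made what change, and when" concrete, here is a minimal sketch using git, the version control tool most of the open source world runs on. The repository path, file, and author names are all illustrative:

```shell
# Build a tiny repository with two commits by two (fictional) authors.
rm -rf /tmp/blame-demo && mkdir /tmp/blame-demo && cd /tmp/blame-demo
git init -q
git config user.email "alice@example.com"
git config user.name "Alice"
printf 'total = 0\n' > tally.py
git add tally.py && git commit -qm "Start the tally"
git config user.name "Bob"
git config user.email "bob@example.com"
printf 'total = total + 1\n' >> tally.py
git add tally.py && git commit -qm "Increment the tally"
# git blame attributes every surviving line to its author and commit:
git blame tally.py
```

Each line of `git blame` output names the commit, author, and date responsible for that line, which is exactly the granularity the paragraph above describes.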

Having access to a program’s underlying source code and the ability to see its revision history is only half the story, though. At its core, open source is about building communities around shared challenges. Being able to track changes at that level of granularity, and with that fidelity of decision pedigree, empowers contributors to propose and discuss changes with great efficiency, accuracy, and precision. It makes software a team sport. All of a sudden line-by-line code reviews, issues, and pull requests arise to address challenges both large and small. Simply put, do it right, and technology makes it easier to work together than to go it alone.

Where open source was two decades ago

Things weren’t always this way, however. Originally, source code was shared by passing around physical media, then email, and eventually zip or other compressed files posted to public servers. Questions and proposed improvements were transacted via email, and were available only to the project author. Questions were repeated, efforts were duplicated, and learning wasn’t shared. It didn’t leverage the power of the crowd. Sound familiar?

I’d argue that open data today is exactly where open source was some two decades ago, and I’d love to see if we couldn’t fast-forward the community a bit. Imagine if every time the government posted a dataset, rather than posting the data as a zip file or to a proprietary data portal, the agency treated the data as open source. All of a sudden datasets get a running log of known issues, and not just those known to the agency. Consumers of the data can submit proposed changes to do everything from normalizing columns to correcting errors to making the data itself more usable. Most importantly, as that data evolves over time, there’s a running log of exactly what’s changed, a critical feature in the regulatory context (e.g., what licenses were issued in the past week?).
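The "what was issued in the past week?" question falls out of the revision history for free once a dataset lives in version control. A minimal sketch, with an invented licenses CSV standing in for an agency dataset:

```shell
# Build a toy version-controlled dataset with two exports.
rm -rf /tmp/licenses-demo && mkdir /tmp/licenses-demo && cd /tmp/licenses-demo
git init -q
git config user.email "clerk@agency.example"
git config user.name "Agency Clerk"
printf 'license_id,issued\n1001,2013-09-30\n' > licenses.csv
git add licenses.csv && git commit -qm "Initial license export"
printf '1002,2013-10-04\n' >> licenses.csv
git add licenses.csv && git commit -qm "Licenses issued this week"
# The regulatory question becomes a one-liner over the history:
git log --since="1 week ago" -p -- licenses.csv
```

The `-p` flag shows the line-by-line diff of each recent commit, so newly issued licenses appear as added rows, with the commit message and timestamp attached.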

Open sourcing data

We’re not talking about reinventing the wheel here. We’re talking about taking a proven practice in one industry, and introducing it to a related one. And from an agency perspective, it’s not a radical change either. Instead of FTPing static files to an agency server or updating a custom front-end, simply commit the file like the open source community would code. Heck, with GitHub for Windows/Mac, it’s a matter of drag, drop, sync. No command line or neck beard necessary.
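For those who do prefer the command line, the commit-instead-of-FTP workflow is only a handful of commands. In this sketch a local bare repository stands in for the agency’s hosted repo; every path and file name is an assumption for the demo:

```shell
# Stand in for the agency's hosted repository with a local bare repo.
rm -rf /tmp/agency-data.git /tmp/publish-demo
git init -q --bare /tmp/agency-data.git
git clone -q /tmp/agency-data.git /tmp/publish-demo
cd /tmp/publish-demo
git config user.email "publisher@agency.example"
git config user.name "Data Publisher"
printf 'permit_id,status\n42,active\n' > permits.csv   # the fresh export
git add permits.csv
git commit -qm "October 2013 permit export"
git push -q origin HEAD                # publishing is just a push, no FTP
```

Every subsequent export is the same three steps: copy the file in, commit, push. The full history of past exports rides along automatically.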

All of a sudden we’re doing a few things: First, we’re empowering subject matter experts to be publishers. There’s no longer a Rube Goldberg machine necessary to publish data. Second, we’re starting a conversation between data publishers and data consumers. That’s where the issues and pull requests come into play. Finally, we’re exposing process, ensuring that open data becomes not simply “published data”, but can truly be open, dedicated community and all.

A package manager for government data

So why aren’t we there yet? For one, good old-fashioned FUD. It’s hard enough to get data outside the firewall, let alone to expose process alongside it. For another, it’s a matter of tooling. Things like GeoJSON and CSV rendering go a long way toward giving open sourcing data a strong value proposition, but as long as it’s easier to do the wrong thing, that’s going to be the default. We need a prose.io for more data types; we need more geojson.io’s. Finally, it’s a matter of culture and education. The technology’s already there. That’s not the problem. But most data publishers, researchers, and subject-matter experts have never heard of version control or exposing process. It’s not in their blood. It’s simply not how things are done.

Imagine if the next iteration of Data.gov used CKAN to manage the metadata catalog, but rather than simply pointing to opaque and static zip files, Excel files, PDFs, and other binary formats, instead took a play from the rubygems.org playbook, and provided a significant value add for data stored on GitHub (while still remaining fully backward compatible with any federated datastore). Imagine if when searching for a dataset on Data.gov you not only had links to view collaboratively written documentation, browse outstanding issues, or submit proposed changes, but also had immediate access to an entire community of subject-matter experts and like-minded data consumers, with whom you could interact directly. All of a sudden, the agency is no longer the single point of failure. We’re democratizing data.

The vision needs a few high-visibility wins, and more importantly, needs advocates and evangelists to take those wins back to those empowered to effect change. But there’s nothing radical here, and definitely nothing that hasn’t already been done for longer than I’ve been interneting. How long will it be before you fork your first government dataset? Only time will tell, but one thing’s for sure: the data deserves it.

Reprinted from Ben Balter by Benjamin J. Balter under  Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Image by Libby Levi under CC-BY-SA 2.0.










