April 18, 2014

avatar

Release Government Data, Early and Often

One of the key axioms of modern open government is that all public data should be published online in a raw but usable form. Usability in this case is aimed at software programmers. By making government datasets more usable, programmers are more likely to innovate in the civic sphere and build technologies, using the raw data, to enhance the relationships among citizens and with government.

The open government community has provided plenty of valuable guidance about what usability means for programmers. We proclaim that all datasets need to be: published in a format that is reasonably structured and machine-processable; well-documented; downloadable in bulk; authenticated using cryptographic digital signatures; version-controlled; permanent and citable; and the list goes on and on. These are all worthy principles to be sure, and all government datasets should strive to meet them.

But you’ll be hard-pressed to find any government datasets that exist with all of these principles pre-satisfied. While some are in better shape than others, most datasets would make programmers cringe. Data often only exist as informal working sets in proprietary Excel spreadsheets. Sometimes they are in structured databases, but schemas are undocumented, field values are ambiguous, and the semantics are only understood by the employee who created them. Datasets have errors and biases that are known but never explicitly corrected.

For a civil servant who is a data caretaker looking over the laundry list of publishing principles, there’s frequently a huge quality chasm between the dataset she owns and how people are asking to see it released. To her, publishing this data adequately just seems like a lot of extra work. The more attractive alternative is to put off the data publishing—it’s not in her job description or evaluations anyway—and move on to other work instead.

How can this chasm be bridged? A widely-adopted philosophy in software development and entrepreneurship would serve open government data well: release early and release often. And listen to your customers.

In the software development world, a working version of the product is pushed out as soon as possible even with known imperfections—an “alpha” release—so it can be subject to real use by early adopters. Early adopters can provide helpful feedback about what works, what’s broken, and what new features would be most useful to them. The software developers then iterate quickly. They incorporate the suggested fixes and features into their code and release an updated version of the product to their users. The virtuous cycle then starts again. Under this philosophy, software developers can be efficient about how to best improve their code where it matters, and users get software that works better and has more features they desire.

The “release early, release often” philosophy should be applied to government data. For the initial release, data caretakers should take the path of least resistance to get data out the door. This means publishing datasets in whatever format is most convenient, along with as much documentation as can reasonably be mustered. Documentation is especially important with an “alpha” dataset—proper warnings about its problems, instabilities and inductive limitations must be prominently displayed. (Of course, the usual privacy and legal caveats should also be applied.) Sometimes, the “alpha” release will be “good enough” for programmers to start their work, and this will minimize any superfluous work done by caretakers. This is the virtue of “release early.”

In other cases, programmers will need assistance using the dataset and will notice problem spots with the initial release. The dataset might be confusing, contain errors or be difficult to work with. A tight feedback mechanism allows the programmer to get help quickly and continue to innovate, while the data caretaker can fix problems based on real use cases and add clarifying metadata into an updated version of the dataset. Data quality and usability increases for those working with the dataset, both in and outside of government. That’s the virtue of “release often.”

And here is the big opportunity for government: no platform currently exists to engage the prime audience for government data—software programmers. Without a tight feedback mechanism, the virtuous cycle of mutual benefit cannot exist. Government is missing its best opportunity to improve data quality by neglecting useful feedback from programmers who are actually tinkering with the datasets. Society is losing out on potentially game-changing civic innovations, which otherwise would have been built if data were more usable and the uncertainty of failure reduced.

A terrific start in turning the corner would be for government to adopt an issue-tracking system for its datasets. As a public venue, it would help ensure that data caretakers are prompt in addressing developer concerns. It would also allow caretakers to organize feedback in a formal way. Such platforms are commonplace in any successful software development venture. The same needs to be true for government data in order to drive rapid quality improvements and increase developer engagement.

Comments

  1. Glyn says:

    What should developers do when he/she has provided feedback, for example that the data contains errors and then seemingly nothing happens? This is often the case. You start working with government data and discover that the data contains errors or there is a missing data set that is needed to work with the data. The most common response is yes we will get that for you, or we will set up a meeting …. followed by months of nothing happening.

    • Anonymous says:

      Glyn asks, “What should developers do when he/she has provided feedback, for example that the data contains errors and then seemingly nothing happens? This is often the case.”

      This is often the case when reporting bugs in software, too. Particularly if they’re bugs in Microsoft products.

      Any suggestions? :)

      • harlanyu says:

        That’s certainly a potential problem. Since the issue tracker would be a public forum, it would put at least some pressure on government entities to be reasonably responsive. Departments and agencies could also build open government into their process for evaluating those employees who are responsible for managing these datasets.

  2. WayneJ says:

    Issue tracking would be a terrific start. Thanks for promoting this. Without issue tracking information just gets lost. With a good system it’s easy to find the common complaints and good ideas get to the people that can do something about them. I can’t imagine doing software development without one any more, yet you hardly see them outside of development and support groups.