[This is the first post in a series on best practices for government datasets by Harlan Yu and me.]
There’s a growing consensus that the government can increase its openness and transparency by publishing its raw data in bulk online. As several Freedom to Tinker contributors argued in Government Data and the Invisible Hand, publishing data empowers third-party software developers to produce innovative new technologies that engage citizens and illuminate government’s inner workings. With the establishment of Data.gov and the federal Open Government Initiative, federal agencies are quickly embracing a culture of machine-readable data release, and many states and municipalities are now following their lead.
But how usable are these datasets for developers? The answer lies primarily in the structure and contents of the datasets themselves. While all data in digital form is technically machine-readable in some sense, the ease of use for machine-readable datasets can vary widely. In fact, machine-readability is just a baseline requirement: a developer can’t start to work with a dataset until it’s in this form. Once that minimum standard is met, the critical factor is how easy it is for developers to use the dataset in new, innovative ways.
In this series of posts, we’ll draw on our experience building applications that use government data to offer some thoughts about best practices government could follow in releasing data. By taking a few straightforward steps in preparing its datasets, government can make the data much more useful to developers.
One key factor in determining ease of use for developers is the structure of the dataset, and that is the topic of our first post. Let’s start with a trivial example:
<BOOK>A Tale of Two Cities by Charles Dickens. Chapter 1. The Period. It was the best of times, it was the worst of times [...] The end.</BOOK>
This is a “well-formed” XML version of Dickens’s “A Tale of Two Cities” in its entirety. Though more usable than a PDF copy of the book, the XML document lacks basic structure and is not particularly helpful to a developer building tools to display or analyze the book. Compare that to:
<BOOK>
  <HEADER>
    <TITLE>A Tale of Two Cities</TITLE>
    <AUTHOR>Charles Dickens</AUTHOR>
  </HEADER>
  <BODY>
    <CHAPTER NUMBER="1">
      <TITLE>The Period</TITLE>
      <PARAGRAPH NUMBER="1">
        <SENTENCE NUMBER="1">It was the best of times [...]</SENTENCE>
      </PARAGRAPH>
      [...]
    </CHAPTER>
    [...]
  </BODY>
</BOOK>
This data is far more structured, and a developer can take it and immediately do lots of new things. If the developer plans to build an interface for a new e-book reader, for instance, it’s easy to extract the component parts of the book for appropriate formatting. With the less-structured version, the developer needs to guess where chapters, titles, and paragraphs begin and end. Because manual analysis is infeasible for large, complex datasets, developers who have only minimally structured data will need to build automated processing scripts to make these guesses. Developing these scripts can be difficult and time-consuming, and data quality will suffer because the scripts will inevitably make mistakes.
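To make the difference concrete, here is a minimal Python sketch (standard library only, and purely illustrative) of what each version asks of a developer. The element names come from the examples above; everything else, including the regular-expression heuristic for the flat version, is our own assumption about how such a script might look.

import re
import xml.etree.ElementTree as ET

STRUCTURED = """<BOOK><HEADER><TITLE>A Tale of Two Cities</TITLE>
<AUTHOR>Charles Dickens</AUTHOR></HEADER><BODY><CHAPTER NUMBER="1">
<TITLE>The Period</TITLE><PARAGRAPH NUMBER="1">
<SENTENCE NUMBER="1">It was the best of times [...]</SENTENCE>
</PARAGRAPH></CHAPTER></BODY></BOOK>"""

UNSTRUCTURED = ("<BOOK>A Tale of Two Cities by Charles Dickens. Chapter 1. "
                "The Period. It was the best of times [...] The end.</BOOK>")

# With the structured version, extracting parts is a direct tree walk.
book = ET.fromstring(STRUCTURED)
print(book.find("HEADER/TITLE").text)    # A Tale of Two Cities
print(book.find("HEADER/AUTHOR").text)   # Charles Dickens
for chapter in book.iterfind("BODY/CHAPTER"):
    print(chapter.get("NUMBER"), chapter.find("TITLE").text)  # 1 The Period

# With the flat version, the developer must guess where chapters begin.
# This pattern works on our toy input, but it would also split any
# sentence that merely mentions, say, "Chapter 9." somewhere in the prose.
flat_text = ET.fromstring(UNSTRUCTURED).text
chapters = re.split(r"Chapter \d+\.", flat_text)
print(chapters)

The point is not this particular regular expression: any script written against the flat version has to encode guesses like this one, and every guess is a potential source of error.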
Whether a dataset facilitates innovative uses by developers is not a yes or no question but a matter of degree, and it depends largely on the quality of the data’s structure and the needs of specific developers. In deciding what structure to add, agencies should consider who is in the best position to add various types of structure to the data. Sometimes, the agency is in the best position. Employees of an agency may amass specialized knowledge about the data, or the agency may already internally store the data with structural details like explicit database columns. In these cases, the agency can provide this structure with little effort, relieving developers from the potentially Herculean task of reconstructing these details. In other cases, the agency may have no significant advantage over private parties.
Agencies should get as close to this dividing line as is reasonably possible, supplying all the structure that they can add more easily than outsiders can, in order to broaden the range of creative possibilities for application developers. The goal is to minimize structural obstacles that might prevent developers from tinkering with the data. Better structure leads to more innovative tools, a more transparent government, and a greater appreciation for the work done by federal agencies.
Over our next several posts, we’ll discuss choices that agencies make when releasing datasets and the ways these choices affect developers. Among other things, we’ll explore basic data format lessons, data labeling, and correction/modification of datasets. Our goal is to turn this series into a best practices white paper for government use, and we’d appreciate any comments, suggestions, or insights from readers.
Interesting post. I’m looking forward to this series.
It reminded me of a recent paper on API usability at the Computer Supported Cooperative Work conference. Reviews and best practices help developers who are engrossed in a system build a useful interface for outsiders.
This is a very similar problem to that faced by scientists in the various e-science and scientific cyberinfrastructure efforts to share scientific data. The data may be published, often by mandate, but is it usable by other scientists? Is publishing an unfunded mandate? How will we encourage time, money, and attention to be given to really make the data useful? How do you adequately describe the data so that someone with different training and background, and perhaps from a different community, can make sense of it? I.e., what sort of metadata will accompany each dataset?
The example you give is a simple one. I look forward to your descriptions of more complicated and realistic situations: datasets that provide demographic information and need a definition of what counts as “Hispanic,” or budget data that needs some way of communicating information about the accounting system.
Making sense of data is often more than a matter of technical interpretation. In practice, it often ends up requiring communication with the creators of the data. I hope that open government makes room for that work as well.