[This is the fifth and final post in a series on best practices for government datasets by Harlan Yu and me. (previous posts: 1, 2, 3, 4)]
For our final post in this series, we’ll discuss several issues not touched on by earlier posts, including data signing and the use of certain non-text file formats. The relatively brief discussions of these topics should not be interpreted as an indicator of their importance. The topics simply did not fit cleanly into earlier posts.
One significant omission from earlier posts is the issue of data signing with digital signatures. Before discussing this issue, let’s briefly discuss what a digital signature is. Suppose that you want to email me an IOU for $100. Later, I may want to prove that the IOU came from you—it’s of little value if you can claim that I made it up. Conversely, you may want the ability to prove whether the document has been altered. Otherwise, I could claim that you owe me $100,000.
Digital signatures help in proving the origin and authenticity of data. These signatures require that you create two related big numbers, known as keys: a private signing key (known only by you) and a public verification key. To generate a digital signature, you plug the data and your signing key into a complicated formula. The formula spits out another big number known as a digital signature. Given the signature and your data, I can use the verification key to prove that the data came unmodified from you. Similarly, nobody can credibly sign modified data without your signing key—so you should be very careful to keep this key a secret.
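To make the mechanics concrete, here is a minimal sketch of signing and verification using the Python cryptography package with Ed25519 keys. This is only an illustration of the concept, not a recommendation of a particular algorithm or tool, and it assumes the package is installed:

# Minimal sketch: sign and verify data with an Ed25519 key pair,
# using the third-party Python "cryptography" package.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

signing_key = ed25519.Ed25519PrivateKey.generate()   # private: keep secret
verification_key = signing_key.public_key()          # public: distribute widely

data = b"IOU $100"
signature = signing_key.sign(data)                   # the "big number" described above

try:
    verification_key.verify(signature, data)         # raises if data or signature was altered
    print("Signature valid: the data is unmodified.")
except InvalidSignature:
    print("Signature invalid: the data or signature was altered.")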
Developers may want to ensure the authenticity of government data and to prove that authenticity to users. At first glance, the solution seems to be a simple application of digital signatures: agencies sign their data, and anyone can use the signatures to authenticate an agency’s data. In spite of their initially steep learning curve, tools like GnuPG provide straightforward file signing. In practice, the situation is more complicated. First, an agency must decide what data to sign. Perhaps a dataset contains numerous documents. Developers and other users may want signatures not only for the full dataset but also for individual documents in it.
Once an agency knows what to sign, it must decide who will perform the signing. Ideally, the employee producing the dataset would sign it immediately. Unfortunately, this solution requires all such employees to understand the signature tools and to know the agency’s signing key. Widespread distribution of the signing key increases the risk that it will be accidentally revealed. Therefore, a central party is likely to sign most data. Once data is signed, an agency must have a secure channel for delivering the verification key to consumers of the data—users cannot confirm the authenticity of signed data without this key. While signing a given file with a given key may not be hard, the surrounding issues are trickier. We offer no simple solution here, but further discussion of this topic between government agencies, developers, and the public could be useful for all parties.
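As one concrete possibility (a sketch only, with hypothetical file names and layout, not an agency-endorsed workflow), a central signer might write a detached signature file next to every document in a dataset directory:

# Hypothetical sketch: a central signer produces a detached .sig file
# for every document in a dataset directory.
from pathlib import Path
from cryptography.hazmat.primitives.asymmetric import ed25519

signing_key = ed25519.Ed25519PrivateKey.generate()   # in practice, loaded from secure storage

dataset_dir = Path("dataset")                        # hypothetical directory of documents
for document in dataset_dir.glob("*.csv"):
    signature = signing_key.sign(document.read_bytes())
    document.with_name(document.name + ".sig").write_bytes(signature)
    # consumers check each .sig against the agency's published verification key

A dataset-level signature could then be produced over an index of the individual files, so developers can verify either the whole release or a single document.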
Another issue that earlier posts did not address is the use of non-text spreadsheet formats, including Microsoft Excel’s XLS format. These formats can sometimes be useful because they allow the embedding of formulas and other rich information along with the data. Unfortunately, these formats are far more complex than raw text formats, so they present a greater challenge for automated processing tools. A comma-separated values (CSV) file is a straightforward text format in which line breaks separate records and commas separate the values within each record. It provides an alternative to complicated spreadsheet formats. For example, the medal count from the 2010 Winter Olympics in CSV would be:
Country,Gold,Silver,Bronze,Total
USA,9,15,13,37
Germany,10,13,7,30
Canada,14,7,5,26
Norway,9,8,6,23
...
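Part of the appeal for developers is that a file like the one above can be parsed in a few lines; here is a sketch using Python’s built-in csv module, with a hypothetical file name:

# Read the medal-count CSV above with Python's built-in csv module.
import csv

with open("medals.csv", newline="") as f:            # hypothetical file name
    for row in csv.DictReader(f):
        print(row["Country"], row["Total"])          # e.g. "Canada 26"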
Fortunately, the release of data in one format does not preclude its release in another format. Most spreadsheet programs provide an option to save data in CSV form. Agencies should release spreadsheet data in a textual format like CSV by default, but an agency should feel free to also release the data in XLS or other formats.
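For anyone converting an existing spreadsheet in bulk, one possible approach is a short script. This sketch assumes the pandas library plus an Excel reader engine (such as xlrd or openpyxl) are installed, and the file names are hypothetical:

# Sketch: convert a spreadsheet to plain-text CSV with pandas.
import pandas as pd

table = pd.read_excel("medals.xls")                  # hypothetical input spreadsheet
table.to_csv("medals.csv", index=False)              # plain-text CSV output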
Similarly, agencies will sometimes release large files or groups of files in a compressed or bundled format (for example, ZIP, TAR, GZ, BZ). In these cases, agencies should prominently specify where users can freely obtain software and instructions for extracting the data. Because so many means of compressing and bundling files exist, agencies should not presume that the necessary tools and steps are obvious from the data files themselves.
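On the consuming side, the most common bundles can be unpacked with standard tools; here is a sketch using Python’s standard library, with hypothetical file names:

# Sketch: extract common archive formats with Python's standard library.
# (Only extract archives obtained from sources you trust.)
import tarfile
import zipfile

with zipfile.ZipFile("dataset.zip") as z:            # hypothetical ZIP bundle
    z.extractall("dataset_zip")

with tarfile.open("dataset.tar.gz") as t:            # hypothetical gzipped TAR bundle
    t.extractall("dataset_tar")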
The rules suggested throughout this series should be seen as best practices rather than hard-and-fast rules. We are still in the process of fleshing out several of these ideas ourselves, and exceptional cases sometimes justify exceptional treatment. In unusual cases, an agency may need to deviate from traditional best practices, but it should carefully consider (and perhaps document) its rationale for doing so. Rules are made to be broken, but they should not be broken for mere expedience.
Our hope is that this series will provide agencies with some points to consider prior to releasing data. Because of Data.gov and the increasing traction of openness and transparency initiatives, we expect to see many more datasets enter the public domain in the coming years. Some agencies will approach the release of bulk data with minimal previous experience. While this poses a challenge, it also presents an opportunity for committed agencies to institute good practices early, before bad habits and poor-quality legacy datasets can accumulate. When releasing new datasets, agencies will make numerous conscious and unconscious choices that impact developers. We hope to help agencies understand developers’ challenges when making these choices.
After gathering input from the community, we plan to create a technical report based on this series of posts. Thanks to numerous readers for insightful feedback; your comments have influenced and clarified our thoughts. If any FTT readers inside or outside of government have additional comments about this post or others, please do pass them along.
With regard to signing of government data to affirm authenticity, I would suggest that signatures should be reinforced by publishing index files containing the hashes of other government publications, and then publishing in some indelible medium the hashes of those index files. If someone in the government were to leak in 2013 the private signing key that had been used to sign some documents in 2009, the signatures on those documents would become worthless. On the other hand, if the government published a list of the hashes of all the documents signed from November 1, 2008 to October 31, 2009 and then printed the hash of that list in small type on all 2009 tax forms it sent out, it would be much harder for the government to deny that a copy of the list was indeed accurate as of November 1, 2009, and by extension, that any documents in that list were legit.
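A sketch of that index-file idea, using SHA-256 from Python’s standard hashlib module and hypothetical file names:

# Sketch: hash each published document, write the hashes to an index file,
# then hash the index itself for publication in some indelible medium.
import hashlib
from pathlib import Path

def sha256_hex(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

documents = ["budget_2009.csv", "medals.csv"]        # hypothetical publications
index_lines = [sha256_hex(doc) + "  " + doc for doc in documents]
Path("index_2009.txt").write_text("\n".join(index_lines) + "\n")

print("Hash to print on the 2009 tax forms:", sha256_hex("index_2009.txt"))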
I’ve bumped into the last post in your series and backtracked to look at the others. I’d like to point out a significant omission: the effective use of URLs for linking published data.
Links can play a valuable role in making data useful, by indicating when two references are to the same item. This can help both when items have multiple names (as with Burma versus Myanmar, or when someone marries and changes their name) and when multiple items have the same name (when an article’s author is David Karger, is that me or the movie critic?).
Links become essential when a user wants to draw two pieces of published data together. Such “mashing up” is one of the most important ways to create new understanding of information (after all, if everything you want to know is contained in one data file, then someone already knew it). Links tell the user when objects referenced in these two files are the same and should be merged, and when they are different and should be carefully kept separate.
You can consider two different kinds of links. URLs (uniform resource locators) take you to a web page that (hopefully) tells you more about the object being referenced. On the other hand, URIs (uniform resource identifiers) may not resolve to any web page, but are simply abstract identifiers to tell you when two things are the same. URLs do offer a natural location to provide more information about an object, but URIs are adequate for the key role of information linkage.
As with your description of structure, proper linking is something that is relatively easy to do at the moment of authoring but quite challenging for a user to reconstruct later. And analogous to your earlier discussion of degrees of structure, there are also degrees of linking. With linking, the key question is how “universal” you make your link. There’s value in simply making links consistent within multiple documents that you publish, and even more value (but also more difficulty) in trying to use the same links as the rest of your group or organization. Some large organizations invest tremendous resources in creating huge ontologies that all members must follow, but I believe that any sort of “best effort” to find and use preexisting object URLs from canonical sources such as Wikipedia or Freebase will produce much of the benefit with a lot less work.
Putting links into XML documents isn’t hard. For example, one can borrow from RDFa the idea of using an “about” attribute in your XML, as in
<article>
  <author about="http://people.csail.mit.edu/karger/">
    David Karger
  </author>
</article>
and so on. Alternatively, one can treat the links as their own elements:
<article>
  <author>
    <name>
      David Karger
    </name>
    <identifier>
      http://people.csail.mit.edu/karger/
    </identifier>
  </author>
</article>
Either way, a user can pull out the information they need to properly connect different data sets.
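As a small illustration of that last point, a consumer of the second form could pull out the identifiers with Python’s standard ElementTree module (a sketch, with a hypothetical file name):

# Sketch: extract author names and identifiers from XML shaped like the
# second example above, using Python's standard ElementTree module.
import xml.etree.ElementTree as ET

tree = ET.parse("articles.xml")                      # hypothetical file of <article> records
for author in tree.getroot().iter("author"):
    name = author.findtext("name", default="").strip()
    identifier = author.findtext("identifier", default="").strip()
    print(name, "->", identifier)                    # records sharing an identifier can be merged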
I’m a big fan of CSV for the sole reason that it’s pretty much the most efficient uncompressed text-based format to store row/column based data. And theoretically unlimited in size. (And delimiters and stateful parsing are a minuscule hit compared to the rest of the idiosyncrasies you have to deal with in the data itself.)
However, if people at some government agencies are more comfortable publishing straightforward xls tables (i.e., no charts, macros, or document structure implied via cell formatting), I find this preferable over ad-hoc CSV — which may or may not convey all the information intended, or could have inconsistencies like poster #1 suggested — as there are many open source tools out there that make it easy to extract data from Excel files.
What’s more annoying is when they not only publish exclusively in pdf (better yet, self-extracting exe files), but also insist when asked that their data exists in no other format than pdf. I have really heard agency staff say this…
The several mentions of XLS are troubling, given that it’s a proprietary format (and worse, a Microsoft one). ODS (the native format used by OpenOffice Calc) should be preferred to XLS on principle (even though OO Calc will open XLS files) and to maximize compatibility. ODS is a spreadsheet format that will be usable by any tech-savvy developer on pretty much any system, since OO Calc itself runs pretty much anywhere.
In particular, XLS from a new enough version of Excel might not be readable by Calc until that version of the XLS format has been reverse engineered by Calc’s developers and a new version of Calc released. The alternative to using Calc for the developer would be using Excel (and Windows) and, thus, paying the Microsoft tax. Government should not be forcing third parties (e.g. developers using a data set) to give custom to a particular corporation (e.g. Microsoft) as a matter of principle, either deliberately or through negligence.
CSV is also unnecessarily complex, because commas appear in data often, which means that delimiters and stateful parsing are required, which increases complexity and processing time.
TSV is better because tabs don’t occur in data.
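In practice, standard CSV libraries absorb most of the quoting complexity, and the same code can read tab-separated data; here is a sketch with Python’s built-in csv module and made-up values:

# Sketch: the csv module quotes embedded commas automatically, and the same
# reader handles tab-separated data via the delimiter option.
import csv
import io

out = io.StringIO()
csv.writer(out).writerow(["Smith, John", "Ottawa", 100])
print(out.getvalue())                    # the embedded comma is quoted: "Smith, John",Ottawa,100

tsv_data = "Country\tGold\nCanada\t14\n"
for row in csv.reader(io.StringIO(tsv_data), delimiter="\t"):
    print(row)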
Brianary,
I’d thought about this point but used CSV primarily because the abbreviation is better known than TSV. CSV seems to be used frequently even when referring to TSV.
You have a good point here that agencies should think about their delimiters when using a format like CSV or TSV. Commas, tabs, and spaces can all have issues depending on the dataset, but any option is acceptable as long as it does not appear in the data.
-Joe
The biggest issue I’ve seen with CSV is that Excel tends to use ‘guess the format’ heuristics to figure out what type the data is, then assigns it that format without giving you any way to override it. This gets real fun when you have a column of numbers in hexadecimal and you get 1293 staying as a number, 12C3 turning into text, and 12E3 turning into 12000 as it’s read as an exponential. A manual override would be nice. Worse, several of the heuristics seem to be based on the current regional/language settings, so the guesses won’t even be the same on different PCs.
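For developers who want to sidestep that guessing entirely, one option is to load the file so every value stays as text; a sketch assuming the pandas library and a hypothetical file name:

# Sketch: read a CSV with no type guessing, so values like 12C3 or 12E3
# stay exactly as written.
import pandas as pd

parts = pd.read_csv("parts.csv", dtype=str)          # every column kept as text
print(parts.head())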