[This is the third post in a series on best practices for government datasets by Harlan Yu and me. (previous posts)]
When the government releases a dataset, citizens ideally will discuss the contents and supply educated feedback. The ability to reference facts and figures in a dataset supports a constructive dialog. Vague concerns are harder to articulate and address than ones citing specific paragraphs in a document. In this post, we’ll discuss why data labeling supports this goal, and when and how government agencies should uniquely label data inside a dataset for citability. As in the previous post, our focus will be on XML, though the lessons apply to other formats.
As our interactions with each other and with our government increasingly occur online, the need for precise communication has also increased. Open-government initiatives can give knowledge and voices to more citizens than ever before, but this can lead to an almost overwhelming quantity of discussion. Various technologies can help us to manage and make sense of this information, but these technologies are most effective with unambiguous data. For example, tools could sort citizens’ comments on a bill by section, but this task can be difficult unless the comments cite sections. One way to encourage citations is by placing tags in the dataset that citizens and open-government tools can easily reference.
The structure of XML implicitly enables referencing of elements in a sense. A citizen could cite the seventh “<PARARGRAPH>” element in the twenty-eighth “<DOCUMENT>” element in a dataset. Even ignoring how error-prone counting is for humans, reliance on this structure is not ideal. XML schemas can specify order for elements of different types but not the same type—a parser could validly retrieve <PARAGRAPH> elements of a document in any order (we’ll discuss in our next post why labels and ordering should be treated as two separate problems; our point here is only that element order should not be used as an implicit label). In addition, different parties may come up with different reference schemes in the absence of an explicit authoritative one. The agency creating a dataset might refer to the paragraph referenced above as Section XII of Document K6-2495, and another developer might refer to it as “<PARAGRAPH>” 147. An abundance of reference schemes can make it harder for government officials to understand citizens, harder for citizens to understand each other, and harder for developers to merge the function and output of their tools. Using an explicit common reference scheme avoids these issues.
Of course, different uses require different forms of labeling, and agencies cannot meet the desires of everyone. How can they decide where to add labels? Recall that our previous posts address the question of who should add what structure to a dataset. Agencies should use the answer as a guide for where to add labels, generally adding labels to all elements they create. If an agency breaks text up by paragraph, each paragraph should be citable; if it breaks text up by sentence, each sentence should be citable. Labels are fairly straightforward to add to elements in XML, so this rule imposes minimal additional work on agencies. Additional partitioning and labeling of data can be left to private parties. Some precedence already exists for private party involvement here: Citability.org is working to enable citation of government documents at a paragraph level.
When agencies add labels, they should strive to use the same reference schemes used internally. Unfortunately, labeling schemes utilizing Roman numerals, letters, or almost anything other than Arabic numerals (0, 1, 2, etc.) can be hard to process. For these cases, the agency should include two labels: an internal agency label and a numeric label. While this suggestion runs counter to our rule against redundancy, it makes the labels far easier to process and facilities easy translation between both schemes.
In general, however, the lessons from past posts should be kept in mind when labeling, including the points about avoiding redundancy: the label for Part 2 of a document should appear in element names and attributes (e.g., “<PART LABEL="2">[…]</PART>”) rather than text. Labels should uniquely identify an element among those with the same parent, but a label may not be necessary if an element’s type is unique among its siblings.
To make these recommendations more concrete, we end with an example. Consider the following document:
Notice 2982: Proposal to Increase Public Transit Fees Section I. Budget Shortfall In fiscal year 2009, [...] Unless changes are made [...] Section II. Decreasing the Deficit To compensate for [...] This relatively modest [...]
This document could be represented in a dataset as:
<DATASET> [...] <NOTICE LABEL="2982"> <TITLE>Proposal to Increase Public Transit Fees</TITLE> <SECTION AGENCY_LABEL="I" LABEL="1"> <TITLE>Budget Shortfall</TITLE> <PARAGRAPH LABEL="1">In fiscal year 2009, [...]</PARAGRAPH> <PARAGRAPH LABEL="2">Unless changes are made [...]</PARAGRAPH> </SECTION> <SECTION AGENCY_LABEL="II" LABEL="2"> <TITLE>Decreasing the Deficit</TITLE> <PARAGRAPH LABEL="1">To compensate for [...]</PARAGRAPH> <PARAGRAPH LABEL="2">This relatively modest [...]</PARAGRAPH> </SECTION> </NOTICE> [...] </DATASET>
Among other things, we can uniquely reference the notice (Notice 2982) and each paragraph (e.g., Notice 2982, Section II, paragraph 1).
In our next post, we’ll discuss how agencies can handle errors and make other changes while reducing the strain on developers.