April 23, 2014


Labeling Dataset Contents

[This is the third post in a series on best practices for government datasets by Harlan Yu and me. (previous posts)]

When the government releases a dataset, citizens ideally will discuss the contents and supply educated feedback. The ability to reference facts and figures in a dataset supports a constructive dialog. Vague concerns are harder to articulate and address than ones citing specific paragraphs in a document. In this post, we’ll discuss why data labeling supports this goal, and when and how government agencies should uniquely label data inside a dataset for citability. As in the previous post, our focus will be on XML, though the lessons apply to other formats.

As our interactions with each other and with our government increasingly occur online, the need for precise communication has also increased. Open-government initiatives can give knowledge and voices to more citizens than ever before, but this can lead to an almost overwhelming quantity of discussion. Various technologies can help us to manage and make sense of this information, but these technologies are most effective with unambiguous data. For example, tools could sort citizens’ comments on a bill by section, but this task can be difficult unless the comments cite sections. One way to encourage citations is by placing tags in the dataset that citizens and open-government tools can easily reference.

The structure of XML implicitly enables referencing of elements in a sense. A citizen could cite the seventh “<PARARGRAPH>” element in the twenty-eighth “<DOCUMENT>” element in a dataset. Even ignoring how error-prone counting is for humans, reliance on this structure is not ideal. XML schemas can specify order for elements of different types but not the same type—a parser could validly retrieve <PARAGRAPH> elements of a document in any order (we’ll discuss in our next post why labels and ordering should be treated as two separate problems; our point here is only that element order should not be used as an implicit label). In addition, different parties may come up with different reference schemes in the absence of an explicit authoritative one. The agency creating a dataset might refer to the paragraph referenced above as Section XII of Document K6-2495, and another developer might refer to it as “<PARAGRAPH>” 147. An abundance of reference schemes can make it harder for government officials to understand citizens, harder for citizens to understand each other, and harder for developers to merge the function and output of their tools. Using an explicit common reference scheme avoids these issues.
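
To illustrate how competing reference schemes diverge, here is a minimal Python sketch. The dataset, its element names, and the helper functions are hypothetical; only the LABEL attribute convention comes from this post. Two developers counting positions arrive at different references for the same paragraph, while the explicit labels give one answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical dataset, invented for illustration only.
DATASET = """
<DATASET>
  <DOCUMENT LABEL="A">
    <PARAGRAPH LABEL="1">Alpha one.</PARAGRAPH>
    <PARAGRAPH LABEL="2">Alpha two.</PARAGRAPH>
  </DOCUMENT>
  <DOCUMENT LABEL="B">
    <PARAGRAPH LABEL="1">Beta one.</PARAGRAPH>
  </DOCUMENT>
</DATASET>
"""

def find_by_label(root, doc_label, para_label):
    """Look up a paragraph through explicit labels rather than position."""
    path = f"DOCUMENT[@LABEL='{doc_label}']/PARAGRAPH[@LABEL='{para_label}']"
    return root.find(path)

def global_position(root, target):
    """One developer's scheme: count every PARAGRAPH in the dataset (1-based)."""
    for index, paragraph in enumerate(root.iter("PARAGRAPH"), start=1):
        if paragraph is target:
            return index
    return None

def nested_position(root, target):
    """Another developer's scheme: (document number, paragraph number), 1-based."""
    for doc_index, document in enumerate(root.findall("DOCUMENT"), start=1):
        for para_index, paragraph in enumerate(document.findall("PARAGRAPH"), start=1):
            if paragraph is target:
                return (doc_index, para_index)
    return None

root = ET.fromstring(DATASET)
target = find_by_label(root, "B", "1")
print(global_position(root, target))   # position counted across the dataset
print(nested_position(root, target))   # position counted within each document
```

Both positional schemes also depend on the order in which the parser returns elements, which is exactly the reliance we recommend against; the label lookup does not.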

Of course, different uses require different forms of labeling, and agencies cannot meet the desires of everyone. How can they decide where to add labels? Recall that our previous posts address the question of who should add what structure to a dataset. Agencies should use the answer as a guide for where to add labels, generally adding labels to all elements they create. If an agency breaks text up by paragraph, each paragraph should be citable; if it breaks text up by sentence, each sentence should be citable. Labels are fairly straightforward to add to elements in XML, so this rule imposes minimal additional work on agencies. Additional partitioning and labeling of data can be left to private parties. Some precedent already exists for private-party involvement here: Citability.org is working to enable citation of government documents at a paragraph level.

When agencies add labels, they should strive to use the same reference schemes used internally. Unfortunately, labeling schemes utilizing Roman numerals, letters, or almost anything other than Arabic numerals (0, 1, 2, etc.) can be hard to process. For these cases, the agency should include two labels: an internal agency label and a numeric label. While this suggestion runs counter to our rule against redundancy, it makes the labels far easier to process and facilitates easy translation between both schemes.
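
As a rough sketch of what processing a non-numeric scheme involves, the Python below converts Roman-numeral labels to integers and pairs the two labels. The attribute name NUMERIC is a hypothetical choice of ours, not an established convention, and the converter does not validate malformed numerals.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(label):
    """Convert a Roman-numeral label such as 'XII' to an integer.

    Assumes a well-formed numeral; malformed input is not rejected.
    """
    values = [ROMAN_VALUES[ch] for ch in label.upper()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (IV = 4, IX = 9).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total

def dual_labels(agency_label):
    """Pair the agency's label with a numeric one (NUMERIC is a hypothetical name)."""
    return {"LABEL": agency_label, "NUMERIC": str(roman_to_int(agency_label))}

sections = ["X", "IX", "V"]
print(sorted(sections))                    # plain string sort
print(sorted(sections, key=roman_to_int))  # sort by numeric value
print(dual_labels("XII"))
```

A plain string sort does not put Roman-numeral labels in document order, which is one small example of the extra work every consumer would otherwise have to repeat.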

In general, however, the lessons from past posts should be kept in mind when labeling, including the points about avoiding redundancy: the label for Part 2 of a document should appear in element names and attributes (e.g., “<PART LABEL="2">[...]</PART>”) rather than text. Labels should uniquely identify an element among those with the same parent, but a label may not be necessary if an element’s type is unique among its siblings.
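
The sibling rule above can be checked mechanically. The following Python sketch is one possible reading of it, under two assumptions of ours: labels live in a LABEL attribute, and uniqueness is judged per element type among siblings. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def find_label_problems(root):
    """Return messages for siblings whose labels fail to identify them uniquely."""
    problems = []
    for parent in root.iter():
        children = list(parent)
        type_counts = Counter(child.tag for child in children)
        seen = set()
        for child in children:
            label = child.get("LABEL")
            if label is None:
                # A missing label is acceptable only when the element's
                # type is unique among its siblings.
                if type_counts[child.tag] > 1:
                    problems.append(
                        f"unlabeled <{child.tag}> has same-type siblings under <{parent.tag}>"
                    )
                continue
            key = (child.tag, label)
            if key in seen:
                problems.append(
                    f"duplicate label {label!r} on <{child.tag}> under <{parent.tag}>"
                )
            seen.add(key)
    return problems

good = ET.fromstring(
    '<PART LABEL="2"><TITLE>T</TITLE>'
    '<PARAGRAPH LABEL="1">a</PARAGRAPH><PARAGRAPH LABEL="2">b</PARAGRAPH></PART>'
)
bad = ET.fromstring(
    '<PART LABEL="2"><PARAGRAPH LABEL="1">a</PARAGRAPH>'
    '<PARAGRAPH LABEL="1">b</PARAGRAPH><PARAGRAPH>c</PARAGRAPH></PART>'
)
print(find_label_problems(good))
print(find_label_problems(bad))
```

Note that the unlabeled TITLE in the first document passes, because it is the only TITLE under its parent.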

To make these recommendations more concrete, we end with an example. Consider the following document:

  Notice 2982:  Proposal to Increase Public Transit Fees

  Section I.  Budget Shortfall
  In fiscal year 2009, [...]
  Unless changes are made [...]

  Section II.  Decreasing the Deficit
  To compensate for [...]
  This relatively modest [...]

This document could be represented in a dataset as:

  <NOTICE LABEL="2982">
    <TITLE>Proposal to Increase Public Transit Fees</TITLE>
    <SECTION LABEL="I">
      <TITLE>Budget Shortfall</TITLE>
      <PARAGRAPH LABEL="1">In fiscal year 2009, [...]</PARAGRAPH>
      <PARAGRAPH LABEL="2">Unless changes are made [...]</PARAGRAPH>
    </SECTION>
    <SECTION LABEL="II">
      <TITLE>Decreasing the Deficit</TITLE>
      <PARAGRAPH LABEL="1">To compensate for [...]</PARAGRAPH>
      <PARAGRAPH LABEL="2">This relatively modest [...]</PARAGRAPH>
    </SECTION>
  </NOTICE>

Among other things, we can uniquely reference the notice (Notice 2982) and each paragraph (e.g., Notice 2982, Section II, paragraph 1).
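
A tool could resolve such a citation along the following lines. This Python sketch assumes each section is wrapped in a SECTION element carrying the agency's Roman-numeral label; that element, the abbreviated XML, and the resolve function are our assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the notice; the SECTION wrapper is an assumption.
NOTICE_XML = """
<NOTICE LABEL="2982">
  <TITLE>Proposal to Increase Public Transit Fees</TITLE>
  <SECTION LABEL="II">
    <TITLE>Decreasing the Deficit</TITLE>
    <PARAGRAPH LABEL="1">To compensate for [...]</PARAGRAPH>
    <PARAGRAPH LABEL="2">This relatively modest [...]</PARAGRAPH>
  </SECTION>
</NOTICE>
"""

def resolve(root, citation):
    """Follow a citation, given as (element type, label) pairs from the root down.

    Returns the cited element, or None if any step does not match.
    """
    (tag, label), *rest = citation
    if root.tag != tag or root.get("LABEL") != label:
        return None
    node = root
    for tag, label in rest:
        # Each step selects the child of the given type with the given label.
        node = node.find(f"{tag}[@LABEL='{label}']")
        if node is None:
            return None
    return node

notice = ET.fromstring(NOTICE_XML)
cited = resolve(notice, [("NOTICE", "2982"), ("SECTION", "II"), ("PARAGRAPH", "1")])
print(cited.text)
```

Because every step matches on a label rather than a position, the lookup does not depend on the order in which elements appear.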

In our next post, we’ll discuss how agencies can handle errors and make other changes while reducing the strain on developers.


  1. Kevin Marks says:

    For a document of the type you’re talking about, use HTML. There is no excuse for making up bad random tags like that. If you insist on XML, use ePub, which is HTML disguised as XML to placate you.

    Now go and read Tim Bray’s Don’t Invent XML Languages essay.

  2. Adhemar says:

    I am not at all sure that the slightly higher difficulty of parsing non-numeric labeling schemes is a convincing reason to introduce two labeling schemes: the existing human-readable one and a new numeric-only one.

    I would suggest to governments and other organisations to try to fit their existing labeling schemes in the Uniform Resource Locator/Identifier (URL/URI) hierarchy.

    For example:

    Notice 2982 could be published at the URL http ://example.gov/agency/notices/2982

    Section II thereof could be assigned the URL http ://example.gov/agency/notices/2982/II
    and/or fragment URL http ://example.gov/agency/notices/2982#section-II
    (using a fragment identifier in the bigger document of the notice)

    The first paragraph thereof could be assigned the URL http ://example.gov/agency/notices/2982/II/1
    and/or fragment URL http ://example.gov/agency/notices/2982#section-II-par-1

    (Ignore the space in http ://; it was necessary to convince the spam filter.)

    The use of the URI hierarchy has as added benefit that one can always refer to the notice, section or paragraph with its universally unique identifier (the absolute URI); but when the context is clear (locally in the document or with an explicitly set base URI) one can also use relative paths.

    • jcalandr says:

      While I’m rethinking the suggestions on numeric labels, I’d actually like to reaffirm the suggestion of separating agency labels from general labels and go a bit farther. I’m now thinking that all elements in a dataset should be uniquely labeled within that dataset and should always have a label that is separate from the agency label. As I’ll discuss in the next post, the agency labels may change as errors are corrected–Section X may become Subsection Y. Consistency in labels can really help developers here, and the agency label will not necessarily supply that consistency. Aliases can allow multiple schemes to map to the same document, so an agency label and a general label can peacefully coexist even if you want to publish documents in a manner that makes them web accessible based on the label.

      Though I’m not yet convinced that the URL/URI approach is generally the right one (I’m not arguing that it’s the wrong one, just still balancing pros and cons of the options), I have reconsidered the point about numeric labels. Some of my original rationale for strictly numeric labels has been rendered moot by other points that we have made or will make.


      • Anonymous says:

        URLs have an even bigger problem: long-term stability. This goes deeper than section labels and whatnot — the XML schema link at the start of the XML file will eventually succumb to link rot, and then what happens?

        The long-term solution is clear: we need a type of truly permanent URL, and that means we have to get away from the current system where part of a URL specifies a particular network host as storage site. The identity of a document must become divorced technically, as it already is divorced in fact, from the storage location(s) of a document.

        I propose a three-layer system. At the bottom, we have domain names, IP addresses, HTTP URLs, and suchlike as we use now; but we add two layers of indirection.

        The first of these uses a new URI scheme, say object://SHA-1/hash-of-object or similarly, and relies on a distributed lookup system that can translate these into first-layer addresses. Google is well-positioned to develop such a lookup system; they already spider much of the web and can add hashing of the files they find and the ability to resolve hash-based links to underlying links. Other search engines could do likewise, and the marketplace would eventually have multiple Object Location Servers for this layer, as it has many Domain Name Servers and registrars for the layer below.

        The final layer would also involve a DNS-like system, and like the bottom layer’s DNS it would involve human-granted names (and thus, some eventual ICANN-like authority to prevent name collisions). This would supply a translation service between yet another URI scheme, say document://human-readable-name, to object:// URIs.

        Note that these latter two are truly URIs, resource *identifiers* rather than *locators*.

        Note additionally that once such a system is in place (or even just layer two), rot-resistant links will make it much more feasible to move or even scatter content around and to mirror it readily. The system resembles some of the schemes used internally by P2P systems and this is not coincidence; ultimately the bottom layer should be replaceable with an actual P2P system, whereupon a bunch of new effects kick in, such as that not only are links rot-resistant but Slashdot-resistant (a massive surge in traffic will spread copies far and wide and consequently cause the hosting capacity to automatically scale with demand). Hosting becomes cheap: a business-class broadband network connection plus a fat disk drive plus making part of it visible to remote user agents (including Googlebot) equals hosting. Some of the traffic will come to you; over time more and more of it will go to other machines whose browsers have cached the content (given future browsers that have fat, long-lived caches and make these visible over the ‘net).

        The last is a bit of a privacy concern: if caches are remote-viewable, people can figure out what you’ve been reading. One extreme solution to this is if society adjusts enough as the MySpace generation grows up that privacy becomes moot — nobody looks down on anyone based on what they view/browse/read and societal expectations of various sorts have changed away from assorted forms of shame, prudishness, and the like. (As for transactions online, replace systems dependent on shared secret numbers with PKI.) At the other extreme caches are made opaque but still usable using fancy encryption tricks and onion routing. This makes the whole system slower and less reliable, though. (Freenet is an attempt at developing such a system, which could serve as a replacement layer 1 as well as providing layer 2 since it’s hash-addressable). In between one might have privacy settings in browsers, deciding what to make available from what you download and what to hide from the outside world.

        Oh, and what about copyright?

        To hell with copyright. Its day is done and the development of some system like the above is inevitable and will inevitably hasten copyright’s demise. There might be a brief flourishing of CC licenses for a while but eventually those will become irrelevant too. Attribution will be easy to prove and hard to fake in a hyper-online future of hyper-findable documents; already Google has made getting away with plagiarism very difficult. ShareAlike will be pretty much the default as copyright withers. NoDerivs serves no useful purpose given that originals can be distinguished from modified versions, and that boils down to attribution, which was already dealt with. And Noncommercial is nonsense. It’s already a PITA: is using a copy on a site that has ads “commercial”? There’s no clear line between commercial use and non as it stands, and the only real reason for the author to care is to try to extract rents from commercial users, which won’t work anyway for much longer; there will be a glut of similar, largely-substitutable and easily-findable content, increasingly sophisticated search tools, and the like, so it will be easy to shop around for the lowest price and that lowest price will inevitably trend to zero. At that point copyright becomes a dead letter.

      • Adhemar says:

        Ideally, one wants URIs to be human-readable (simple), stable, and manageable.

        However, the human-readability and stability requirements are conflicting. Numeric labels are more stable, but less readable. Textual labels (titles, concepts, …) are more readable, but titles and vocabulary may change over time. This shouldn’t happen too often (Cool URIs don’t change!), but if it happens, applications with links using the earlier URI should not break.

        Luckily, URI mechanisms provide Temporary Redirects (HTTP 307) and Permanent Moves (HTTP 301). Additionally, if you use a URI for a non-information resource (a thing), you can use See Other (HTTP 303) to link to an information resource about the non-information resource.

        Redirection solves the problem of changing titles. Your section-X-becoming-subsection-Y problem might be solved this way too. However, straightforward redirection does not allow reusing earlier vocabulary.

        That’s why Tim Berners-Lee and others argue for versioned labels (Tim Berners-Lee prefers adding the year when the term is introduced). See http://www.jenitennison.com/blog/node/112 .

  3. Anonymous says:

    Was there some kind of intentional comment, perhaps about machine-readable versus human-readable tags or the possibility of typos screwing up data, when you wrote “PARARGRAPH” (with an extra ‘R’) for the first XML tag?