November 21, 2024

Archives for March 2010

Labeling Dataset Contents

[This is the third post in a series on best practices for government datasets by Harlan Yu and me. (previous posts)]

When the government releases a dataset, citizens ideally will discuss the contents and supply educated feedback. The ability to reference facts and figures in a dataset supports a constructive dialog. Vague concerns are harder to articulate and address than ones citing specific paragraphs in a document. In this post, we’ll discuss why data labeling supports this goal, and when and how government agencies should uniquely label data inside a dataset for citability. As in the previous post, our focus will be on XML, though the lessons apply to other formats.

As our interactions with each other and with our government increasingly occur online, the need for precise communication has also increased. Open-government initiatives can give knowledge and voices to more citizens than ever before, but this can lead to an almost overwhelming quantity of discussion. Various technologies can help us to manage and make sense of this information, but these technologies are most effective with unambiguous data. For example, tools could sort citizens’ comments on a bill by section, but this task can be difficult unless the comments cite sections. One way to encourage citations is by placing tags in the dataset that citizens and open-government tools can easily reference.

In a sense, the structure of XML already lets us reference elements by position. A citizen could cite the seventh “<PARAGRAPH>” element in the twenty-eighth “<DOCUMENT>” element in a dataset. Even ignoring how error-prone counting is for humans, reliance on this structure is not ideal. XML schemas can specify order for elements of different types but not for elements of the same type, so a parser could validly retrieve the <PARAGRAPH> elements of a document in any order (we’ll discuss in our next post why labels and ordering should be treated as two separate problems; our point here is only that element order should not be used as an implicit label). In addition, different parties may come up with different reference schemes in the absence of an explicit authoritative one. The agency creating a dataset might refer to the paragraph referenced above as Section XII of Document K6-2495, and another developer might refer to it as “<PARAGRAPH>” 147. An abundance of reference schemes can make it harder for government officials to understand citizens, harder for citizens to understand each other, and harder for developers to merge the function and output of their tools. Using an explicit common reference scheme avoids these issues.

Of course, different uses require different forms of labeling, and agencies cannot meet the desires of everyone. How can they decide where to add labels? Recall that our previous posts address the question of who should add what structure to a dataset. Agencies should use the answer as a guide for where to add labels, generally adding labels to all elements they create. If an agency breaks text up by paragraph, each paragraph should be citable; if it breaks text up by sentence, each sentence should be citable. Labels are fairly straightforward to add to elements in XML, so this rule imposes minimal additional work on agencies. Additional partitioning and labeling of data can be left to private parties. Some precedent already exists for private party involvement here: Citability.org is working to enable citation of government documents at a paragraph level.

When agencies add labels, they should strive to use the same reference schemes used internally. Unfortunately, labeling schemes using Roman numerals, letters, or almost anything other than Arabic numerals (0, 1, 2, etc.) can be hard to process. For these cases, the agency should include two labels: an internal agency label and a numeric label. While this suggestion runs counter to our rule against redundancy, it makes the labels far easier to process and facilitates easy translation between both schemes.
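To give a sense of the processing burden, here is a rough Python sketch of the conversion that every developer would otherwise have to write for Roman-numeral labels. The function name roman_to_int is our own invention for illustration, and the sketch makes no attempt to reject malformed numerals such as "IIII". It also cannot tell whether a label like "I" is meant as a numeral or as a letter, which is exactly the kind of ambiguity a numeric label removes.

```python
# Hypothetical helper, for illustration only: convert a Roman-numeral label
# (as an agency might use internally) to an integer.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(label: str) -> int:
    """Convert a Roman numeral such as 'XII' to 12.

    Raises ValueError if the label contains other characters. Malformed
    numerals built from valid characters (e.g. 'IIII') are NOT rejected.
    """
    text = label.strip().upper()
    if not text or any(ch not in ROMAN_VALUES for ch in text):
        raise ValueError(f"not a Roman numeral: {label!r}")
    total = 0
    for index, ch in enumerate(text):
        value = ROMAN_VALUES[ch]
        following = text[index + 1] if index + 1 < len(text) else None
        # A smaller value placed before a larger one is subtracted (IV = 4).
        if following is not None and value < ROMAN_VALUES[following]:
            total -= value
        else:
            total += value
    return total


print(roman_to_int("XII"))
```

With a numeric label supplied alongside the agency label, none of this code is needed: the developer simply reads the number.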

In general, however, the lessons from past posts should be kept in mind when labeling, including the points about avoiding redundancy: the label for Part 2 of a document should appear in element names and attributes (e.g., “<PART LABEL="2">[…]</PART>”) rather than text. Labels should uniquely identify an element among those with the same parent, but a label may not be necessary if an element’s type is unique among its siblings.
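An agency could check the uniqueness rule mechanically before release. The following Python sketch uses the standard library's ElementTree module; the function name duplicate_sibling_labels, the LABEL attribute name, and the sample data are our assumptions for illustration. We also assume that a label only needs to be unique among siblings of the same element type.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_sibling_labels(root, attr="LABEL"):
    """Return (parent_tag, child_tag, label) triples for labels that are
    repeated among same-tag children of a single parent."""
    problems = []
    for parent in root.iter():
        counts = Counter(
            (child.tag, child.get(attr))
            for child in parent
            if child.get(attr) is not None
        )
        for (tag, label), count in counts.items():
            if count > 1:
                problems.append((parent.tag, tag, label))
    return problems


# Made-up sample: the first section reuses paragraph label "1".
SAMPLE = """
<NOTICE LABEL="2982">
  <SECTION LABEL="1">
    <PARAGRAPH LABEL="1">first</PARAGRAPH>
    <PARAGRAPH LABEL="1">second</PARAGRAPH>
  </SECTION>
  <SECTION LABEL="2">
    <PARAGRAPH LABEL="1">third</PARAGRAPH>
  </SECTION>
</NOTICE>
"""

problems = duplicate_sibling_labels(ET.fromstring(SAMPLE))
print(problems)
```

Note that paragraph label "1" in the second section is not flagged: it has a different parent, so the full citation remains unambiguous.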

To make these recommendations more concrete, we end with an example. Consider the following document:

  Notice 2982:  Proposal to Increase Public Transit Fees

  Section I.  Budget Shortfall
  In fiscal year 2009, [...]
  Unless changes are made [...]

  Section II.  Decreasing the Deficit
  To compensate for [...]
  This relatively modest [...]

This document could be represented in a dataset as:

<DATASET>
  [...]
  <NOTICE LABEL="2982">
    <TITLE>Proposal to Increase Public Transit Fees</TITLE>
    <SECTION AGENCY_LABEL="I" LABEL="1">
      <TITLE>Budget Shortfall</TITLE>
      <PARAGRAPH LABEL="1">In fiscal year 2009, [...]</PARAGRAPH>
      <PARAGRAPH LABEL="2">Unless changes are made [...]</PARAGRAPH>
    </SECTION>
    <SECTION AGENCY_LABEL="II" LABEL="2">
      <TITLE>Decreasing the Deficit</TITLE>
      <PARAGRAPH LABEL="1">To compensate for [...]</PARAGRAPH>
      <PARAGRAPH LABEL="2">This relatively modest [...]</PARAGRAPH>
    </SECTION>
  </NOTICE>
  [...]
</DATASET>

Among other things, we can uniquely reference the notice (Notice 2982) and each paragraph (e.g., Notice 2982, Section II, paragraph 1).
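To illustrate how a tool might resolve such a citation, here is a minimal Python sketch using the standard library's ElementTree module. The function name resolve_citation is hypothetical, and the sketch assumes that labels contain no quote characters, since they are pasted directly into a path expression.

```python
import xml.etree.ElementTree as ET

# The example dataset from this post, without the elided neighbors.
DATASET = """
<DATASET>
  <NOTICE LABEL="2982">
    <TITLE>Proposal to Increase Public Transit Fees</TITLE>
    <SECTION AGENCY_LABEL="I" LABEL="1">
      <TITLE>Budget Shortfall</TITLE>
      <PARAGRAPH LABEL="1">In fiscal year 2009, [...]</PARAGRAPH>
      <PARAGRAPH LABEL="2">Unless changes are made [...]</PARAGRAPH>
    </SECTION>
    <SECTION AGENCY_LABEL="II" LABEL="2">
      <TITLE>Decreasing the Deficit</TITLE>
      <PARAGRAPH LABEL="1">To compensate for [...]</PARAGRAPH>
      <PARAGRAPH LABEL="2">This relatively modest [...]</PARAGRAPH>
    </SECTION>
  </NOTICE>
</DATASET>
"""


def resolve_citation(root, notice, agency_section, paragraph):
    """Return the PARAGRAPH element for a citation such as
    (Notice 2982, Section II, paragraph 1), or None if nothing matches."""
    path = (
        f"./NOTICE[@LABEL='{notice}']"
        f"/SECTION[@AGENCY_LABEL='{agency_section}']"
        f"/PARAGRAPH[@LABEL='{paragraph}']"
    )
    return root.find(path)


root = ET.fromstring(DATASET)
cited = resolve_citation(root, "2982", "II", "1")
print(cited.text)
```

The lookup never counts elements, so it keeps working regardless of the order in which a parser returns sections or paragraphs.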

In our next post, we’ll discuss how agencies can handle errors and make other changes while reducing the strain on developers.

Basic Data Format Lessons

[This is the second post in a series on best practices for government datasets by Harlan Yu and me. (previous post)]

The preferences of developers may not be obvious to those producing a dataset. Seemingly innocuous choices by data providers can lead to major headaches for developers. In this post, we discuss some of the more basic challenges that developers encounter when working with a dataset. These lessons may seem trivial to our more technical readers, but they’re often learned through experience. Our hope is to reduce this learning curve by explaining how various practices affect developers. We’ll focus on XML datasets, but many of the topics apply to CSV and other data formats.

One of the hardest parts of working with a dataset can be figuring out what’s in it and how it’s organized. What data comes inside an “<FL47>” tag? Can a “<TEXT>” element ever contain a “<PARAGRAPH>” element? Developers rely heavily on documentation to explain the structure and contents of a dataset. When working with XML, one particularly relevant item is known as a schema. An XML schema is a separate file with an extension such as “.dtd” or “.xsd,” and it provides a blueprint of the permitted structure for corresponding XML files. XML schema files tell developers where they can recover the information that they need from a dataset. These schema files and other documentation are often a necessity for developers, and they should be treated as such by data providers. Any XML file supplied by an agency should contain a complete URL address at which its schema can be found. Further, any link to an XML document on a government site should have prominent links near it for the corresponding schema file and reasonable documentation describing the contents of the dataset.

XML schema files can be seen as an informal contract between data providers and developers, effectively promising that a dataset will match the specified structure. Unfortunately, sometimes datasets contain flaws causing them not to match that structure. Although experienced developers produce software that detects the existence of structural errors, these errors can be difficult or impossible for them to isolate and correct. The people in the best position to catch and fix structural errors are the people producing a dataset. Numerous validation tools exist for ensuring that an XML document is well-formed and valid—that is, the document is structurally sound and matches its XML schema. Prior to releasing a dataset, an agency should run a validator on it to check for structural flaws. This sanity check can take just a few moments for an agency but save hours of developer time.
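As a rough illustration of how cheap the first half of this check is, the Python sketch below tests well-formedness only, using the standard library's ElementTree module. The function name well_formedness_error is ours. ElementTree does not validate a document against a DTD or XSD schema, so checking validity would require a separate validating tool.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return None if xml_text is well-formed XML; otherwise return the
    parser's error message. This does NOT check validity against a schema."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


good = "<DOCUMENT><PARAGRAPH>text</PARAGRAPH></DOCUMENT>"
bad = "<DOCUMENT><PARAGRAPH>text</DOCUMENT>"  # PARAGRAPH is never closed

print(well_formedness_error(good))
print(well_formedness_error(bad))
```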

When deciding on the structure of a dataset, an agency should strive for simplicity while logically representing the underlying data. The addition of elements, attributes, or children in a schema can improve the quality and clarity of the dataset, but it can also add unnecessary complexity. When designing schemas, there’s a tendency to include elements or other structure that will almost certainly go unused in practice. Schema designers may assume that extraneous items do no harm, but developers must cautiously account for them if allowed by schema. The result can be wasted developer time and increased software complexity. The true cost of various structural choices is not just the time necessary to encode these choices in a schema but also the burden these choices impose on developers. Additional structural complexity must provide a justifiable benefit.

In some cases, however, the addition of elements or attributes is not only justifiable but highly desirable for developers: logically distinct pieces of data should appear in separate XML elements or attributes. Suppose that a developer wishes to access a piece of data in a dataset. If the data is combined with other information, the developer will need to figure out how to extract it from the combined field. This extraction can be difficult, time-consuming, and prone to errors. For example, assume that a data provider includes the following element:

<DOCINFO>Doc No. 2001345--Released 01-01-2001</DOCINFO>

To extract the document number, a developer might look for all characters following “No.” but before a dash. While this is straightforward enough, other parts of the same or future datasets might instead use the document number format “2001-345” or separate the document number and release date with a space rather than a double-dash. Neither case would lead to invalid XML, but both would break the developer’s extraction tool. Now consider this alternative:

<DOCINFO>
  <DOC_NO>2001345</DOC_NO>
  <RELEASE_DATE>01-01-2001</RELEASE_DATE>
</DOCINFO>

Using extra elements to separate logically distinct data can prevent extraction errors. This lesson often applies even when the combined data is related. For example, the version number 5.3.2 could be broken into major version 5, minor version 3, and revision 2. In general, agencies should separate such items themselves when they can do so more easily than developers.
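The difference shows up clearly in code. In the Python sketch below, the function names are our own and the regular expression is just one plausible way a developer might implement the "after No. but before a dash" rule. It handles the original combined field but silently returns wrong values for the two variations described above, while the structured version needs no guessing.

```python
import re
import xml.etree.ElementTree as ET


def doc_no_from_combined(text):
    """Fragile: take the characters after 'No.' and before the first dash."""
    match = re.search(r"No\.\s*([^-]+)-", text)
    return match.group(1).strip() if match else None


def doc_no_from_structured(xml_text):
    """Robust: read the dedicated DOC_NO element."""
    return ET.fromstring(xml_text).findtext("DOC_NO")


original = doc_no_from_combined("Doc No. 2001345--Released 01-01-2001")
dashed = doc_no_from_combined("Doc No. 2001-345--Released 01-01-2001")
spaced = doc_no_from_combined("Doc No. 2001345 Released 01-01-2001")

structured = doc_no_from_structured(
    "<DOCINFO><DOC_NO>2001345</DOC_NO>"
    "<RELEASE_DATE>01-01-2001</RELEASE_DATE></DOCINFO>"
)

print(original)    # 2001345
print(dashed)      # 2001 (truncated at the dash inside the number)
print(spaced)      # 2001345 Released 01 (runs on into the date)
print(structured)  # 2001345
```

Neither failure raises an error, which is what makes this class of bug so costly: the tool keeps running and produces bad data.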

Even when the basic structure of a dataset is ideal, choices about how to provide data inside this structure can affect developers. Developers thrive on consistency. Suppose that a dataset details various costs. Consider the many possible ways of writing a cost: $4,300, 5938.37, 74 dollars and 63 cents, etc. Unless an agency decides on, documents, and adheres to a standard format, developers’ software must handle a large number of possibilities to avoid surprises. Consistency in a dataset can make a developer’s life far easier, and it reduces the possibility that an unanticipated format will break an application. Note that a schema can help enforce consistency for certain fields. For example, cost might be defined as a decimal field with a constraint on the number of fractional digits.
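When a format is documented, a developer's parsing code can be short and strict. The Python sketch below assumes, purely for illustration, that the documented cost format is a plain decimal number with exactly two fractional digits; the pattern and the function name parse_cost are ours.

```python
import re
from decimal import Decimal

# Assumed documented format: ASCII digits, a period, exactly two more digits.
COST_PATTERN = re.compile(r"[0-9]+\.[0-9]{2}")


def parse_cost(text):
    """Parse a cost in the assumed documented format.

    Raises ValueError for anything else instead of guessing at its meaning.
    """
    if not COST_PATTERN.fullmatch(text):
        raise ValueError(f"cost not in documented format: {text!r}")
    return Decimal(text)


print(parse_cost("5938.37"))
```

Under this assumed format, values such as "$4,300" or "74 dollars and 63 cents" are rejected immediately rather than misread.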

Redundant information is another source of difficulty for developers. Redundancy can appear in numerous ways. Suppose that a dataset contains the element “<VERSION>Version 5</VERSION>.” The word “Version” is unnecessary, and developers must go through additional trouble to extract the version number. In so doing, developers must consider the possibility that “Version” could be misspelled, abbreviated, or omitted. Supplying a version number alone (“<VERSION>5</VERSION>”) would avoid this issue altogether. More subtly, suppose that a dataset contains all bills introduced in Congress on a certain date:

<INTRODUCED_BILLS>
  <DATE>11-12-2014</DATE>
  <HOUSE_BILLS DATE="NOV 12, 2014">
    [...]
  </HOUSE_BILLS>
  <SENATE_BILLS DATE="NOV 12, 2014">
    [...]
  </SENATE_BILLS>
</INTRODUCED_BILLS>

Date information appears three times even though it must be the same in all cases. The more often a piece of information appears in a dataset, the more likely that inconsistencies will occur. These inconsistencies can lead to software errors requiring manual resolution. While redundancy can serve as a sanity check for errors, agencies typically should perform this check themselves if possible before releasing the data. After all, the agency is in the best position to fix inconsistencies. Unless well-justified, agencies should avoid redundancy.
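An agency could run a consistency check like the following before release. This Python sketch is illustrative only: the function names are ours, and it assumes that the numeric format is month-day-year and that month names are three-letter English abbreviations, as in the example above.

```python
import xml.etree.ElementTree as ET
from datetime import date

MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}


def parse_numeric(text):
    """Parse 'MM-DD-YYYY' (month first is an assumption)."""
    month, day, year = (int(part) for part in text.split("-"))
    return date(year, month, day)


def parse_named(text):
    """Parse 'MON DD, YYYY', e.g. 'NOV 12, 2014'."""
    month_name, day, year = text.replace(",", "").split()
    return date(int(year), MONTHS[month_name.upper()], int(day))


def distinct_dates(xml_text):
    """Return the set of distinct dates stated in the three places."""
    root = ET.fromstring(xml_text)
    found = {parse_numeric(root.findtext("DATE"))}
    for tag in ("HOUSE_BILLS", "SENATE_BILLS"):
        found.add(parse_named(root.find(tag).get("DATE")))
    return found


CONSISTENT = """<INTRODUCED_BILLS>
  <DATE>11-12-2014</DATE>
  <HOUSE_BILLS DATE="NOV 12, 2014"></HOUSE_BILLS>
  <SENATE_BILLS DATE="NOV 12, 2014"></SENATE_BILLS>
</INTRODUCED_BILLS>"""

# Simulate a data-entry slip in one of the three copies.
INCONSISTENT = CONSISTENT.replace(
    'SENATE_BILLS DATE="NOV 12', 'SENATE_BILLS DATE="NOV 13'
)

print(len(distinct_dates(CONSISTENT)))    # 1
print(len(distinct_dates(INCONSISTENT)))  # 2
```

The check needs two separate date parsers just to compare values that should never have been able to differ. Stating the date once removes both the check and the parsers.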

Processing datasets often requires a significant amount of developer time, so adherence to even basic rules can dramatically increase innovation. What other low-level recommendations do FTT readers have for non-developers producing datasets?

Tomorrow, we’ll discuss how labeling elements in a dataset can help developers.

Government Datasets That Facilitate Innovation

[This is the first post in a series on best practices for government datasets by Harlan Yu and me.]

There’s a growing consensus that the government can increase its openness and transparency by publishing its raw data in bulk online. As several Freedom to Tinker contributors argued in Government Data and the Invisible Hand, publishing data empowers third party software developers to produce innovative new technologies that engage citizens and illuminate government’s inner workings. With the establishment of Data.gov and the federal Open Government Initiative, federal agencies are quickly embracing a culture of machine-readable data release, and many states and municipalities are now following their lead.

But how usable are these datasets for developers? The answer lies primarily in the structure and contents of the datasets themselves. While all data in digital form is technically machine-readable in some sense, the ease of use for machine-readable datasets can vary widely. In fact, machine-readability is just a baseline requirement: a developer can’t start to work with a dataset until it’s in this form. Once that minimum standard is met, the critical factor is how easy it is for developers to use the dataset in new, innovative ways.

In this series of posts, we’ll draw on our experience building applications that use government data to offer some thoughts about best practices government could follow in releasing data. By taking a few straightforward steps in preparing its datasets, government can make the data much more useful to developers.

One key factor in determining ease of use for developers is the structure of the dataset, and that is the topic of our first post. Let’s start with a trivial example:

<BOOK>A Tale of Two Cities by Charles Dickens. Chapter 1. The Period. It was the best of times, it was the worst of times [...] The end.</BOOK>

This is a “well-formed” XML version of Dickens’s “A Tale of Two Cities” in its entirety. Though more usable than a PDF copy of the book, the XML document lacks basic structure and is not particularly helpful to a developer building tools to display or analyze the book. Compare that to:

<BOOK>
  <HEADER>
    <TITLE>A Tale of Two Cities</TITLE>
    <AUTHOR>Charles Dickens</AUTHOR>
  </HEADER>
  <BODY>
    <CHAPTER NUMBER="1">
      <TITLE>The Period</TITLE>
      <PARAGRAPH NUMBER="1">
        <SENTENCE NUMBER="1">It was the best of times [...]</SENTENCE>
      </PARAGRAPH>
      [...]
    </CHAPTER>
    [...]
  </BODY>
</BOOK>

This data is far more structured, and a developer can take it and immediately do lots of new things. If the developer plans to build an interface for a new e-book reader for instance, it’s easy to extract the component parts of the book for appropriate formatting. With the less-structured version, the developer needs to guess where chapters, titles, and paragraphs begin and end. Because manual analysis is infeasible for large, complex datasets, developers who have only minimally-structured data will need to build automated processing scripts to make these guesses. Developing these scripts can be difficult and time-consuming, and data quality will suffer because the scripts will inevitably make mistakes.
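To show how little work the structured version requires, here is a short Python sketch using the standard library's ElementTree module. The sample is the structured example above with the elided parts left out; the variable names are ours.

```python
import xml.etree.ElementTree as ET

BOOK = """
<BOOK>
  <HEADER>
    <TITLE>A Tale of Two Cities</TITLE>
    <AUTHOR>Charles Dickens</AUTHOR>
  </HEADER>
  <BODY>
    <CHAPTER NUMBER="1">
      <TITLE>The Period</TITLE>
      <PARAGRAPH NUMBER="1">
        <SENTENCE NUMBER="1">It was the best of times [...]</SENTENCE>
      </PARAGRAPH>
    </CHAPTER>
  </BODY>
</BOOK>
"""

root = ET.fromstring(BOOK)

# Each piece of data is addressed directly; nothing is guessed from the text.
title = root.findtext("HEADER/TITLE")
author = root.findtext("HEADER/AUTHOR")
chapters = [
    (chapter.get("NUMBER"), chapter.findtext("TITLE"))
    for chapter in root.iterfind("BODY/CHAPTER")
]
first_sentence = root.findtext(
    "BODY/CHAPTER[@NUMBER='1']/PARAGRAPH[@NUMBER='1']/SENTENCE[@NUMBER='1']"
)

print(title, "by", author)
print(chapters)
print(first_sentence)
```

Doing the same with the single-element version would mean searching the text for patterns such as "Chapter 1." and hoping that no sentence in the book happens to contain them.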

Whether a dataset facilitates innovative uses by developers is not a yes or no question but a matter of degree, and it depends largely on the quality of the data’s structure and the needs of specific developers. In deciding what structure to add, agencies should consider who is in the best position to add various types of structure to the data. Sometimes, the agency is in the best position. Employees of an agency may amass specialized knowledge about the data, or the agency may already internally store the data with structural details like explicit database columns. In these cases, the agency can provide this structure with little effort, relieving developers from the potentially Herculean task of reconstructing these details. In other cases, the agency may have no significant advantage over private parties.

Agencies should get as close to this dividing line as is reasonably possible to broaden the range of creative possibilities for application developers. The goal is to minimize structural obstacles that might prevent developers from tinkering with the data. Better structure leads to more innovative tools, a more transparent government, and a greater appreciation for the work done by federal agencies.

Over our next several posts, we’ll discuss choices that agencies make when releasing datasets and the ways these choices affect developers. Among other things, we’ll explore basic data format lessons, data labeling, and correction/modification of datasets. Our goal is to turn this series into a best practices white paper for government use, and we’d appreciate any comments, suggestions, or insights from readers.