June 15, 2019

Archives for March 2010

Correcting Errors and Making Changes

[This is the fourth post in a series on best practices for government datasets by Harlan Yu and me. (previous posts: 1, 2, 3)]

Even cautiously edited datasets sometimes contain errors, and even meticulously produced schemas require refinement as circumstances change. While errors or changes create inconvenience for developers, most developers appreciate and prepare for their inevitability. Agencies should strive to do the same. A well-developed strategy for fixes and changes can ease their burden on both developers and agencies.

When agencies release data, developers ideally will interact with it in creative new ways. Given datasets containing megabytes to gigabytes of data, novel uses will reveal previously unnoticed errors. Knowledge of these errors benefits the agency as well as other developers using the data, so agencies should take steps to encourage error reporting. Labels in a dataset allow developers to specify errors efficiently and unambiguously. An easy-to-find channel for reporting errors, such as a prominently provided email address or web form, is also critical. Tracking down the contact information of the person responsible for a dataset can be difficult, and a well-known channel reduces this barrier to feedback.

Upon learning of an issue in a dataset, an agency should correct the problem and release the corrected dataset in a timely manner. An important fact to keep in mind when correcting data is that numerous developers may have already downloaded and begun using the old flawed version. For these developers, even a minor modification can cause major issues if not done carefully. Agencies should think about two things: how they will make developers aware that the dataset has been modified and how they will change the dataset itself. The first point is sometimes ignored in spite of its importance. Not only should datasets contain version information, but agencies should also notify developers when the data that they rely on has changed. In particular, agencies should allow developers to subscribe to an email list or an RSS feed for specific datasets that details updates in a well-structured manner. These updates should clearly specify the dataset and version affected, a location where the updated dataset can be found, and a description of the changes to the dataset. When possible, these changes should be specified via a formal, structured description—for example, a diff output—as well as a brief prose explanation.

Correction of dataset contents should proceed cautiously. Suppose that an application allows user to comment on parts of a document. If labels are in a dataset are not maintained consistently across versions, the developer may need to painstakingly map comments from the old data to the corresponding parts of the new dataset. Issues like this can be mitigated through several practices. First, an agency should seek to preserve labels across versions of a dataset when possible (alternatively, in some cases an agency might wish to change the labels but provide a mapping to assist developers). For example, a dataset might aggregate numerous documents, and a minor change in one document should not necessarily change the labels for the other documents. Recall the side note from our previous post that labels should be separate from ordering information. Corrections to a dataset may add, remove, or reorder items. Detaching order from labels can help agencies ensure label consistency across dataset versions. In addition, the last post and its comments discussed whether agencies should provide a label that is separate from its internally used agency label. This separation allows labels to remain consistent even when Subsection X becomes Section Y based on the internal agency labels. Note that these points about consistent labeling can be useful whenever a dataset could have multiple versions: for example, consistent labeling might be beneficial across various versions of a bill.

Similarly, the structure that agencies use for datasets, the locations where the datasets are hosted, and other details of a dataset sometimes must change. Suppose that an agency releases various statistics each month. When the agency is asked to provide a new statistic, the new data may necessitate changes to the XML schema. Alternatively, the agency may decide to host data at the address “http://www.agency.gov/YEAR/MONTH/data.xml” rather than “http://www.agency.gov/MONTH-YEAR/data.xml,” causing issues for automated tools that periodically check for and download new data. To reduce the adverse impact of these changes on developers, agencies should provide detailed notice of the changes as early as possible. Early notice gives developers time to modify their tools. These notifications can occur via an email list or RSS feed providing details of the changes in a clear, consistent format.

The possibility of changes and their impact on developers should be taken into account at all stages of the data production process. Suppose an agency adds an element to a schema that specifies a unique individual, but the schema may someday need to specify a corporation instead. Although the agency should not speculatively add unnecessary elements to the schema, it should be mindful of possible changes when designing the rest of the schema. Various design choices may minimize the impact of a change if necessary later. Agencies should also avoid the urge to alter a schema dramatically each time it requires a minor change. A major overhaul—even when done to clean up the schema—may require equally dramatic changes in tools utilizing the data. To ensure that developers notice changes to XML schemas, both schema files and datasets should contain a prominent schema version number. If an agency changes the location where data is hosted, it should consider temporarily using aliases so that requests using old addresses automatically take you to the correct data. Once the old addresses are phased out, agencies should use a standard HTTP 404 status code to indicate that the requested data was not found at the specified location. Simply supplying a “Not Found” page without this standard code could make life harder for developers whose automated tools must instead parse this page.

When making changes, agencies should consider soliciting input directly from developers. Because the preferences of developers might not be obvious, this input can lead to choices that help developers without increasing the burden on agencies. In fact, developers may even come up with ideas that make life easier for an agency.

Our next and final post in this series will discuss a handful of additional issues for agencies to consider.

Labeling Dataset Contents

[This is the third post in a series on best practices for government datasets by Harlan Yu and me. (previous posts)]

When the government releases a dataset, citizens ideally will discuss the contents and supply educated feedback. The ability to reference facts and figures in a dataset supports a constructive dialog. Vague concerns are harder to articulate and address than ones citing specific paragraphs in a document. In this post, we’ll discuss why data labeling supports this goal, and when and how government agencies should uniquely label data inside a dataset for citability. As in the previous post, our focus will be on XML, though the lessons apply to other formats.

As our interactions with each other and with our government increasingly occur online, the need for precise communication has also increased. Open-government initiatives can give knowledge and voices to more citizens than ever before, but this can lead to an almost overwhelming quantity of discussion. Various technologies can help us to manage and make sense of this information, but these technologies are most effective with unambiguous data. For example, tools could sort citizens’ comments on a bill by section, but this task can be difficult unless the comments cite sections. One way to encourage citations is by placing tags in the dataset that citizens and open-government tools can easily reference.

The structure of XML implicitly enables referencing of elements in a sense. A citizen could cite the seventh “<PARARGRAPH>” element in the twenty-eighth “<DOCUMENT>” element in a dataset. Even ignoring how error-prone counting is for humans, reliance on this structure is not ideal. XML schemas can specify order for elements of different types but not the same type—a parser could validly retrieve <PARAGRAPH> elements of a document in any order (we’ll discuss in our next post why labels and ordering should be treated as two separate problems; our point here is only that element order should not be used as an implicit label). In addition, different parties may come up with different reference schemes in the absence of an explicit authoritative one. The agency creating a dataset might refer to the paragraph referenced above as Section XII of Document K6-2495, and another developer might refer to it as “<PARAGRAPH>” 147. An abundance of reference schemes can make it harder for government officials to understand citizens, harder for citizens to understand each other, and harder for developers to merge the function and output of their tools. Using an explicit common reference scheme avoids these issues.

Of course, different uses require different forms of labeling, and agencies cannot meet the desires of everyone. How can they decide where to add labels? Recall that our previous posts address the question of who should add what structure to a dataset. Agencies should use the answer as a guide for where to add labels, generally adding labels to all elements they create. If an agency breaks text up by paragraph, each paragraph should be citable; if it breaks text up by sentence, each sentence should be citable. Labels are fairly straightforward to add to elements in XML, so this rule imposes minimal additional work on agencies. Additional partitioning and labeling of data can be left to private parties. Some precedence already exists for private party involvement here: Citability.org is working to enable citation of government documents at a paragraph level.

When agencies add labels, they should strive to use the same reference schemes used internally. Unfortunately, labeling schemes utilizing Roman numerals, letters, or almost anything other than Arabic numerals (0, 1, 2, etc.) can be hard to process. For these cases, the agency should include two labels: an internal agency label and a numeric label. While this suggestion runs counter to our rule against redundancy, it makes the labels far easier to process and facilities easy translation between both schemes.

In general, however, the lessons from past posts should be kept in mind when labeling, including the points about avoiding redundancy: the label for Part 2 of a document should appear in element names and attributes (e.g., “<PART LABEL="2">[…]</PART>”) rather than text. Labels should uniquely identify an element among those with the same parent, but a label may not be necessary if an element’s type is unique among its siblings.

To make these recommendations more concrete, we end with an example. Consider the following document:

  Notice 2982:  Proposal to Increase Public Transit Fees

  Section I.  Budget Shortfall
  In fiscal year 2009, [...]
  Unless changes are made [...]

  Section II.  Decreasing the Deficit
  To compensate for [...]
  This relatively modest [...]

This document could be represented in a dataset as:

<DATASET>
  [...]
  <NOTICE LABEL="2982">
    <TITLE>Proposal to Increase Public Transit Fees</TITLE>
    <SECTION AGENCY_LABEL="I" LABEL="1">
      <TITLE>Budget Shortfall</TITLE>
      <PARAGRAPH LABEL="1">In fiscal year 2009, [...]</PARAGRAPH>
      <PARAGRAPH LABEL="2">Unless changes are made [...]</PARAGRAPH>
    </SECTION>
    <SECTION AGENCY_LABEL="II" LABEL="2">
      <TITLE>Decreasing the Deficit</TITLE>
      <PARAGRAPH LABEL="1">To compensate for [...]</PARAGRAPH>
      <PARAGRAPH LABEL="2">This relatively modest [...]</PARAGRAPH>
    </SECTION>
  </NOTICE>
  [...]
</DATASET>

Among other things, we can uniquely reference the notice (Notice 2982) and each paragraph (e.g., Notice 2982, Section II, paragraph 1).

In our next post, we’ll discuss how agencies can handle errors and make other changes while reducing the strain on developers.

Basic Data Format Lessons

[This is the second post in a series on best practices for government datasets by Harlan Yu and me. (previous post)]

When creating a dataset, the preferences of developers may not be obvious to those producing the dataset. Seemingly innocuous choices by data providers can lead to major headaches for developers. In this post, we discuss some of the more basic challenges that developers encounter when working with a dataset. These lessons may seem trivial to our more technical readers, but they’re often learned through experience. Our hope is to reduce this learning curve by explaining how various practices affect developers. We’ll focus on XML datasets, but many of the topics apply to CSV and other data formats.

One of the hardest parts of working with a dataset can be figuring out what’s in it and how it’s organized. What data comes inside an “<FL47>” tag? Can a “<TEXT>” element ever contain a “<PARAGRAPH>” element? Developers rely heavily on documentation to explain the structure and contents of a dataset. When working with XML, one particularly relevant item is known as a schema. An XML schema is a separate file with an extension such as “.dtd” or “.xsd,” and it provides a blueprint of the permitted structure for corresponding XML files. XML schema files tell developers where they can recover the information that they need from a dataset. These schema files and other documentation are often a necessity for developers, and they should be treated as such by data providers. Any XML file supplied by an agency should contain a complete URL address at which its schema can be found. Further, any link to an XML document on a government site should have prominent links near it for the corresponding schema file and reasonable documentation describing the contents of the dataset.

XML schema files can be seen as an informal contract between data providers and developers, effectively promising that a dataset will match the specified structure. Unfortunately, sometimes datasets contain flaws causing them not to match that structure. Although experienced developers produce software that detects the existence of structural errors, these errors can be difficult or impossible for them to isolate and correct. The people in the best position to catch and fix structural errors are the people producing a dataset. Numerous validation tools exist for ensuring that an XML document is well-formed and valid—that is, the document is structurally sound and matches its XML schema. Prior to releasing a dataset, an agency should run a validator on it to check for structural flaws. This sanity check can take just a few moments for an agency but save hours of developer time.

When deciding on the structure of a dataset, an agency should strive for simplicity while logically representing the underlying data. The addition of elements, attributes, or children in a schema can improve the quality and clarity of the dataset, but it can also add unnecessary complexity. When designing schemas, there’s a tendency to include elements or other structure that will almost certainly go unused in practice. Schema designers may assume that extraneous items do no harm, but developers must cautiously account for them if allowed by schema. The result can be wasted developer time and increased software complexity. The true cost of various structural choices is not just the time necessary to encode these choices in a schema but also the burden these choices impose on developers. Additional structural complexity must provide a justifiable benefit.

In some cases, however, the addition of elements or attributes is not only justifiable but highly desirable for developers: logically distinct pieces of data should appear in separate XML elements or attributes. Suppose that a developer wishes to access a piece of data in a dataset. If the data is combined with other information, the developer will need to figure out how to extract it from the combined field. This extraction can be difficult, time-consuming, and prone to errors. For example, assume that a data provider includes the following element:

<DOCINFO>Doc No. 2001345--Released 01-01-2001</DOCINFO>

To extract the document number, a developer might look for all characters following “No.” but before a dash. While this is straightforward enough, other parts of the same or future datasets might instead use the document number format “2001-345” or separate the document number and release date with a space rather than a double-dash. Neither case would lead to invalid XML, but both would break the developer’s extraction tool. Now consider this alternative:

<DOCINFO>
  <DOC_NO>2001345</DOC_NO>
  <RELEASE_DATE>01-01-2001</RELEASE_DATE>
</DOCINFO>

Using extra elements to separate logically distinct data can prevent extraction errors. This lesson often applies even when the combined data is related. For example, the version number 5.3.2 could be broken into major version 5, minor version 3, and revision 2. In general, agencies should separate such items themselves when they can do so more easily than developers.

Even when the basic structure of a dataset is ideal, choices about how to provide data inside this structure can affect developers. Developers thrive on consistency. Suppose that a dataset details various costs. Consider all possible ways of writing cost: $4,300, 5938.37, 74 dollars and 63 cents, etc. Unless an agency decides on, documents, and adheres to a standard format, developers’ software must handle a large number of possibilities to avoid unexpected surprises. Consistency in a dataset can make a developer’s life far easier, and it reduces the possibility that surprises will break an application. Note that a schema can be helpful for enforcing consistency for certain fields—for example, cost might be defined as a decimal field with a constraint on the number of fractional digits.

Redundant information is another source of difficulty for developers. Redundancy can appear in numerous ways. Suppose that a dataset contains the element “<VERSION>Version 5</VERSION>.” The word “Version” is unnecessary, and developers must go through additional trouble to extract the version number. In so doing, developers must consider the possibility that “Version” could be misspelled, abbreviated, or omitted. Supplying a version number alone (“<VERSION>5</VERSION>”) would avoid this issue altogether. More subtly, suppose that a dataset contains all bills introduced in Congress on a certain date:

<INTRODUCED_BILLS>
  <DATE>11-12-2014</DATE>
  <HOUSE_BILLS DATE="NOV 12, 2014">
    [...]
  </HOUSE_BILLS>
  <SENATE_BILLS DATE="NOV 12, 2014">
    [...]
  </SENATE_BILLS>
</INTRODUCED_BILLS>

Date information appears three times even though it must be the same in all cases. The more often a piece of information appears in a dataset, the more likely that inconsistencies will occur. These inconsistencies can lead to software errors requiring manual resolution. While redundancy can serve as a sanity check for errors, agencies typically should perform this check themselves if possible before releasing the data. After all, the agency is in the best position to fix inconsistencies. Unless well-justified, agencies should avoid redundancy.

Processing datasets often requires a significant amount of developer time, so adherence to even basic rules can dramatically increase innovation. What other low-level recommendations do FTT readers have for non-developers producing datasets?

Tomorrow, we’ll discuss how labeling elements in a dataset can help developers.