April 20, 2014

avatar

Correcting Errors and Making Changes

[This is the fourth post in a series on best practices for government datasets by Harlan Yu and me. (previous posts: 1, 2, 3)]

Even cautiously edited datasets sometimes contain errors, and even meticulously produced schemas require refinement as circumstances change. While errors or changes create inconvenience for developers, most developers appreciate and prepare for their inevitability. Agencies should strive to do the same. A well-developed strategy for fixes and changes can ease their burden on both developers and agencies.

When agencies release data, developers ideally will interact with it in creative new ways. Given datasets containing megabytes to gigabytes of data, novel uses will reveal previously unnoticed errors. Knowledge of these errors benefits the agency as well as other developers using the data, so agencies should take steps to encourage error reporting. Labels in a dataset allow developers to specify errors efficiently and unambiguously. An easy-to-find channel for reporting errors, such as a prominently provided email address or web form, is also critical. Tracking down the contact information of the person responsible for a dataset can be difficult, and a well-known channel reduces this barrier to feedback.

Upon learning of an issue in a dataset, an agency should correct the problem and release the corrected dataset in a timely manner. An important fact to keep in mind when correcting data is that numerous developers may have already downloaded and begun using the old flawed version. For these developers, even a minor modification can cause major issues if not done carefully. Agencies should think about two things: how they will make developers aware that the dataset has been modified and how they will change the dataset itself. The first point is sometimes ignored in spite of its importance. Not only should datasets contain version information, but agencies should also notify developers when the data that they rely on has changed. In particular, agencies should allow developers to subscribe to an email list or an RSS feed for specific datasets that details updates in a well-structured manner. These updates should clearly specify the dataset and version affected, a location where the updated dataset can be found, and a description of the changes to the dataset. When possible, these changes should be specified via a formal, structured description—for example, a diff output—as well as a brief prose explanation.

Correction of dataset contents should proceed cautiously. Suppose that an application allows user to comment on parts of a document. If labels are in a dataset are not maintained consistently across versions, the developer may need to painstakingly map comments from the old data to the corresponding parts of the new dataset. Issues like this can be mitigated through several practices. First, an agency should seek to preserve labels across versions of a dataset when possible (alternatively, in some cases an agency might wish to change the labels but provide a mapping to assist developers). For example, a dataset might aggregate numerous documents, and a minor change in one document should not necessarily change the labels for the other documents. Recall the side note from our previous post that labels should be separate from ordering information. Corrections to a dataset may add, remove, or reorder items. Detaching order from labels can help agencies ensure label consistency across dataset versions. In addition, the last post and its comments discussed whether agencies should provide a label that is separate from its internally used agency label. This separation allows labels to remain consistent even when Subsection X becomes Section Y based on the internal agency labels. Note that these points about consistent labeling can be useful whenever a dataset could have multiple versions: for example, consistent labeling might be beneficial across various versions of a bill.

Similarly, the structure that agencies use for datasets, the locations where the datasets are hosted, and other details of a dataset sometimes must change. Suppose that an agency releases various statistics each month. When the agency is asked to provide a new statistic, the new data may necessitate changes to the XML schema. Alternatively, the agency may decide to host data at the address “http://www.agency.gov/YEAR/MONTH/data.xml” rather than “http://www.agency.gov/MONTH-YEAR/data.xml,” causing issues for automated tools that periodically check for and download new data. To reduce the adverse impact of these changes on developers, agencies should provide detailed notice of the changes as early as possible. Early notice gives developers time to modify their tools. These notifications can occur via an email list or RSS feed providing details of the changes in a clear, consistent format.

The possibility of changes and their impact on developers should be taken into account at all stages of the data production process. Suppose an agency adds an element to a schema that specifies a unique individual, but the schema may someday need to specify a corporation instead. Although the agency should not speculatively add unnecessary elements to the schema, it should be mindful of possible changes when designing the rest of the schema. Various design choices may minimize the impact of a change if necessary later. Agencies should also avoid the urge to alter a schema dramatically each time it requires a minor change. A major overhaul—even when done to clean up the schema—may require equally dramatic changes in tools utilizing the data. To ensure that developers notice changes to XML schemas, both schema files and datasets should contain a prominent schema version number. If an agency changes the location where data is hosted, it should consider temporarily using aliases so that requests using old addresses automatically take you to the correct data. Once the old addresses are phased out, agencies should use a standard HTTP 404 status code to indicate that the requested data was not found at the specified location. Simply supplying a “Not Found” page without this standard code could make life harder for developers whose automated tools must instead parse this page.

When making changes, agencies should consider soliciting input directly from developers. Because the preferences of developers might not be obvious, this input can lead to choices that help developers without increasing the burden on agencies. In fact, developers may even come up with ideas that make life easier for an agency.

Our next and final post in this series will discuss a handful of additional issues for agencies to consider.

Comments

  1. gawp says:

    Nice article. I work with a lot of biomedical datasets freely available on the net, and I fully agree that huge datasets inevitably have errors and inconsistencies, despite the best efforts of the maintainers. Many of them appear in merged data sets where one set is annotated with information from another set or analysis method.

    The NCBI MEDLINE database of medical literature information is a lovely example of how to do it right (thought the NCBI has many other well maintained and supported data sets). MEDLINE has updated their schema over time, they have regular updates and yearly full releases, clear versioning and release notes, and they’re very responsive to issues with the data. They also have their own search tool for that data set and very reasonable license terms for the use of the data sets for researchers.