April 17, 2014


Basic Data Format Lessons

[This is the second post in a series on best practices for government datasets by Harlan Yu and me. (previous post)]

When creating a dataset, the preferences of developers may not be obvious to those producing the dataset. Seemingly innocuous choices by data providers can lead to major headaches for developers. In this post, we discuss some of the more basic challenges that developers encounter when working with a dataset. These lessons may seem trivial to our more technical readers, but they’re often learned through experience. Our hope is to reduce this learning curve by explaining how various practices affect developers. We’ll focus on XML datasets, but many of the topics apply to CSV and other data formats.

One of the hardest parts of working with a dataset can be figuring out what’s in it and how it’s organized. What data comes inside an “<FL47>” tag? Can a “<TEXT>” element ever contain a “<PARAGRAPH>” element? Developers rely heavily on documentation to explain the structure and contents of a dataset. When working with XML, one particularly relevant item is known as a schema. An XML schema is a separate file with an extension such as “.dtd” or “.xsd,” and it provides a blueprint of the permitted structure for corresponding XML files. XML schema files tell developers where to find the information they need in a dataset. These schema files and other documentation are often a necessity for developers, and they should be treated as such by data providers. Any XML file supplied by an agency should contain a complete URL at which its schema can be found. Further, any link to an XML document on a government site should have prominent links near it for the corresponding schema file and reasonable documentation describing the contents of the dataset.

XML schema files can be seen as an informal contract between data providers and developers, effectively promising that a dataset will match the specified structure. Unfortunately, sometimes datasets contain flaws causing them not to match that structure. Although experienced developers produce software that detects the existence of structural errors, these errors can be difficult or impossible for them to isolate and correct. The people in the best position to catch and fix structural errors are the people producing a dataset. Numerous validation tools exist for ensuring that an XML document is well-formed and valid—that is, the document is structurally sound and matches its XML schema. Prior to releasing a dataset, an agency should run a validator on it to check for structural flaws. This sanity check can take just a few moments for an agency but save hours of developer time.
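As a minimal sketch of such a sanity check, the snippet below tests whether an XML document is well-formed using Python's standard library. (Full schema validation against a DTD or XSD requires additional tooling, such as a third-party library like lxml; this sketch only catches structural errors.)

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the XML parses cleanly, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<DOC><TITLE>Example</TITLE></DOC>"))  # True
print(is_well_formed("<DOC><TITLE>Example</DOC>"))          # mismatched tag: False
```

Even a check this simple, run before release, catches the mismatched tags and stray characters that otherwise surface only after developers download the data.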

When deciding on the structure of a dataset, an agency should strive for simplicity while logically representing the underlying data. The addition of elements, attributes, or children in a schema can improve the quality and clarity of the dataset, but it can also add unnecessary complexity. When designing schemas, there’s a tendency to include elements or other structure that will almost certainly go unused in practice. Schema designers may assume that extraneous items do no harm, but developers must cautiously account for them whenever the schema allows them. The result can be wasted developer time and increased software complexity. The true cost of various structural choices is not just the time necessary to encode these choices in a schema but also the burden these choices impose on developers. Additional structural complexity must provide a justifiable benefit.

In some cases, however, the addition of elements or attributes is not only justifiable but highly desirable for developers: logically distinct pieces of data should appear in separate XML elements or attributes. Suppose that a developer wishes to access a piece of data in a dataset. If the data is combined with other information, the developer will need to figure out how to extract it from the combined field. This extraction can be difficult, time-consuming, and prone to errors. For example, assume that a data provider includes the following element:

<DOCINFO>Doc No. 2001345--Released 01-01-2001</DOCINFO>

To extract the document number, a developer might look for all characters following “No.” but before a dash. While this is straightforward enough, other parts of the same or future datasets might instead use the document number format “2001-345” or separate the document number and release date with a space rather than a double-dash. Neither case would lead to invalid XML, but both would break the developer’s extraction tool. Now consider this alternative:

<DOCINFO>
  <DOC_NO>2001345</DOC_NO>
  <RELEASE_DATE>01-01-2001</RELEASE_DATE>
</DOCINFO>

Using extra elements to separate logically distinct data can prevent extraction errors. This lesson often applies even when the combined data is related. For example, the version number 5.3.2 could be broken into major version 5, minor version 3, and revision 2. In general, agencies should separate such items themselves when they can do so more easily than developers.
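To illustrate the difference, here is a sketch contrasting the two approaches, using the element names from the examples above. The regular expression works today, but it silently depends on the “No. …--” convention holding forever; the structured version does not.

```python
import re
import xml.etree.ElementTree as ET

combined = "<DOCINFO>Doc No. 2001345--Released 01-01-2001</DOCINFO>"
# Brittle: assumes the exact "No. <digits>--" convention never changes.
match = re.search(r"No\.\s*(\d+)--", ET.fromstring(combined).text)
doc_no_fragile = match.group(1) if match else None

structured = """<DOCINFO>
  <DOC_NO>2001345</DOC_NO>
  <RELEASE_DATE>01-01-2001</RELEASE_DATE>
</DOCINFO>"""
# Robust: the document number lives in its own element.
doc_no = ET.fromstring(structured).findtext("DOC_NO")

print(doc_no_fragile, doc_no)  # both "2001345" -- until the combined format drifts
```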

Even when the basic structure of a dataset is ideal, choices about how to provide data inside this structure can affect developers. Developers thrive on consistency. Suppose that a dataset details various costs. Consider all possible ways of writing cost: $4,300, 5938.37, 74 dollars and 63 cents, etc. Unless an agency decides on, documents, and adheres to a standard format, developers’ software must handle a large number of possibilities to avoid unexpected surprises. Consistency in a dataset can make a developer’s life far easier, and it reduces the possibility that surprises will break an application. Note that a schema can be helpful for enforcing consistency for certain fields—for example, cost might be defined as a decimal field with a constraint on the number of fractional digits.
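The burden that inconsistent formatting imposes can be sketched as a normalizer that must anticipate every cost format an agency might emit. The formats handled below are just the hypothetical ones mentioned above; a real dataset without a documented standard could contain many more.

```python
import re
from decimal import Decimal

def normalize_cost(raw):
    """Convert assorted cost strings to a Decimal number of dollars."""
    text = raw.strip().lstrip("$").replace(",", "")
    # Handle the spelled-out "N dollars and M cents" form.
    m = re.match(r"(\d+) dollars and (\d+) cents$", text)
    if m:
        return Decimal(m.group(1)) + Decimal(m.group(2)) / 100
    return Decimal(text)

for raw in ["$4,300", "5938.37", "74 dollars and 63 cents"]:
    print(normalize_cost(raw))
```

Every additional format an agency emits adds another branch here, and an undocumented one breaks the tool entirely; a single documented format would make this function one line.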

Redundant information is another source of difficulty for developers. Redundancy can appear in numerous ways. Suppose that a dataset contains the element “<VERSION>Version 5</VERSION>.” The word “Version” is unnecessary, and developers must go through additional trouble to extract the version number. In so doing, developers must consider the possibility that “Version” could be misspelled, abbreviated, or omitted. Supplying a version number alone (“<VERSION>5</VERSION>”) would avoid this issue altogether. More subtly, suppose that a dataset contains all bills introduced in Congress on a certain date:

<INTRODUCED_BILLS>
  <DATE>11-12-2014</DATE>
  <HOUSE_BILLS DATE="NOV 12, 2014">
    [...]
  </HOUSE_BILLS>
  <SENATE_BILLS DATE="NOV 12, 2014">
    [...]
  </SENATE_BILLS>
</INTRODUCED_BILLS>

Date information appears three times even though it must be the same in all cases. The more often a piece of information appears in a dataset, the more likely that inconsistencies will occur. These inconsistencies can lead to software errors requiring manual resolution. While redundancy can serve as a sanity check for errors, agencies typically should perform this check themselves if possible before releasing the data. After all, the agency is in the best position to fix inconsistencies. Unless the redundancy is well-justified, agencies should avoid it.
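A defensive developer ends up writing exactly the consistency check that the agency could have run itself. A sketch, using the element and attribute names from the example above (note that the two date formats must each be parsed separately):

```python
from datetime import datetime
import xml.etree.ElementTree as ET

doc = """<INTRODUCED_BILLS>
  <DATE>11-12-2014</DATE>
  <HOUSE_BILLS DATE="NOV 12, 2014"></HOUSE_BILLS>
  <SENATE_BILLS DATE="NOV 12, 2014"></SENATE_BILLS>
</INTRODUCED_BILLS>"""

root = ET.fromstring(doc)
# The three copies of the date use two different formats.
dates = [datetime.strptime(root.findtext("DATE"), "%m-%d-%Y").date()]
for tag in ("HOUSE_BILLS", "SENATE_BILLS"):
    dates.append(datetime.strptime(root.find(tag).get("DATE"), "%b %d, %Y").date())

consistent = len(set(dates)) == 1
print(consistent)  # True here, but any mismatch requires manual resolution
```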

Processing datasets often requires a significant amount of developer time, so adherence to even basic rules can dramatically increase innovation. What other low-level recommendations do FTT readers have for non-developers producing datasets?

Tomorrow, we’ll discuss how labeling elements in a dataset can help developers.

Comments

  1. Staffan Malmgren says:

When possible, documents should use standard datatypes (e.g. those specified in XML Schema, http://www.w3.org/TR/xmlschema-2/) with standard formatting. Schemas should declare what datatype a certain element uses. For example, dates should be declared as xsd:date as per the XSD specification, and use ISO 8601 formatting (e.g. “2001-01-01” rather than “01-01-2001” or “Jan 1, 2001”).

  2. Amateur Layman says:

    I have experience in designing B2B systems and I agree wholeheartedly with all the advice above.

Automatic self-validation before delivery is important; often you don’t know exactly what is going into your XML. There will be assumptions and bugs you don’t know about. For instance, I had a case where special Microsoft quotes were pasted into my app from MS Office; they weren’t valid ASCII characters, so they broke my XML (for some XML parsers, anyway).

On datatypes, I agree with Staffan above, and would additionally specify whether a field is a date only or a date plus time, and to what accuracy (seconds, and fractions of a second).

KISS is extremely important: even if standard data formats might cover the data, using them may complicate the outcome. For instance, I had a datafeed XML that included three schemas: XNAL, my schema, and the file management schema, and I didn’t even include GML. The resultant mess was hard to get right and would be hard to consume except for the agency receiving it.

Principle of least surprise is a useful guideline. Don’t surprise the consumer of the file.

    Try not to use dynamic datatypes/structures or other surprising features.

Provide a simple example XML file with all the allowed tags and attributes, so the consumer can test against that ‘most complex’ file before looking at the actual data.

    Design for change, allowing for extensions over time, with versioning at an appropriate level (document or tag). This also might mean using some simple dynamic structures such as unconstrained name/value lists.

    At the consumption end wholly different design constraints apply.

    Andrew.

  3. rp says:

    On the other hand, people who develop a schema must be absolutely sure that all of the things that are going to be represented in that schema actually fit into it. The people who are generating the content aren’t going to be happy if some of the things they want to say (or have been saying on rare occasions for many years) don’t fit the new schema. And unstructured appendages to the structured material may or may not be effective.

    With enough work on design and validation, and enough iterations, this problem can of course be solved, but in a resource-constrained situation it may sometimes be better to let the information consumers and their developers sort things out than to build a schema that makes promises it can’t really keep.

  4. supercat says:

    What should one do if the data has various fields which are generally redundant, but which–for whatever reason–sometimes contain data that does not fit the pattern? One may know that any data not fitting the pattern contains an error, but it may not be clear where that error is. What if there are three fields, any two of which would allow calculation of the third, but where one will sometimes only know one (e.g. the Universal Coordinated Time of an event, the local time of that event, and the time zone where the event occurred)?