Posts tagged with: Government transparency

Assessing PACER's Access Barriers

The U.S. Courts recently conducted a year-long assessment of their Electronic Public Access program which included a survey of PACER users. While the results of the assessment haven’t been formally published, the Third Branch Newsletter has an interview with Bankruptcy Judge J. Rich Leonard that discusses a few high-level findings of the survey. Judge Leonard has been heavily involved in shaping the evolution of PACER since its inception twenty years ago and continues to lead today.

The survey covered a wide range of PACER users—“the courts, the media, litigants, attorneys, researchers, and bulk data collectors”—and Judge Leonard claims they found “a remarkably high level of satisfaction”: around 80% of those surveyed were “satisfied” or “very satisfied” with the service.

If we compare public access before we had PACER to where we are now, there is clearly much success to celebrate. But the key question is not only whether current users are satisfied with the service but also whether PACER is reaching its entire audience of potential users. Are there artificial obstacles preventing potential PACER users—who admittedly would be difficult to poll—from using the service? The satisfaction statistic may be fine at face value, assuming that a representative sample of users were polled, but it could be misleading if it’s being used to gauge the overall success of PACER as a public access system.

One indicator of obstacles may be another statistic cited by Judge Leonard: “about 45% of PACER users also use CM/ECF,” the Courts’ electronic case management and filing system. To put it another way, nearly half of all PACER users are currently attorneys who practice federal law.

That number seems inordinately high to me and suggests that significant barriers to public access may exist. In particular, account registration requires all users to submit a valid credit card for billing (or alternatively a valid home address to receive log-in credentials and billing statements by mail.) Even if users’ credit cards are never charged, this registration hurdle may already turn away many potential PACER users at the door.

The other barrier is obviously the cost itself. With a few exceptions, users are forced to pay a fee for each document they download, at a metered rate of eight-cents per page. Judge Leonard asserts that “surprisingly, cost ranked way down” in the survey and that “most people thought they paid a fair price for what they got.”

But this doesn’t necessarily imply that cost isn’t a major impediment to access. It may just be that those surveyed—primarily lawyers—simply pass the cost of using PACER down to their clients and never bear the cost themselves. For the rest of PACER users who don’t have that luxury, the high cost of access can completely rule out certain kinds of legal research, or cause users to significantly ration and monitor their usage (as is the case even in the vast majority of our nation’s law libraries), or wholly deter users from ever using the service.

Judge Leonard rightly recognizes that it’s Congress that has authorized the collection of user fees, rather than using general taxpayer money, to fund the electronic public access program. But I wish the Courts would at least acknowledge that moving away from a fee-based model, to a system funded by general appropriations, would strengthen our judicial process and get us closer to securing each citizen’s right to equal protection under the law.

Rather than downplaying the barriers to public access, the Courts should work with Congress to establish a way forward to support a public access system that is truly open. They should study and report on the extent to which Congress already funds PACER indirectly, through Executive and Legislative branch PACER fee payments to the Judiciary, and re-appropriate those funds directly. If there is a funding shortfall, and I assume there will be, they should study the various options for closing that gap, such as additional direct appropriations or a slight increase in certain filing fees.

With our other two branches of government making great strides in openness and transparency with the help of technology, the Courts similarly needs to transition away from a one-size-fits-all approach to information dissemination. Public access to the courts will be fundamentally transformed by a vigorous culture of civic innovation around federal court documents, and this will only happen if the Courts confront today’s access barriers head-on and break them down.

(Thanks to Daniel Schuman for pointing me to the original article.)

Broadband Politics and Closed-Door Negotiations at the FCC

The last seven days at the FCC have been drama-filled, and that's not something you can often say about an administrative agency. As I noted in my last post, the FCC is considering reclassifying broadband as a "common carrier" service. This would subject the access portion of the service to some additional regulations which currently do not apply, but have (to some extent) been applied in the past. Last Thursday, the FCC voted 3-2 along party lines to pursue a Notice of Inquiry about this approach and others, in order to help solidify its ability to enforce consumer protections and implement the National Broadband Plan in the wake of the Comcast decision in the DC Circuit Court. There was a great deal of politicking and rhetoric around the vote. Then, on Monday, the Wall Street Journal reported that lobbyists were engaged in closed-door meetings at the FCC, discussing possible legislative compromises that would obviate the need for reclassification. This led to public outcry from everyone who was not involved in the meetings, and allegations of misconduct by the FCC for its failure to disclose the meetings. If you sit through my description of the intricacies of reclassification, I promise to give you the juicy bits about the controversial meetings.

The Reclassification Vote and the NOI
As I explained in my previous post, the FCC faces a dilemma. The DC Circuit said it did not have the authority under Title I of the Communications Act to enforce the broadband openness principles it espoused in 2005. This cast into doubt the FCC's ability to not only police violations of the principles but also to implement many portions of the National Broadband Plan. In the past, the Commission would have had unquestioned authority under Title II of the Act, but in a series of decisions from 2002-2007 it voluntarily "deregulated" broadband by classifying it as a Title I service. Chairman Genachowski has floated what he calls a "Third Way" approach in which broadband is not classified as a Title I service anymore, and is not subject to all provisions of Title II, but instead is classified under Title II but with extensive "forbearance" from portions of that title.

From a legal perspective, the main question is whether the FCC has the authority to reclassify the transmission component of broadband internet service as a Title II service. This gets into intricacies of how broadband service fits into statutory definitions of "information service" (aka Title I), "telecommunications", "telecommunications service" (aka Title II), and the like. I was going to lay these out in detail, but in the interest of getting to the juicy stuff I will simply direct you to Harold Feld's excellent post. For the "Third Way" approach to work, the FCC's interpretation of a "telecommunications service" will have to be articulated to include broadband internet access while not also swallowing a variety of internet services that everyone thinks should remain unregulated -- sites like Facebook, content delivery networks like Akamai, and digital media providers like Netflix. However, this narrow definition must not be so narrow that the FCC does not have jurisdiction to police the types of practices it is concerned about (for instance, providers should not be able to discriminate in their delivery of traffic simply by moving the discrimination from their transport layer of the network to the logical layer, or by partnering with an affiliated "ISP" that does discrimination for them). I am largely persuaded of Harold's arguments, but the AT&T lobbyists present the other side as well. One argument that I don't see anyone making (yet) is that presuming the transmission component is subject to Title II, the FCC would seem to have a much stronger argument for exercising ancillary jurisdiction with respect to interrelated components like non-facilities-based ISPs that rely on that transmission component.

The other legal debate involves an even more arcane discussion about whether -- assuming there is a "telecommunications service" offered as part of broadband service -- that "telecommunications service" is something that can be regulated separately from the other "information services" (Title I) that might be offered along with it. This includes things like an email address from your provider, DNS, Usenet, and the like. Providers have historically argued that these were inseparable from the internet access component, and the so-called "Stevens Report" of 1998 introduced the notion that the "inextricably intertwined" nature of broadband service might have the result of classifying all such services as entirely Title I "information services." To the extent that this ever made any sense, it is far from true today. What consumers believe they are purchasing is access to the internet, and all of those other services are clearly extricable from a definitional and practical standpoint (indeed, customers can and do opt for competitors for all of them on a regular basis).

But none of these legal arguments are at the fore of the current debate, which is almost entirely political. Witness, for example, John Boehner's claim that the "Third Way" approach was a "government takeover of the Internet," Fred Upton's (R-MI) claim that the approach is a "blind power grab," modest Democratic sign-on to an industry-penned and reasoning-free opposition letter, and an attempt by Republican appropriators to block funding for the FCC unless they swore off the approach. This prompted a strong response from Democratic leaders indicating that any such effort would not see the light of day. Ultimately, the FCC voted in favor of the NOI to explore the issue. Amidst this tumult, the WSJ reported that the FCC had started closed-door meetings with industry representatives in order to discuss a possible legislative compromise.

Possible Legislation and Secret Meetings
It is not against the rules to communicate with the FCC about active proceedings. Indeed, such communications are part of a healthy policymaking process that solicits input from stakeholders. The FCC typically conducts proceedings under the "permit but disclose" regime in which all discussions pertaining to the given proceeding must be described in "ex parte" filings on the docket. Ars has a good overview of the ex parte regime. The NOI passed last week is subject to these rules.


Free Press Ad in 6/23 Washington Post

It therefore came as a surprise that a subset of industry players were secretly meeting with the FCC to discuss possible legislation that could make the NOI irrelevant. This issue is made even more egregious by the fact that the FCC just conducted a proceeding on improving ex parte disclosures, and the Chairman remarked:

"Given the complexity and importance of the issues that come before us, ex parte communications remain an essential part of our deliberative process. It is essential that industry and public stakeholders know the facts and arguments presented to us in order to express informed views."

The Chairman's Chief of Staff Edward Lazarus sought to explain away the obligation for ex parte disclosure, and nevertheless attached a brief disclosure letter from the meeting attendees that didn't describe any of the details. There is perhaps a case to be made that the legislative options do not directly fall under the subject matter of the NOI, but even if this position were somehow legally justifiable it clearly falls afoul of the policy intent of the ex parte rules. Harold Feld has a great post in which he describes his nomination for "Worsht Ex Parte Ever". The letter attached to the Lazarus post would certainly take the title if it were a formal ex parte letter. The industry participants in the meetings deserve some criticism, but ultimately the problems can only be resolved by the FCC by demanding comprehensive openness rather than perpetuating a culture of loopholes.

The public outcry continues, from both public interest groups and in the comments on the Lazarus post. If it's true that the FCC admits internally that "they f*cked up", they should do far more to regain the public's trust in the integrity of the notice-and-comment process.

Update: The Lazarus post was just updated to replace the link to the brief disclosure letter with two new links to letters that describe themselves as Ex Parte letters. The first contains the exact same text as the original, and the second has a few bullet points.

Release Government Data, Early and Often

One of the key axioms of modern open government is that all public data should be published online in a raw but usable form. Usability in this case is aimed at software programmers. By making government datasets more usable, programmers are more likely to innovate in the civic sphere and build technologies, using the raw data, to enhance the relationships among citizens and with government.

The open government community has provided plenty of valuable guidance about what usability means for programmers. We proclaim that all datasets need to be: published in a format that is reasonably structured and machine-processable; well-documented; downloadable in bulk; authenticated using cryptographic digital signatures; version-controlled; permanent and citable; and the list goes on and on. These are all worthy principles to be sure, and all government datasets should strive to meet them.

But you'll be hard-pressed to find any government datasets that exist with all of these principles pre-satisfied. While some are in better shape than others, most datasets would make programmers cringe. Data often only exist as informal working sets in proprietary Excel spreadsheets. Sometimes they are in structured databases, but schemas are undocumented, field values are ambiguous, and the semantics are only understood by the employee who created them. Datasets have errors and biases that are known but never explicitly corrected.

For a civil servant who is a data caretaker looking over the laundry list of publishing principles, there's frequently a huge quality chasm between the dataset she owns and how people are asking to see it released. To her, publishing this data adequately just seems like a lot of extra work. The more attractive alternative is to put off the data publishing—it's not in her job description or evaluations anyway—and move on to other work instead.

How can this chasm be bridged? A widely-adopted philosophy in software development and entrepreneurship would serve open government data well: release early and release often. And listen to your customers.

In the software development world, a working version of the product is pushed out as soon as possible even with known imperfections—an "alpha" release—so it can be subject to real use by early adopters. Early adopters can provide helpful feedback about what works, what's broken, and what new features would be most useful to them. The software developers then iterate quickly. They incorporate the suggested fixes and features into their code and release an updated version of the product to their users. The virtuous cycle then starts again. Under this philosophy, software developers can be efficient about how to best improve their code where it matters, and users get software that works better and has more features they desire.

The "release early, release often" philosophy should be applied to government data. For the initial release, data caretakers should take the path of least resistance to get data out the door. This means publishing datasets in whatever format is most convenient, along with as much documentation as can reasonably be mustered. Documentation is especially important with an "alpha" dataset—proper warnings about its problems, instabilities and inductive limitations must be prominently displayed. (Of course, the usual privacy and legal caveats should also be applied.) Sometimes, the "alpha" release will be "good enough" for programmers to start their work, and this will minimize any superfluous work done by caretakers. This is the virtue of "release early."

In other cases, programmers will need assistance using the dataset and will notice problem spots with the initial release. The dataset might be confusing, contain errors or be difficult to work with. A tight feedback mechanism allows the programmer to get help quickly and continue to innovate, while the data caretaker can fix problems based on real use cases and add clarifying metadata into an updated version of the dataset. Data quality and usability increases for those working with the dataset, both in and outside of government. That's the virtue of "release often."

And here is the big opportunity for government: no platform currently exists to engage the prime audience for government data—software programmers. Without a tight feedback mechanism, the virtuous cycle of mutual benefit cannot exist. Government is missing its best opportunity to improve data quality by neglecting useful feedback from programmers who are actually tinkering with the datasets. Society is losing out on potentially game-changing civic innovations, which otherwise would have been built if data were more usable and the uncertainty of failure reduced.

A terrific start in turning the corner would be for government to adopt an issue-tracking system for its datasets. As a public venue, it would help ensure that data caretakers are prompt in addressing developer concerns. It would also allow caretakers to organize feedback in a formal way. Such platforms are commonplace in any successful software development venture. The same needs to be true for government data in order to drive rapid quality improvements and increase developer engagement.

CITP Expands Scope of RECAP

Today, we're thrilled to announce the next version of our RECAP technology, dramatically expanding the scope of the project.

Having had some modest success at providing public access to legal documents, we're now taking the next logical step, offering easy public access to illegal documents.

The Internet Archive, which graciously hosts RECAP's repository of legal documents, was strangely unreceptive to our offer to let them store the world's most comprehensive library of illegal documents. Fortunately, the Pirate Bay was happy to step in and help.

Interested in seeing what's available? Then you might want to watch our brief instructional video.

Best Practices for Government Datasets: Wrap-Up

[This is the fifth and final post in a series on best practices for government datasets by Harlan Yu and me. (previous posts: 1, 2, 3, 4)]

For our final post in this series, we'll discuss several issues not touched on by earlier posts, including data signing and the use of certain non-text file formats. The relatively brief discussions of these topics should not be interpreted as an indicator of their importance. The topics simply did not fit cleanly into earlier posts.

One significant omission from earlier posts is the issue of data signing with digital signatures. Before discussing this issue, let's briefly discuss what a digital signature is. Suppose that you want to email me an IOU for $100. Later, I may want to prove that the IOU came from you—it's of little value if you can claim that I made it up. Conversely, you may want the ability to prove whether the document has been altered. Otherwise, I could claim that you owe me $100,000.

Digital signatures help in proving the origin and authenticity of data. These signatures require that you create two related big numbers, known as keys: a private signing key (known only by you) and a public verification key. To generate a digital signature, you plug the data and your signing key into a complicated formula. The formula spits out another big number known a digital signature. Given the signature and your data, I can use the verification key to prove that the data came unmodified from you. Similarly, nobody can credibly sign modified data without your signing key—so you should be very careful to keep this key a secret.

Developers may want to ensure the authenticity of government data and to prove that authenticity to users. At first glance, the solution seems to be a simple application of digital signatures: agencies sign their data, and anyone can use the signatures to authenticate an agency's data. In spite of their initially steep learning curve, tools like GnuPG provide straightforward file signing. In practice, the situation is more complicated. First, an agency must decide what data to sign. Perhaps a dataset contains numerous documents. Developers and other users may want signatures not only for the full dataset but also for individual documents in it.

Once an agency knows what to sign, it must decide who will perform the signing. Ideally, the employee producing the dataset would sign it immediately. Unfortunately, this solution requires all such employees to understand the signature tools and to know the agency's signing key. Widespread distribution of the signing key increases the risk that it will be accidentally revealed. Therefore, a central party is likely to sign most data. Once data is signed, an agency must have a secure channel for delivering the verification key to consumers of the data—users cannot confirm the authenticity of signed data without this key. While signing a given file with a given key may not be hard, surrounding issues are more tricky. We offer no simple solution here, but further discussion of this topic between government agencies, developers, and the public could be useful for all parties.

Another issue that earlier posts did not address is the use of non-text spreadsheet formats, including Microsoft Excel's XLS format. These formats can sometimes be useful because they allow the embedding of formulas and other rich information along with the data. Unfortunately, these formats are far more complex than raw text formats, so they present a greater challenge for automated processing tools. A comma-separated value (CSV) file is a straightforward text format that contains values separated by line breaks and commas. It provides an alternative to complicated spreadsheet formats. For example, the medal count from the 2010 Winter Olympics in CSV would be:

  Country,Gold,Silver,Bronze,Total
  USA,9,15,13,37
  Germany,10,13,7,30
  Canada,14,7,5,26
  Norway,9,8,6,23
  ...

Fortunately, the release of data in one format does not preclude its release in another format. Most spreadsheet programs provide an option to save data in CSV form. Agencies should release spreadsheet data in a textual format like CSV by default, but an agency should feel free to also release the data in XLS or other formats.

Similarly, agencies will sometimes release large files or groups of files in a compressed or bundled format (for example, ZIP, TAR, GZ, BZ). In these cases, agencies should prominently specify where users can freely obtain software and instructions for extracting the data. Because so many means of compressing and bundling files exist, agencies should not presume that the necessary tools and steps are obvious from the data files themselves.

The rules suggested throughout this series should be seen as best practices rather than hard-and-fast rules. We are still in the process of fleshing out several of these ideas ourselves, and exceptional cases sometimes justify exceptional treatment. In unusual cases, an agency may need to deviate from traditional best practices, but it should carefully consider (and perhaps document) its rationale for doing so. Rules are made to be broken, but they should not be broken for mere expedience.

Our hope is that this series will provide agencies with some points to consider prior to releasing data. Because of Data.gov and the increasing traction of openness and transparency initiatives, we expect to see many more datasets enter the public domain in the coming years. Some agencies will approach the release of bulk data with minimal previous experience. While this poses a challenge, it also present an opportunity for committed agencies to institute good practices early, before bad habits and poor-quality legacy datasets can accumulate. When releasing new datasets, agencies will make numerous conscious and unconscious choices that impact developers. We hope to help agencies understand developers' challenges when making these choices.

After gathering input from the community, we plan to create a technical report based on this series of posts. Thanks to numerous readers for insightful feedback; your comments have influenced and clarified our thoughts. If any FTT readers inside or outside of government have additional comments about this post or others, please do pass them along.

Correcting Errors and Making Changes

[This is the fourth post in a series on best practices for government datasets by Harlan Yu and me. (previous posts: 1, 2, 3)]

Even cautiously edited datasets sometimes contain errors, and even meticulously produced schemas require refinement as circumstances change. While errors or changes create inconvenience for developers, most developers appreciate and prepare for their inevitability. Agencies should strive to do the same. A well-developed strategy for fixes and changes can ease their burden on both developers and agencies.

When agencies release data, developers ideally will interact with it in creative new ways. Given datasets containing megabytes to gigabytes of data, novel uses will reveal previously unnoticed errors. Knowledge of these errors benefits the agency as well as other developers using the data, so agencies should take steps to encourage error reporting. Labels in a dataset allow developers to specify errors efficiently and unambiguously. An easy-to-find channel for reporting errors, such as a prominently provided email address or web form, is also critical. Tracking down the contact information of the person responsible for a dataset can be difficult, and a well-known channel reduces this barrier to feedback.

Upon learning of an issue in a dataset, an agency should correct the problem and release the corrected dataset in a timely manner. An important fact to keep in mind when correcting data is that numerous developers may have already downloaded and begun using the old flawed version. For these developers, even a minor modification can cause major issues if not done carefully. Agencies should think about two things: how they will make developers aware that the dataset has been modified and how they will change the dataset itself. The first point is sometimes ignored in spite of its importance. Not only should datasets contain version information, but agencies should also notify developers when the data that they rely on has changed. In particular, agencies should allow developers to subscribe to an email list or an RSS feed for specific datasets that details updates in a well-structured manner. These updates should clearly specify the dataset and version affected, a location where the updated dataset can be found, and a description of the changes to the dataset. When possible, these changes should be specified via a formal, structured description—for example, a diff output—as well as a brief prose explanation.

Correction of dataset contents should proceed cautiously. Suppose that an application allows user to comment on parts of a document. If labels are in a dataset are not maintained consistently across versions, the developer may need to painstakingly map comments from the old data to the corresponding parts of the new dataset. Issues like this can be mitigated through several practices. First, an agency should seek to preserve labels across versions of a dataset when possible (alternatively, in some cases an agency might wish to change the labels but provide a mapping to assist developers). For example, a dataset might aggregate numerous documents, and a minor change in one document should not necessarily change the labels for the other documents. Recall the side note from our previous post that labels should be separate from ordering information. Corrections to a dataset may add, remove, or reorder items. Detaching order from labels can help agencies ensure label consistency across dataset versions. In addition, the last post and its comments discussed whether agencies should provide a label that is separate from its internally used agency label. This separation allows labels to remain consistent even when Subsection X becomes Section Y based on the internal agency labels. Note that these points about consistent labeling can be useful whenever a dataset could have multiple versions: for example, consistent labeling might be beneficial across various versions of a bill.

Similarly, the structure that agencies use for datasets, the locations where the datasets are hosted, and other details of a dataset sometimes must change. Suppose that an agency releases various statistics each month. When the agency is asked to provide a new statistic, the new data may necessitate changes to the XML schema. Alternatively, the agency may decide to host data at the address "http://www.agency.gov/YEAR/MONTH/data.xml" rather than "http://www.agency.gov/MONTH-YEAR/data.xml," causing issues for automated tools that periodically check for and download new data. To reduce the adverse impact of these changes on developers, agencies should provide detailed notice of the changes as early as possible. Early notice gives developers time to modify their tools. These notifications can occur via an email list or RSS feed providing details of the changes in a clear, consistent format.

The possibility of changes and their impact on developers should be taken into account at all stages of the data production process. Suppose an agency adds an element to a schema that specifies a unique individual, but the schema may someday need to specify a corporation instead. Although the agency should not speculatively add unnecessary elements to the schema, it should be mindful of possible changes when designing the rest of the schema. Various design choices may minimize the impact of a change if necessary later. Agencies should also avoid the urge to alter a schema dramatically each time it requires a minor change. A major overhaul—even when done to clean up the schema—may require equally dramatic changes in tools utilizing the data. To ensure that developers notice changes to XML schemas, both schema files and datasets should contain a prominent schema version number. If an agency changes the location where data is hosted, it should consider temporarily using aliases so that requests using old addresses automatically take you to the correct data. Once the old addresses are phased out, agencies should use a standard HTTP 404 status code to indicate that the requested data was not found at the specified location. Simply supplying a "Not Found" page without this standard code could make life harder for developers whose automated tools must instead parse this page.

When making changes, agencies should consider soliciting input directly from developers. Because the preferences of developers might not be obvious, this input can lead to choices that help developers without increasing the burden on agencies. In fact, developers may even come up with ideas that make life easier for an agency.

Our next and final post in this series will discuss a handful of additional issues for agencies to consider.

Labeling Dataset Contents

[This is the third post in a series on best practices for government datasets by Harlan Yu and me. (previous posts)]

When the government releases a dataset, citizens ideally will discuss the contents and supply educated feedback. The ability to reference facts and figures in a dataset supports a constructive dialog. Vague concerns are harder to articulate and address than ones citing specific paragraphs in a document. In this post, we'll discuss why data labeling supports this goal, and when and how government agencies should uniquely label data inside a dataset for citability. As in the previous post, our focus will be on XML, though the lessons apply to other formats.

As our interactions with each other and with our government increasingly occur online, the need for precise communication has also increased. Open-government initiatives can give knowledge and voices to more citizens than ever before, but this can lead to an almost overwhelming quantity of discussion. Various technologies can help us to manage and make sense of this information, but these technologies are most effective with unambiguous data. For example, tools could sort citizens' comments on a bill by section, but this task can be difficult unless the comments cite sections. One way to encourage citations is by placing tags in the dataset that citizens and open-government tools can easily reference.

The structure of XML implicitly enables referencing of elements in a sense. A citizen could cite the seventh "<PARARGRAPH>" element in the twenty-eighth "<DOCUMENT>" element in a dataset. Even ignoring how error-prone counting is for humans, reliance on this structure is not ideal. XML schemas can specify order for elements of different types but not the same type—a parser could validly retrieve <PARAGRAPH> elements of a document in any order (we'll discuss in our next post why labels and ordering should be treated as two separate problems; our point here is only that element order should not be used as an implicit label). In addition, different parties may come up with different reference schemes in the absence of an explicit authoritative one. The agency creating a dataset might refer to the paragraph referenced above as Section XII of Document K6-2495, and another developer might refer to it as "<PARAGRAPH>" 147. An abundance of reference schemes can make it harder for government officials to understand citizens, harder for citizens to understand each other, and harder for developers to merge the function and output of their tools. Using an explicit common reference scheme avoids these issues.

Of course, different uses require different forms of labeling, and agencies cannot meet the desires of everyone. How can they decide where to add labels? Recall that our previous posts address the question of who should add what structure to a dataset. Agencies should use the answer as a guide for where to add labels, generally adding labels to all elements they create. If an agency breaks text up by paragraph, each paragraph should be citable; if it breaks text up by sentence, each sentence should be citable. Labels are fairly straightforward to add to elements in XML, so this rule imposes minimal additional work on agencies. Additional partitioning and labeling of data can be left to private parties. Some precedence already exists for private party involvement here: Citability.org is working to enable citation of government documents at a paragraph level.

When agencies add labels, they should strive to use the same reference schemes used internally. Unfortunately, labeling schemes utilizing Roman numerals, letters, or almost anything other than Arabic numerals (0, 1, 2, etc.) can be hard to process. For these cases, the agency should include two labels: an internal agency label and a numeric label. While this suggestion runs counter to our rule against redundancy, it makes the labels far easier to process and facilities easy translation between both schemes.

In general, however, the lessons from past posts should be kept in mind when labeling, including the points about avoiding redundancy: the label for Part 2 of a document should appear in element names and attributes (e.g., "<PART LABEL="2">[...]</PART>") rather than text. Labels should uniquely identify an element among those with the same parent, but a label may not be necessary if an element's type is unique among its siblings.

To make these recommendations more concrete, we end with an example. Consider the following document:

  Notice 2982:  Proposal to Increase Public Transit Fees

  Section I.  Budget Shortfall
  In fiscal year 2009, [...]
  Unless changes are made [...]

  Section II.  Decreasing the Deficit
  To compensate for [...]
  This relatively modest [...]

This document could be represented in a dataset as:

<DATASET>
  [...]
  <NOTICE LABEL="2982">
    <TITLE>Proposal to Increase Public Transit Fees</TITLE>
    <SECTION AGENCY_LABEL="I" LABEL="1">
      <TITLE>Budget Shortfall</TITLE>
      <PARAGRAPH LABEL="1">In fiscal year 2009, [...]</PARAGRAPH>
      <PARAGRAPH LABEL="2">Unless changes are made [...]</PARAGRAPH>
    </SECTION>
    <SECTION AGENCY_LABEL="II" LABEL="2">
      <TITLE>Decreasing the Deficit</TITLE>
      <PARAGRAPH LABEL="1">To compensate for [...]</PARAGRAPH>
      <PARAGRAPH LABEL="2">This relatively modest [...]</PARAGRAPH>
    </SECTION>
  </NOTICE>
  [...]
</DATASET>

Among other things, we can uniquely reference the notice (Notice 2982) and each paragraph (e.g., Notice 2982, Section II, paragraph 1).

In our next post, we'll discuss how agencies can handle errors and make other changes while reducing the strain on developers.

Basic Data Format Lessons

[This is the second post in a series on best practices for government datasets by Harlan Yu and me. (previous post)]

When creating a dataset, the preferences of developers may not be obvious to those producing the dataset. Seemingly innocuous choices by data providers can lead to major headaches for developers. In this post, we discuss some of the more basic challenges that developers encounter when working with a dataset. These lessons may seem trivial to our more technical readers, but they're often learned through experience. Our hope is to reduce this learning curve by explaining how various practices affect developers. We'll focus on XML datasets, but many of the topics apply to CSV and other data formats.

One of the hardest parts of working with a dataset can be figuring out what's in it and how it's organized. What data comes inside an "<FL47>" tag? Can a "<TEXT>" element ever contain a "<PARAGRAPH>" element? Developers rely heavily on documentation to explain the structure and contents of a dataset. When working with XML, one particularly relevant item is known as a schema. An XML schema is a separate file with an extension such as ".dtd" or ".xsd," and it provides a blueprint of the permitted structure for corresponding XML files. XML schema files tell developers where they can recover the information that they need from a dataset. These schema files and other documentation are often a necessity for developers, and they should be treated as such by data providers. Any XML file supplied by an agency should contain a complete URL address at which its schema can be found. Further, any link to an XML document on a government site should have prominent links near it for the corresponding schema file and reasonable documentation describing the contents of the dataset.

XML schema files can be seen as an informal contract between data providers and developers, effectively promising that a dataset will match the specified structure. Unfortunately, sometimes datasets contain flaws causing them not to match that structure. Although experienced developers produce software that detects the existence of structural errors, these errors can be difficult or impossible for them to isolate and correct. The people in the best position to catch and fix structural errors are the people producing a dataset. Numerous validation tools exist for ensuring that an XML document is well-formed and valid—that is, the document is structurally sound and matches its XML schema. Prior to releasing a dataset, an agency should run a validator on it to check for structural flaws. This sanity check can take just a few moments for an agency but save hours of developer time.

When deciding on the structure of a dataset, an agency should strive for simplicity while logically representing the underlying data. The addition of elements, attributes, or children in a schema can improve the quality and clarity of the dataset, but it can also add unnecessary complexity. When designing schemas, there's a tendency to include elements or other structure that will almost certainly go unused in practice. Schema designers may assume that extraneous items do no harm, but developers must cautiously account for them if allowed by schema. The result can be wasted developer time and increased software complexity. The true cost of various structural choices is not just the time necessary to encode these choices in a schema but also the burden these choices impose on developers. Additional structural complexity must provide a justifiable benefit.

In some cases, however, the addition of elements or attributes is not only justifiable but highly desirable for developers: logically distinct pieces of data should appear in separate XML elements or attributes. Suppose that a developer wishes to access a piece of data in a dataset. If the data is combined with other information, the developer will need to figure out how to extract it from the combined field. This extraction can be difficult, time-consuming, and prone to errors. For example, assume that a data provider includes the following element:

<DOCINFO>Doc No. 2001345--Released 01-01-2001</DOCINFO>

To extract the document number, a developer might look for all characters following "No." but before a dash. While this is straightforward enough, other parts of the same or future datasets might instead use the document number format "2001-345" or separate the document number and release date with a space rather than a double-dash. Neither case would lead to invalid XML, but both would break the developer's extraction tool. Now consider this alternative:

<DOCINFO>
  <DOC_NO>2001345</DOC_NO>
  <RELEASE_DATE>01-01-2001</RELEASE_DATE>
</DOCINFO>

Using extra elements to separate logically distinct data can prevent extraction errors. This lesson often applies even when the combined data is related. For example, the version number 5.3.2 could be broken into major version 5, minor version 3, and revision 2. In general, agencies should separate such items themselves when they can do so more easily than developers.

Even when the basic structure of a dataset is ideal, choices about how to provide data inside this structure can affect developers. Developers thrive on consistency. Suppose that a dataset details various costs. Consider all possible ways of writing cost: $4,300, 5938.37, 74 dollars and 63 cents, etc. Unless an agency decides on, documents, and adheres to a standard format, developers' software must handle a large number of possibilities to avoid unexpected surprises. Consistency in a dataset can make a developer's life far easier, and it reduces the possibility that surprises will break an application. Note that a schema can be helpful for enforcing consistency for certain fields—for example, cost might be defined as a decimal field with a constraint on the number of fractional digits.

Redundant information is another source of difficulty for developers. Redundancy can appear in numerous ways. Suppose that a dataset contains the element "<VERSION>Version 5</VERSION>." The word "Version" is unnecessary, and developers must go through additional trouble to extract the version number. In so doing, developers must consider the possibility that "Version" could be misspelled, abbreviated, or omitted. Supplying a version number alone ("<VERSION>5</VERSION>") would avoid this issue altogether. More subtly, suppose that a dataset contains all bills introduced in Congress on a certain date:

<INTRODUCED_BILLS>
  <DATE>11-12-2014</DATE>
  <HOUSE_BILLS DATE="NOV 12, 2014">
    [...]
  </HOUSE_BILLS>
  <SENATE_BILLS DATE="NOV 12, 2014">
    [...]
  </SENATE_BILLS>
</INTRODUCED_BILLS>

Date information appears three times even though it must be the same in all cases. The more often a piece of information appears in a dataset, the more likely that inconsistencies will occur. These inconsistencies can lead to software errors requiring manual resolution. While redundancy can serve as a sanity check for errors, agencies typically should perform this check themselves if possible before releasing the data. After all, the agency is in the best position to fix inconsistencies. Unless well-justified, agencies should avoid redundancy.

Processing datasets often requires a significant amount of developer time, so adherence to even basic rules can dramatically increase innovation. What other low-level recommendations do FTT readers have for non-developers producing datasets?

Tomorrow, we'll discuss how labeling elements in a dataset can help developers.

Government Datasets That Facilitate Innovation

[This is the first post in a series on best practices for government datasets by Harlan Yu and me.]

There's a growing consensus that the government can increase its openness and transparency by publishing its raw data in bulk online. As several Freedom to Tinker contributors argued in Government Data and the Invisible Hand, publishing data empowers third party software developers to produce innovative new technologies that engage citizens and illuminate government's inner workings. With the establishment of Data.gov and the federal Open Government Initiative, federal agencies are quickly embracing a culture of machine-readable data release, and many states and municipalities are now following their lead.

But how usable are these datasets for developers? The answer lies primarily in the structure and contents of the datasets themselves. While all data in digital form is technically machine-readable in some sense, the ease of use for machine-readable datasets can vary widely. In fact, machine-readability is just a baseline requirement: a developer can't start to work with a dataset until it's in this form. Once that minimum standard is met, the critical factor is how easy it is for developers to use the dataset in new, innovative ways.

In this series of posts, we'll draw on our experience building applications that use government data to offer some thoughts about best practices government could follow in releasing data. By taking a few straightforward steps in preparing its datasets, government can make the data much more useful to developers.

One key factor in determining ease of use for developers is the structure of the dataset, and that is the topic of our first post. Let's start with a trivial example:

<BOOK>A Tale of Two Cities by Charles Dickens. Chapter 1. The Period. It was the best of times, it was the worst of times [...] The end.</BOOK>

This is a "well-formed" XML version of Dicken's "A Tale of Two Cities" in its entirety. Though more usable than a PDF copy of the book, the XML document lacks basic structure and is not particularly helpful to a developer building tools to display or analyze the book. Compare that to:

<BOOK>
  <HEADER>
    <TITLE>A Tale of Two Cities</TITLE>
    <AUTHOR>Charles Dickens</AUTHOR>
  </HEADER>
  <BODY>
    <CHAPTER NUMBER="1">
      <TITLE>The Period</TITLE>
      <PARAGRAPH NUMBER="1">
        <SENTENCE NUMBER="1">It was the best of times [...]</SENTENCE>
      </PARAGRAPH>
      [...]
    </CHAPTER>
    [...]
  </BODY>
</BOOK>

This data is far more structured, and a developer can take it and immediately do lots of new things. If the developer plans to build an interface for a new e-book reader for instance, it's easy to extract the component parts of the book for appropriate formatting. With the less-structured version, the developer needs to guess where chapters, titles, and paragraphs begin and end. Because manual analysis is infeasible for large, complex datasets, developers who have only minimally-structured data will need to build automated processing scripts to make these guesses. Developing these scripts can be difficult and time-consuming, and data quality will suffer because the scripts will inevitably make mistakes.

Whether a dataset facilitates innovative uses by developers is not a yes or no question but a matter of degree, and it depends largely on the quality of the data's structure and the needs of specific developers. In deciding what structure to add, agencies should consider who is in the best position to add various types of structure to the data. Sometimes, the agency is in the best position. Employees of an agency may amass specialized knowledge about the data, or the agency may already internally store the data with structural details like explicit database columns. In these cases, the agency can provide this structure with little effort, relieving developers from the potentially Herculean task of reconstructing these details. In other cases, the agency may have no significant advantage over private parties.

Agencies should get as close to this dividing line as is reasonably possible to broaden the range of creative possibilities for application developers. The goal is to minimize structural obstacles that might prevent developers from tinkering with the data. Better structure leads to more innovative tools, a more transparent government, and a greater appreciation for the work done by federal agencies.

Over our next several posts, we'll discuss choices that agencies make when releasing datasets and the ways these choices affect developers. Among other things, we'll explore basic data format lessons, data labeling, and correction/modification of datasets. Our goal is to turn this series into a best practices white paper for government use, and we'd appreciate any comments, suggestions, or insights from readers.

Information Technology Policy in the Obama Administration, One Year In

[Last year, I wrote an essay for Princeton's Woodrow Wilson School, summarizing the technology policy challenges facing the incoming Obama Administration. This week they published my follow-up essay, looking back on the Administration's first year. Here it is.]

Last year I identified four information technology policy challenges facing the incoming Obama Administration: improving cybersecurity, making government more transparent, bringing the benefits of technology to all, and bridging the culture gap between techies and policymakers. On these issues, the Administration's first-year record has been mixed. Hopes were high that the most tech-savvy presidential campaign in history would lead to an equally transformational approach to governing, but bold plans were ground down by the friction of Washington.

Cybersecurity : The Administration created a new national cybersecurity coordinator (or "czar") position but then struggled to fill it. Infighting over the job description -- reflecting differences over how to reconcile security with other economic goals -- left the czar relatively powerless. Cyberattacks on U.S. interests increased as the Adminstration struggled to get its policy off the ground.

Government transparency: This has been a bright spot. The White House pushed executive branch agencies to publish more data about their operations, and created rules for detailed public reporting of stimulus spending. Progress has been slow -- transparency requires not just technology but also cultural changes within government -- but the ship of state is moving in the right direction, as the public gets more and better data about government, and finds new ways to use that data to improve public life.

Bringing technology to all: On the goal of universal access to technology, it's too early to tell. The FCC is developing a national broadband plan, in hopes of bringing high-speed Internet to more Americans, but this has proven to be a long and politically difficult process. Obama's hand-picked FCC chair, Julius Genachowski, inherited a troubled organization but has done much to stabilize it. The broadband plan will be his greatest challenge, with lobbyists on all sides angling for advantage as our national network expands.

Closing the culture gap: The culture gap between techies and policymakers persists. In economic policy debates, health care and the economic crisis have understandably taken center stage, but there seems to be little room even at the periphery for the innovation agenda that many techies had hoped for. The tech policy discussion seems to be dominated by lawyers and management consultants, as in past Administrations. Too often, policymakers still see techies as irrelevant, and techies still see policymakers as clueless.

In recent days, creative thinking on technology has emerged from an unlikely source: the State Department. On the heels of Google's surprising decision to back away from the Chinese market, Secretary of State Clinton made a rousing speech declaring Internet freedom and universal access to information as important goals of U.S. foreign policy. This will lead to friction with the Chinese and other authoritarian governments, but our principles are worth defending. The Internet can a powerful force for transparency and democratization, around the world and at home.

Syndicate content