November 21, 2024

Government Datasets That Facilitate Innovation

[This is the first post in a series on best practices for government datasets by Harlan Yu and me.]

There’s a growing consensus that the government can increase its openness and transparency by publishing its raw data in bulk online. As several Freedom to Tinker contributors argued in Government Data and the Invisible Hand, publishing data empowers third-party software developers to produce innovative new technologies that engage citizens and illuminate government’s inner workings. With the establishment of Data.gov and the federal Open Government Initiative, federal agencies are quickly embracing a culture of machine-readable data release, and many states and municipalities are now following their lead.

But how usable are these datasets for developers? The answer lies primarily in the structure and contents of the datasets themselves. While all data in digital form is technically machine-readable in some sense, the ease of use for machine-readable datasets can vary widely. In fact, machine-readability is just a baseline requirement: a developer can’t start to work with a dataset until it’s in this form. Once that minimum standard is met, the critical factor is how easy it is for developers to use the dataset in new, innovative ways.

In this series of posts, we’ll draw on our experience building applications that use government data to offer some thoughts about best practices government could follow in releasing data. By taking a few straightforward steps in preparing its datasets, government can make the data much more useful to developers.

One key factor in determining ease of use for developers is the structure of the dataset, and that is the topic of our first post. Let’s start with a trivial example:

<BOOK>A Tale of Two Cities by Charles Dickens. Chapter 1. The Period. It was the best of times, it was the worst of times [...] The end.</BOOK>

This is a “well-formed” XML version of Dickens’s “A Tale of Two Cities” in its entirety. Though more usable than a PDF copy of the book, the XML document lacks basic structure and is not particularly helpful to a developer building tools to display or analyze the book. Compare that to:

<BOOK>
  <HEADER>
    <TITLE>A Tale of Two Cities</TITLE>
    <AUTHOR>Charles Dickens</AUTHOR>
  </HEADER>
  <BODY>
    <CHAPTER NUMBER="1">
      <TITLE>The Period</TITLE>
      <PARAGRAPH NUMBER="1">
        <SENTENCE NUMBER="1">It was the best of times [...]</SENTENCE>
      </PARAGRAPH>
      [...]
    </CHAPTER>
    [...]
  </BODY>
</BOOK>

This data is far more structured, and a developer can take it and immediately do lots of new things. If the developer plans to build an interface for a new e-book reader, for instance, it’s easy to extract the component parts of the book for appropriate formatting. With the less-structured version, the developer needs to guess where chapters, titles, and paragraphs begin and end. Because manual analysis is infeasible for large, complex datasets, developers who have only minimally structured data will need to build automated processing scripts to make these guesses. Developing these scripts can be difficult and time-consuming, and data quality will suffer because the scripts will inevitably make mistakes.
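
To make this concrete, here is a minimal sketch, in Python using the standard xml.etree.ElementTree module, of how a developer might pull titles, chapters, and sentences out of the structured version above. The file name is hypothetical, and the snippet assumes the full book is marked up as in the excerpt.

import xml.etree.ElementTree as ET

# Parse the structured XML shown above (the file name is hypothetical).
tree = ET.parse("tale_of_two_cities.xml")
book = tree.getroot()

# Bibliographic details are explicit elements, not guesses.
title = book.findtext("HEADER/TITLE")
author = book.findtext("HEADER/AUTHOR")
print(title, "by", author)

# Chapters, paragraphs, and sentences are already labeled, so an
# e-book interface can format each piece without heuristics.
for chapter in book.findall("BODY/CHAPTER"):
    print("Chapter", chapter.get("NUMBER"), "-", chapter.findtext("TITLE"))
    for paragraph in chapter.findall("PARAGRAPH"):
        for sentence in paragraph.findall("SENTENCE"):
            print("   ", sentence.text)

With the single-tag version at the top of this post, each of these lookups would instead require a hand-written heuristic for guessing where titles, chapters, and sentences begin and end.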

Whether a dataset facilitates innovative uses by developers is not a yes or no question but a matter of degree, and it depends largely on the quality of the data’s structure and the needs of specific developers. In deciding what structure to add, agencies should consider who is in the best position to add various types of structure to the data. Sometimes, the agency is in the best position. Employees of an agency may amass specialized knowledge about the data, or the agency may already internally store the data with structural details like explicit database columns. In these cases, the agency can provide this structure with little effort, relieving developers from the potentially Herculean task of reconstructing these details. In other cases, the agency may have no significant advantage over private parties.

To broaden the range of creative possibilities for application developers, agencies should get as close to this dividing line as is reasonably possible, supplying all of the structure that they are better positioned than outside developers to provide. The goal is to minimize structural obstacles that might prevent developers from tinkering with the data. Better structure leads to more innovative tools, a more transparent government, and a greater appreciation for the work done by federal agencies.

Over our next several posts, we’ll discuss choices that agencies make when releasing datasets and the ways these choices affect developers. Among other things, we’ll explore basic data format lessons, data labeling, and correction/modification of datasets. Our goal is to turn this series into a best practices white paper for government use, and we’d appreciate any comments, suggestions, or insights from readers.

Federal Health IT Effort Is Making Progress, Could Benefit from More Transparency

President Obama has indicated that health information technology (HIT) is an important component of his administration’s health care goals. Politicians on both sides of the aisle have lauded the potential for HIT to reduce costs and improve care. In this post, I’ll give some basics about what HIT is, what work is underway, and how the government can get more security experts involved.

We can coarsely break HIT into three technical areas. The first is the transition from paper to electronic records, which involves a surprising number of subtle technical issues, like interoperability. Second, the development of health information networks will allow sharing of patient data between medical facilities and with other appropriate parties. Third, as a recent National Research Council report discusses, digital records can enable research in new areas, such as cognitive support for physicians.

HIT was not created on the 2008 campaign trail. The Department of Veterans Affairs (VA) has done work in this area for decades, including its widely praised VistA system, which provides electronic patient records and more. Notably, VistA source code and documentation can be freely downloaded. Many other large medical centers also already use electronic patient records.

In 2004, then-President Bush pushed for deployment of a Nationwide Health Information Network (NHIN) and universal adoption of electronic patient records by 2014. The NHIN is essentially a nationwide network for sharing relevant patient data (e.g., if you arrive at an emergency room in Oregon, the doctor can obtain needed records from your regular doctor in Kansas). The Department of Health and Human Services (HHS) funded four consortia to develop smaller, localized networks, partially as a learning exercise to prepare for the NHIN. HHS has held a number of forums where members of these consortia, the government, and the public can meet and discuss timely issues.

The agendas for these forums show some positive signs. Sessions cover a number of tricky issues. For example, participants in one session considered the risk that searches for a patient’s records in the NHIN could yield records for patients with similar attributes, posing privacy concerns. Provided that meaningful conversations occurred, HHS appears to be making a concerted effort to ensure that issues are identified and discussed before settling on solutions.

Unfortunately, the academic information security community seems divorced from these discussions. Members of that community will almost certainly analyze the proposed systems eventually, whether before or after they are widely deployed, and analysis before deployment would be far more useful. Despite the positive signs mentioned above, past experience shows that even skilled developers can produce insecure systems. Any major flaws uncovered may be embarrassing, but weaknesses found now would be cheaper and easier to fix than ones found in 2014.

A great way to draw constructive scrutiny is to ensure transparency in federally funded HIT work. Limited project details are often available online, but both high- and low-level details can be hard to find. Presumably, members of the NHIN consortia (for example) developed detailed internal documents containing use cases, perceived risks/threats, specifications, and architectural illustrations.

To the extent legally feasible, the government should make documents like these available online. Access to them would make the projects easier to analyze, particularly for those of us less familiar with HIT. In addition, a typical vendor response to reported vulnerabilities is that the attack scenario is unrealistic (this is a standard response of e-voting vendors). Researchers can use these documents to ensure that they consider only realistic attacks.

The federal agenda for HIT is ambitious and will likely prove challenging and expensive. To avoid massive, costly mistakes, the government should seek to get as many eyes as possible on the work that it funds.

District Court Ruling in MDY v. Blizzard

Today, an Arizona District Court issued its ruling in the MDY v. Blizzard case, which involves contract, copyright, and DMCA claims. The claims addressed at trial were fairly limited because the Court entered summary judgment on several claims last summer. In-court comments by lawyers suggest that the case is headed toward appeal in the Ninth Circuit. Since I served as an expert witness in the case, I’ll withhold comment in this forum at this time, but readers are free to discuss it.

Economic Growth, Censorship, and Search Engines

Economic growth depends on an ability to access relevant information. Censorship directly prevents access to certain information, and those direct consequences are well known and somewhat predictable. For example, blocking access to Falun Gong literature is unlikely to harm a country’s consumer electronics industry. On the web, however, information of all types is interconnected. Blocking a web page might have an indirect impact reaching well beyond that page’s contents. To understand this impact, let’s consider how search results are affected by censorship.

Search engines keep track of what’s available on the web and suggest useful pages to users. No comprehensive list of web pages exists, so search providers check known pages for links to unknown neighbors. If a government blocks a page, all links from the page to its neighbors are lost. Unless detours exist to the page’s unknown neighbors, those neighbors become unreachable and remain unknown. These unknown pages can’t appear in search results — even if their contents are uncontroversial.
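
As a rough illustration of that crawling effect, here is a minimal sketch in Python of a breadth-first crawler over a toy link graph. The page names and links are invented for the example; the point is only that a page reachable solely through a blocked page never enters the index.

from collections import deque

# Toy link graph: each page maps to the pages it links to.
# The pages and links are invented purely for illustration.
links = {
    "portal":       ["news", "blocked_page"],
    "news":         ["portal"],
    "blocked_page": ["recipes"],   # the only link pointing to "recipes"
    "recipes":      [],
}

def crawl(seeds, blocked):
    # Breadth-first crawl that skips blocked pages entirely,
    # so their outgoing links are never followed.
    index, frontier = set(), deque(seeds)
    while frontier:
        page = frontier.popleft()
        if page in blocked or page in index:
            continue
        index.add(page)
        frontier.extend(links.get(page, []))
    return index

print(crawl(["portal"], blocked=set()))             # finds all four pages
print(crawl(["portal"], blocked={"blocked_page"}))  # never discovers "recipes"

The uncontroversial “recipes” page drops out of the index not because anyone objected to it, but because every path to it ran through a blocked page.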

When presented with a query, search engines respond with relevant known pages sorted by expected usefulness. Censorship also affects this sorting process. In predicting usefulness, search engines consider both the contents of pages and the links between pages. Links here are like friendships in a stereotypical high school popularity contest: the more popular friends you have, the more popular you become. If your friend moves away, you become less popular, which makes your friends less popular by association, and so on. Even people you’ve never met might be affected.

“Popular” web pages tend to appear higher in search results. Censoring a page distorts this popularity contest and can change the order of even unrelated results. As more pages are blocked, the censored view of the web becomes increasingly distorted. As an aside, Ed notes that blocking a page removes more than just the offending material. If censors block Ed’s site due to an off-hand comment on Falun Gong, he also loses any influence he has on information security.
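
The ranking side can be sketched the same way. The snippet below runs a simplified PageRank-style computation, again in Python on an invented link graph, once on the full graph and once with a single page censored; it is a toy model, not any search engine’s actual algorithm.

# Simplified PageRank-style scoring on a toy graph; the pages, links,
# and parameters are invented for illustration only.
def pagerank(links, damping=0.85, iterations=100):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            targets = [t for t in outlinks if t in links]
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

links = {
    "hub":      ["b", "censored"],
    "censored": ["a"],
    "a":        ["hub", "d"],
    "b":        ["hub"],
    "d":        ["hub"],
}

full = pagerank(links)
filtered = pagerank({p: t for p, t in links.items() if p != "censored"})

# "a" outranks "b" when the whole graph is visible but falls below it
# once the censored page disappears, and the score of "d" drops too,
# even though no censored page ever linked to it.
print(sorted(full, key=full.get, reverse=True))
print(sorted(filtered, key=filtered.get, reverse=True))

Blocking one page changes the scores, and even the relative order, of pages that were never blocked themselves.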

These effects would typically be rare and have a disproportionately small impact on popular pages. Google’s emphasis on the long tail, however, suggests that considerable value lies in providing high-quality results covering even less-popular pages. To avoid these issues, a government could grant a limited set of individuals full web access so that they can build tools like search engines, but this approach seems likely to stifle competition and innovation.

Countries with greater censorship might produce lower-quality search engines of their own, but Google, Yahoo, Microsoft, and others can provide high-quality search results in those countries because they can access uncensored data, mitigating the indirect effects of censorship. This underscores the significance of measures like the Global Network Initiative, whose participants include Google, Yahoo, and Microsoft. Among other things, the initiative provides guidelines for participants regarding when and how information access may be restricted. The effectiveness of this specific initiative remains to be seen, but such measures may give leading search engines greater leverage to resist arbitrary censorship.

Search engines are unlikely to be the only tools adversely impacted by the indirect effects of censorship. Any tool that relies on links between information (think social networks) might be affected, and repressive states place themselves at a competitive disadvantage in developing these tools. Future developments might make these points moot: in a recent talk at the Center, Ethan Zuckerman mentioned tricks and trends that might make censorship more difficult. In the meantime, however, governments that censor information may increasingly find that they do so at their own expense.