November 21, 2024

What We Lose if We Lose Data.gov

In its latest 2011 budget proposal, Congress makes deep cuts to the Electronic Government Fund. This fund supports the continued development and upkeep of several key open government websites, including Data.gov, USASpending.gov and the IT Dashboard. An earlier proposal would have cut the funding from $34 million to $2 million this year, although the current proposal would allocate $17 million to the fund.

Reports say that major cuts to the e-government fund would force OMB to shut down these transparency sites. This would strike a significant blow to the open government movement, and I think it’s important to emphasize exactly why shuttering a site like Data.gov would be so detrimental to transparency.

On its face, Data.gov is a useful catalog. It helps people find the datasets that government has made available to the public. But the catalog is really a convenience that doesn’t necessarily need to be provided by the government itself. Since the vast majority of datasets are hosted on individual agency servers—not directly by Data.gov—private developers could potentially replicate the catalog with only a small amount of effort. So even if Data.gov goes offline, nearly all of the data still exist online, and a private developer could go rebuild a version of the catalog, maybe with even better features and interfaces.

But Data.gov also plays a crucial behind the scenes role, setting standards for open data and helping individual departments and agencies live up to those standards. Data.gov establishes a standard, cross-agency process for publishing raw datasets. The program gives agencies clear guidance on the mechanics and requirements for releasing each new dataset online.

There’s a Data.gov manual that formally documents and teaches this process. Each agency has a lead Data.gov point-of-contact, who’s responsible for identifying publishable datasets and for ensuring that when data is published, it meets information quality guidelines. Each dataset needs to be published with a well-defined set of common metadata fields, so that it can be organized and searched. Moreover, thanks to Data.gov, all the data is funneled through at least five stages of intermediate review—including national security and privacy reviews—before final approval and publication. That process isn’t quick, but it does help ensure that key goals are satisfied.

When agency staff have data they want to publish, they use a special part of the Data.gov website, which outside users never see, called the Data Management System (DMS). This back-end administrative interface allows agency points-of-contact to efficiently coordinate publishing activities agency-wide, and it gives individual data stewards a way to easily upload, view and maintain their own datasets.

My main concern is that this invaluable but underappreciated infrastructure will be lost when IT systems are de-funded. The individual roles and responsibilities, the informal norms and pressures, and perhaps even the tacit authority to put new datasets online would likely also disappear. The loss of structure would probably mean that sharply reduced amounts of data will be put online in the future. The datasets that do get published in an ad hoc way would likely lack the uniformity and quality that the current process creates.

Releasing a new dataset online is already a difficult task for many agencies. While the current standards and processes may be far from perfect, Data.gov provides agencies with a firm footing on which they can base their transparency efforts. I don’t know how much funding is necessary to maintain these critical back-end processes, but whatever Congress decides, it should budget sufficient funds—and direct that they be used—to preserve these critically important tools.

Assessing PACER's Access Barriers

The U.S. Courts recently conducted a year-long assessment of their Electronic Public Access program which included a survey of PACER users. While the results of the assessment haven’t been formally published, the Third Branch Newsletter has an interview with Bankruptcy Judge J. Rich Leonard that discusses a few high-level findings of the survey. Judge Leonard has been heavily involved in shaping the evolution of PACER since its inception twenty years ago and continues to lead today.

The survey covered a wide range of PACER users—“the courts, the media, litigants, attorneys, researchers, and bulk data collectors”—and Judge Leonard claims they found “a remarkably high level of satisfaction”: around 80% of those surveyed were “satisfied” or “very satisfied” with the service.

If we compare public access before we had PACER to where we are now, there is clearly much success to celebrate. But the key question is not only whether current users are satisfied with the service but also whether PACER is reaching its entire audience of potential users. Are there artificial obstacles preventing potential PACER users—who admittedly would be difficult to poll—from using the service? The satisfaction statistic may be fine at face value, assuming that a representative sample of users were polled, but it could be misleading if it’s being used to gauge the overall success of PACER as a public access system.

One indicator of obstacles may be another statistic cited by Judge Leonard: “about 45% of PACER users also use CM/ECF,” the Courts’ electronic case management and filing system. To put it another way, nearly half of all PACER users are currently attorneys who practice federal law.

That number seems inordinately high to me and suggests that significant barriers to public access may exist. In particular, account registration requires all users to submit a valid credit card for billing (or alternatively a valid home address to receive log-in credentials and billing statements by mail.) Even if users’ credit cards are never charged, this registration hurdle may already turn away many potential PACER users at the door.

The other barrier is obviously the cost itself. With a few exceptions, users are forced to pay a fee for each document they download, at a metered rate of eight-cents per page. Judge Leonard asserts that “surprisingly, cost ranked way down” in the survey and that “most people thought they paid a fair price for what they got.”

But this doesn’t necessarily imply that cost isn’t a major impediment to access. It may just be that those surveyed—primarily lawyers—simply pass the cost of using PACER down to their clients and never bear the cost themselves. For the rest of PACER users who don’t have that luxury, the high cost of access can completely rule out certain kinds of legal research, or cause users to significantly ration and monitor their usage (as is the case even in the vast majority of our nation’s law libraries), or wholly deter users from ever using the service.

Judge Leonard rightly recognizes that it’s Congress that has authorized the collection of user fees, rather than using general taxpayer money, to fund the electronic public access program. But I wish the Courts would at least acknowledge that moving away from a fee-based model, to a system funded by general appropriations, would strengthen our judicial process and get us closer to securing each citizen’s right to equal protection under the law.

Rather than downplaying the barriers to public access, the Courts should work with Congress to establish a way forward to support a public access system that is truly open. They should study and report on the extent to which Congress already funds PACER indirectly, through Executive and Legislative branch PACER fee payments to the Judiciary, and re-appropriate those funds directly. If there is a funding shortfall, and I assume there will be, they should study the various options for closing that gap, such as additional direct appropriations or a slight increase in certain filing fees.

With our other two branches of government making great strides in openness and transparency with the help of technology, the Courts similarly needs to transition away from a one-size-fits-all approach to information dissemination. Public access to the courts will be fundamentally transformed by a vigorous culture of civic innovation around federal court documents, and this will only happen if the Courts confront today’s access barriers head-on and break them down.

(Thanks to Daniel Schuman for pointing me to the original article.)

Broadband Politics and Closed-Door Negotiations at the FCC

The last seven days at the FCC have been drama-filled, and that’s not something you can often say about an administrative agency. As I noted in my last post, the FCC is considering reclassifying broadband as a “common carrier” service. This would subject the access portion of the service to some additional regulations which currently do not apply, but have (to some extent) been applied in the past. Last Thursday, the FCC voted 3-2 along party lines to pursue a Notice of Inquiry about this approach and others, in order to help solidify its ability to enforce consumer protections and implement the National Broadband Plan in the wake of the Comcast decision in the DC Circuit Court. There was a great deal of politicking and rhetoric around the vote. Then, on Monday, the Wall Street Journal reported that lobbyists were engaged in closed-door meetings at the FCC, discussing possible legislative compromises that would obviate the need for reclassification. This led to public outcry from everyone who was not involved in the meetings, and allegations of misconduct by the FCC for its failure to disclose the meetings. If you sit through my description of the intricacies of reclassification, I promise to give you the juicy bits about the controversial meetings.

The Reclassification Vote and the NOI
As I explained in my previous post, the FCC faces a dilemma. The DC Circuit said it did not have the authority under Title I of the Communications Act to enforce the broadband openness principles it espoused in 2005. This cast into doubt the FCC’s ability to not only police violations of the principles but also to implement many portions of the National Broadband Plan. In the past, the Commission would have had unquestioned authority under Title II of the Act, but in a series of decisions from 2002-2007 it voluntarily “deregulated” broadband by classifying it as a Title I service. Chairman Genachowski has floated what he calls a “Third Way” approach in which broadband is not classified as a Title I service anymore, and is not subject to all provisions of Title II, but instead is classified under Title II but with extensive “forbearance” from portions of that title.

From a legal perspective, the main question is whether the FCC has the authority to reclassify the transmission component of broadband internet service as a Title II service. This gets into intricacies of how broadband service fits into statutory definitions of “information service” (aka Title I), “telecommunications”, “telecommunications service” (aka Title II), and the like. I was going to lay these out in detail, but in the interest of getting to the juicy stuff I will simply direct you to Harold Feld’s excellent post. For the “Third Way” approach to work, the FCC’s interpretation of a “telecommunications service” will have to be articulated to include broadband internet access while not also swallowing a variety of internet services that everyone thinks should remain unregulated — sites like Facebook, content delivery networks like Akamai, and digital media providers like Netflix. However, this narrow definition must not be so narrow that the FCC does not have jurisdiction to police the types of practices it is concerned about (for instance, providers should not be able to discriminate in their delivery of traffic simply by moving the discrimination from their transport layer of the network to the logical layer, or by partnering with an affiliated “ISP” that does discrimination for them). I am largely persuaded of Harold’s arguments, but the AT&T lobbyists present the other side as well. One argument that I don’t see anyone making (yet) is that presuming the transmission component is subject to Title II, the FCC would seem to have a much stronger argument for exercising ancillary jurisdiction with respect to interrelated components like non-facilities-based ISPs that rely on that transmission component.

The other legal debate involves an even more arcane discussion about whether — assuming there is a “telecommunications service” offered as part of broadband service — that “telecommunications service” is something that can be regulated separately from the other “information services” (Title I) that might be offered along with it. This includes things like an email address from your provider, DNS, Usenet, and the like. Providers have historically argued that these were inseparable from the internet access component, and the so-called “Stevens Report” of 1998 introduced the notion that the “inextricably intertwined” nature of broadband service might have the result of classifying all such services as entirely Title I “information services.” To the extent that this ever made any sense, it is far from true today. What consumers believe they are purchasing is access to the internet, and all of those other services are clearly extricable from a definitional and practical standpoint (indeed, customers can and do opt for competitors for all of them on a regular basis).

But none of these legal arguments are at the fore of the current debate, which is almost entirely political. Witness, for example, John Boehner’s claim that the “Third Way” approach was a “government takeover of the Internet,” Fred Upton’s (R-MI) claim that the approach is a “blind power grab,” modest Democratic sign-on to an industry-penned and reasoning-free opposition letter, and an attempt by Republican appropriators to block funding for the FCC unless they swore off the approach. This prompted a strong response from Democratic leaders indicating that any such effort would not see the light of day. Ultimately, the FCC voted in favor of the NOI to explore the issue. Amidst this tumult, the WSJ reported that the FCC had started closed-door meetings with industry representatives in order to discuss a possible legislative compromise.

Possible Legislation and Secret Meetings
It is not against the rules to communicate with the FCC about active proceedings. Indeed, such communications are part of a healthy policymaking process that solicits input from stakeholders. The FCC typically conducts proceedings under the “permit but disclose” regime in which all discussions pertaining to the given proceeding must be described in “ex parte” filings on the docket. Ars has a good overview of the ex parte regime. The NOI passed last week is subject to these rules.

It therefore came as a surprise that a subset of industry players were secretly meeting with the FCC to discuss possible legislation that could make the NOI irrelevant. This issue is made even more egregious by the fact that the FCC just conducted a proceeding on improving ex parte disclosures, and the Chairman remarked:

“Given the complexity and importance of the issues that come before us, ex parte communications remain an essential part of our deliberative process. It is essential that industry and public stakeholders know the facts and arguments presented to us in order to express informed views.”

The Chairman’s Chief of Staff Edward Lazarus sought to explain away the obligation for ex parte disclosure, and nevertheless attached a brief disclosure letter from the meeting attendees that didn’t describe any of the details. There is perhaps a case to be made that the legislative options do not directly fall under the subject matter of the NOI, but even if this position were somehow legally justifiable it clearly falls afoul of the policy intent of the ex parte rules. Harold Feld has a great post in which he describes his nomination for “Worsht Ex Parte Ever“. The letter attached to the Lazarus post would certainly take the title if it were a formal ex parte letter. The industry participants in the meetings deserve some criticism, but ultimately the problems can only be resolved by the FCC by demanding comprehensive openness rather than perpetuating a culture of loopholes.

The public outcry continues, from both public interest groups and in the comments on the Lazarus post. If it’s true that the FCC admits internally that “they f*cked up”, they should do far more to regain the public’s trust in the integrity of the notice-and-comment process.

Update: The Lazarus post was just updated to replace the link to the brief disclosure letter with two new links to letters that describe themselves as Ex Parte letters. The first contains the exact same text as the original, and the second has a few bullet points.

Release Government Data, Early and Often

One of the key axioms of modern open government is that all public data should be published online in a raw but usable form. Usability in this case is aimed at software programmers. By making government datasets more usable, programmers are more likely to innovate in the civic sphere and build technologies, using the raw data, to enhance the relationships among citizens and with government.

The open government community has provided plenty of valuable guidance about what usability means for programmers. We proclaim that all datasets need to be: published in a format that is reasonably structured and machine-processable; well-documented; downloadable in bulk; authenticated using cryptographic digital signatures; version-controlled; permanent and citable; and the list goes on and on. These are all worthy principles to be sure, and all government datasets should strive to meet them.

But you’ll be hard-pressed to find any government datasets that exist with all of these principles pre-satisfied. While some are in better shape than others, most datasets would make programmers cringe. Data often only exist as informal working sets in proprietary Excel spreadsheets. Sometimes they are in structured databases, but schemas are undocumented, field values are ambiguous, and the semantics are only understood by the employee who created them. Datasets have errors and biases that are known but never explicitly corrected.

For a civil servant who is a data caretaker looking over the laundry list of publishing principles, there’s frequently a huge quality chasm between the dataset she owns and how people are asking to see it released. To her, publishing this data adequately just seems like a lot of extra work. The more attractive alternative is to put off the data publishing—it’s not in her job description or evaluations anyway—and move on to other work instead.

How can this chasm be bridged? A widely-adopted philosophy in software development and entrepreneurship would serve open government data well: release early and release often. And listen to your customers.

In the software development world, a working version of the product is pushed out as soon as possible even with known imperfections—an “alpha” release—so it can be subject to real use by early adopters. Early adopters can provide helpful feedback about what works, what’s broken, and what new features would be most useful to them. The software developers then iterate quickly. They incorporate the suggested fixes and features into their code and release an updated version of the product to their users. The virtuous cycle then starts again. Under this philosophy, software developers can be efficient about how to best improve their code where it matters, and users get software that works better and has more features they desire.

The “release early, release often” philosophy should be applied to government data. For the initial release, data caretakers should take the path of least resistance to get data out the door. This means publishing datasets in whatever format is most convenient, along with as much documentation as can reasonably be mustered. Documentation is especially important with an “alpha” dataset—proper warnings about its problems, instabilities and inductive limitations must be prominently displayed. (Of course, the usual privacy and legal caveats should also be applied.) Sometimes, the “alpha” release will be “good enough” for programmers to start their work, and this will minimize any superfluous work done by caretakers. This is the virtue of “release early.”

In other cases, programmers will need assistance using the dataset and will notice problem spots with the initial release. The dataset might be confusing, contain errors or be difficult to work with. A tight feedback mechanism allows the programmer to get help quickly and continue to innovate, while the data caretaker can fix problems based on real use cases and add clarifying metadata into an updated version of the dataset. Data quality and usability increases for those working with the dataset, both in and outside of government. That’s the virtue of “release often.”

And here is the big opportunity for government: no platform currently exists to engage the prime audience for government data—software programmers. Without a tight feedback mechanism, the virtuous cycle of mutual benefit cannot exist. Government is missing its best opportunity to improve data quality by neglecting useful feedback from programmers who are actually tinkering with the datasets. Society is losing out on potentially game-changing civic innovations, which otherwise would have been built if data were more usable and the uncertainty of failure reduced.

A terrific start in turning the corner would be for government to adopt an issue-tracking system for its datasets. As a public venue, it would help ensure that data caretakers are prompt in addressing developer concerns. It would also allow caretakers to organize feedback in a formal way. Such platforms are commonplace in any successful software development venture. The same needs to be true for government data in order to drive rapid quality improvements and increase developer engagement.

CITP Expands Scope of RECAP

Today, we’re thrilled to announce the next version of our RECAP technology, dramatically expanding the scope of the project.

Having had some modest success at providing public access to legal documents, we’re now taking the next logical step, offering easy public access to illegal documents.

The Internet Archive, which graciously hosts RECAP’s repository of legal documents, was strangely unreceptive to our offer to let them store the world’s most comprehensive library of illegal documents. Fortunately, the Pirate Bay was happy to step in and help.

Interested in seeing what’s available? Then you might want to watch our brief instructional video.