
Archives for March 2010

Round 2 of the PACER Debate: What to Expect

The past year has seen an explosion of interest in free access to the law. Indeed, something of a movement appears to be coalescing around the issue, due in no small part to the growing Law.gov effort (see the latest list of events). One subset of this effort is our work on PACER, the online document access system for the federal courts. We contend that access to electronic court records should be free (see posts from me, Tim, and Harlan). Our RECAP project helps make some of these documents more accessible, and has gained adoption far above our expectations. That being said, RECAP doesn’t solve the fundamental problem: the federal government needs to publish the full public record for free online. Today, this argument came from an unlikely source, the FCC’s National Broadband Plan.

RECOMMENDATION 15.1: the primary legal documents of the federal government should be free and accessible to the public on digital platforms. […]

– For the Judicial branch, this should apply to all judicial opinions.

[…] Finally, all federal judicial decisions should be accessible for free and made publicly available to the people of the United States. Currently, the Public Access to Court Electronic Records system charges for access to federal appellate, district and bankruptcy court records.[7] As a result, U.S. federal courts pay private contractors approximately $150 million per year for electronic access to judicial documents.[8] [Steve note: The correct figure is $150m over 10 years. However it is quite possible that the federal government as a whole spends $150m or more per year for access to case materials.] While the E-Government Act has mandated that this system change so that this information is as freely available as possible, little progress has been made.[9] Congress should consider providing sufficient funds to publish all federal judicial opinions, orders and decisions online in an easily accessible, machine-readable format.

[7] See Public Access To Court Electronic Records—Overview, http://pacer.psc.uscourts.gov/pacerdesc.html (last visited Jan. 7, 2010).
[8] Carl Malamud, President and CEO, Public.Resource.Org, By the People, Address at the Gov 2.0 Summit, Washington, D.C. 25 (Sept. 10, 2009), available at http://resource.org/people/3waves_cover.pdf
[9] See Letter from Sen. Joseph I. Lieberman to Carl Malamud, President and CEO, Public.Resource.Org (Oct. 13, 2009), available at http://bulk.resource.org/courts.gov/foia/gov.senate.lieberman_20091013_from.pdf

This issue is outside of the Commission’s direct jurisdiction, but the Broadband Plan is intended as a blueprint for the federal government as a whole. In that context, the notion of ensuring that primary legal materials are available for free online fits perfectly with a broader effort to make government digitally accessible. In a similar vein, a bill was introduced today by Rep. Israel. The Public Online Information Act, backed by the Sunlight Foundation, creates a new federal advisory committee to advise all three branches of government on how to make government information available online for free.

To establish an advisory committee to issue nonbinding government-wide guidelines on making public information available on the Internet, to require publicly available Government information held by the executive branch to be made available on the Internet, to express the sense of Congress that publicly available information held by the legislative and judicial branches should be available on the Internet, and for other purposes.

These two developments are the first of what I expect to be many announcements in the coming months, coming from places like the transparency caucus. These announcements will share a theme — there is a growing mandate for universal free access to government information, and judicial information is a key component of that mandate. These requirements will increasingly go to the heart of full free access to the public record, and will reveal the discrepancies between different branches in this regard.

The FCC’s language doesn’t quite get everything right. Most notably, the language focuses on opinions even though there are other components of the record that are key to the public’s understanding of the law. Opinions on PACER are already theoretically free, but the kludgy system for accessing them doesn’t include all of the opinions, isn’t indexable by search engines, and gives only a minimal amount of information about the case that each is a part of. Furthermore, both the docket text required to understand the context and the search functionality required to find the opinions require a fee. Subsequent calls for free access to case materials will have to be more holistic than the opinions-only language of the Broadband Plan.

The POIA language is also a step forward. A federal advisory committee is a good thing in the context of a branch that is more accustomed to the adversarial process than to notice-and-comment. However, we will need much more concrete requirements before we will have achieved our goals.

In the context of these announcements, the Administrative Office of the Courts made its own announcement today. The Judicial Conference has voted in favor of two measures that make incremental improvements to the current pay-wall model of access to PACER.

  • Adjust the Electronic Public Access fee schedule so that users are not billed unless they accrue charges of more than $10 of PACER usage in a quarterly billing cycle, in effect quadrupling the amount of data available without charge. Currently, users are not billed until their accounts total at least $10 in a one-year period.
  • Approve a pilot in up to 12 courts to publish federal district and bankruptcy court opinions via the Government Printing Office’s Federal Digital System (FDsys) so members of the public can more easily search across opinions and across courts.

These are minor tweaks on a fundamentally limited system. Don’t get me wrong — a world with these changes is better than a world without. It is slightly easier to avoid spending more than $10 in a given quarter than in a given year, but it’s nevertheless likely that you will do so unless you know exactly what you are looking for and retrieve only a few documents. It’s also good to establish precedent for GPO publishing case materials, but that doesn’t require a limited trial that could end in bureaucratic quagmire. The GPO can handle publishing many documents, and any reasonably qualified software engineer could figure out how to deliver them in short order. What’s more, the courts could provide universal free public access today, with zero engineering work: offer a single PACER login that is never billed or, better yet, just stop billing all accounts.

The next round of the PACER debate will be over whether we make a fundamental change in access to federal court records or settle for minor tweaks and call it a day.

Global Internet Freedom and the U.S. Government

Over the past two weeks I’ve testified in both the Senate and the House on how the U.S. should advance “Internet freedom.” I submitted written testimony for both hearings, which can be downloaded in PDF form here and here. Full transcripts will become available eventually, but meanwhile you can click here to watch the Senate video and here to watch the House video. In both hearings I advocated a combination of corporate responsibility through the Global Network Initiative, backed up by appropriate legislation given that some companies seem reluctant to hold themselves accountable voluntarily; revision of export controls and sanctions; and, finally, funding and support for tools, technologies, and activism platforms that will counteract suppression of online speech.

Lawmakers are moving forward to support research and technical development. On February 4th, Rep. David Wu [D-OR] and Rep. Frank Wolf [R-VA] introduced the Internet Freedom Act of 2010, which would establish an Internet Freedom Foundation. The bill’s core section reads:

(a) ESTABLISHMENT OF THE INTERNET FREEDOM FOUNDATION. – The National Science Foundation shall establish the Internet Freedom Foundation. The Internet Freedom Foundation shall –

(1) award competitive, merit-reviewed grants, cooperative agreements, or contracts to private industry, universities, and other research and development organizations to develop deployable technologies to defeat Internet suppression and censorship; and
(2) award incentive prizes to private industry, universities, and other research and development organizations to develop deployable technologies to defeat Internet suppression and censorship.

(b) LIMITATION ON AUTHORITY. – Nothing in this Act shall be interpreted to authorize any action by the United States to interfere with foreign national censorship in furtherance of law enforcement aims that are consistent with the International Covenant on Civil and Political Rights.

Whoever runs this foundation will have their work cut out for them in sorting out its strategies, goals, and priorities – and dealing with a great deal of thorny politics. The Falun Gong-affiliated Global Internet Freedom Consortium has been arguing that it was unfairly passed over for recent State Department grants, which were given to other groups working on different tools that help users get around Internet blocking – “circumvention tools,” as the technical community calls them. For the past year it has been engaged in an aggressive campaign to lobby Congress and the media to ensure it will get a slice of future funds. (For examples of the fruits of this media lobbying effort see here, here, and here.)

But the unfortunate bickering over who deserves government funding more than whom has distracted attention from the larger question of whether circumvention on its own is sufficient to defeat Internet censorship and suppression of online speech. In his recent blog post, Internet Freedom: Beyond Circumvention, my friend and former colleague Ethan Zuckerman warns against an over-focus on circumvention: “We can’t circumvent our way around internet censorship.” He summarizes his main points as follows:

– Internet circumvention is hard. It’s expensive. It can make it easier for people to send spam and steal identities.
– Circumventing censorship through proxies just gives people access to international content – it doesn’t address domestic censorship, which likely affects the majority of people’s internet behavior.
– Circumventing censorship doesn’t offer a defense against DDoS or other attacks that target a publisher.

While circumvention tools remain worthy of support as part of a basket of strategies, I agree with Ethan that circumvention is never going to be the silver bullet that some people make it out to be, for all the reasons he outlines in his blog post, which deserves to be read in full. As Ethan points out, as I pointed out in my own testimony, and as my research on Chinese blog censorship published last year has demonstrated, circumvention does nothing to help you access content that has been removed from the Internet completely – which is the main way that the Chinese government now censors the Chinese-language Internet. In my testimony I suggested several other tools and activities that require an equal amount of focus:

  • Tools and training to help people evade surveillance, detect spyware, and guard against cyber-attacks.
  • Mechanisms to preserve and re-distribute censored content in various languages.
  • Platforms through which citizens around the world can share “opposition research” about what different governments are doing to suppress online speech, and collaborate across borders to defeat censorship, surveillance, and attacks in ad-hoc, flexible ways as new problems arise during times of crisis.

As Ethan puts it:

– We need to shift our thinking from helping users in closed societies access blocked content to helping publishers reach all audiences. In doing so, we may gain those publishers as a valuable new set of allies as well as opening a new class of technical solutions.

– If our goal is to allow people in closed societies to access an online public sphere, or to use online tools to organize protests, we need to bring the administrators of these tools into the dialog. Secretary Clinton suggests that we make free speech part of the American brand identity – let’s find ways to challenge companies to build blocking resistance into their platforms and to consider internet freedom to be a central part of their business mission. We need to address the fact that making their platforms unblockable has a cost for content hosts and that their business models currently don’t reward them for providing service to these users.

Which brings us to the issue of corporate responsibility for free expression and privacy on the Internet. I’ve been working with the Global Network Initiative for the past several years to develop a voluntary code of conduct centered on a set of basic principles for free expression and privacy based on U.N. documents like the Universal Declaration of Human Rights, the International Covenant on Civil and Political Rights, and other international legal conventions. It is bolstered by a set of implementation guidelines and evaluation and accountability mechanisms, supported by a multi-stakeholder community of human rights groups, investors, and academics all dedicated to helping companies do the right thing and avoid making mistakes that restrict free expression and privacy on the Internet.

So far, however, only Google, Microsoft, and Yahoo have joined. Senator Durbin’s March 2nd Senate hearing focused heavily on the question of why other companies have so far failed to join, what it would take to persuade them to join, and, if they don’t join, whether laws should be passed that induce greater public accountability by companies on free expression and privacy. He has written letters to 30 U.S. companies in the information and communications technology (ICT) sector. He expressed great displeasure in the hearing with most of their responses, and further disappointment that no company (other than Google, which is already in the GNI) even had the guts to send a representative to testify at the hearing. Durbin announced that he will “introduce legislation that would require Internet companies to take reasonable steps to protect human rights or face civil or criminal liability.” It is my understanding that his bill is still under construction, and it’s not clear when he will introduce it (he’s been rather preoccupied with healthcare and other domestic issues, after all). Congressman Howard Berman (D-CA), who convened Wednesday’s House Foreign Affairs Committee hearing, is also reported to be considering his own bill. Rep. Chris Smith (R-NJ), the ranking Republican at that hearing, made a plug for the Global Online Freedom Act of 2009, a somewhat revised version of a bill that he first introduced in 2006.

I said at the hearing that the GNI probably wouldn’t exist if it hadn’t been for the threat of Smith’s legislation. I was not, however, asked my opinion on GOFA’s specific content. Since GOFA’s 2006 introduction I have critiqued it a number of times (see for example here, here, and here). As the years have passed – especially in the past year as the GNI got up and running yet most companies have still failed to engage meaningfully with it  – I have come to see the important role legislation could play in setting industry-wide standards and requirements, which companies can then tell governments they have no choice but to follow. That said, I continue to have concerns about parts of GOFA’s approach. Here is a summary of the current bill written by the Congressional Research Service (I have bolded the parts of concern):

5/6/2009–Introduced.
Global Online Freedom Act of 2009 – Makes it U.S. policy to: (1) promote the freedom to seek, receive, and impart information and ideas through any media; (2) use all appropriate instruments of U.S. influence to support the free flow of information without interference or discrimination; and (3) deter U.S. businesses from cooperating with Internet-restricting countries in effecting online censorship. Expresses the sense of Congress that: (1) the President should seek international agreements to protect Internet freedom; and (2) some U.S. businesses, in assisting foreign governments to restrict online access to U.S.-supported websites and government reports and to identify individual Internet users, are working contrary to U.S. foreign policy interests. Amends the Foreign Assistance Act of 1961 to require assessments of electronic information freedom in each foreign country. Establishes in the Department of State the Office of Global Internet Freedom (OGIF). Directs the Secretary of State to annually designate Internet-restricting countries. Prohibits, subject to waiver, U.S. businesses that provide to the public a commercial Internet search engine, communications services, or hosting services from locating, in such countries, any personally identifiable information used to establish or maintain an Internet services account. Requires U.S. businesses that collect or obtain personally identifiable information through the Internet to notify the OGIF and the Attorney General before responding to a disclosure request from an Internet-restricting country. Authorizes the Attorney General to prohibit a business from complying with the request, except for legitimate foreign law enforcement purposes. Requires U.S. businesses to report to the OGIF certain Internet censorship information involving Internet-restricting countries. Prohibits U.S. businesses that maintain Internet content hosting services from jamming U.S.-supported websites or U.S.-supported content in Internet-restricting countries. Authorizes the President to waive provisions of this Act: (1) to further the purposes of this Act; (2) if a country ceases restrictive activity; or (3) if it is the national interest of the United States.

My biggest concern has to do with the relationship GOFA would create between U.S. companies and the U.S. Attorney General. If the AG is made arbiter of whether content or account information requested by local law enforcement is for “legitimate law enforcement purposes” or not, that means the company has to share the information – which in the case of certain social networking services may include a great deal of non-public information about the user, who his or her friends are, and what they’re saying to each other in casual conversation. Letting the U.S. AG review the insides of this person’s account would certainly violate that user’s privacy. It also puts companies at a competitive disadvantage in markets where users – even those who don’t particularly like their own government – would consider an overly close relationship between a U.S. service provider and the U.S. government not to be in their interest. Take this hypothetical situation for example: An Egyptian college student decides to use a social networking site to set up a group protesting the arrest and torture of his brother. The Egyptian government demands the group be shut down and all account information associated with it handed over. In order to comply with GOFA, the company shares this student’s account information and all content associated with that protest group with the U.S. Attorney General. What is the oversight to ensure that this information is not retained and shared with other U.S. government agencies interested in going on a fishing expedition to explore friendships among members of different Egyptian opposition groups? Why should we expect that user to be ok with such a possibility?

Another difficult issue to get right – which gets even harder with the advent of cloud computing – is the question of where user data is physically housed. The Center for Democracy and Technology (PDF), Jonathan Zittrain, and others have discussed some of the regulatory difficulties of personally identifiable information and its location. In 2008 Zittrain wrote:

As Internet law rapidly evolves, countries have repeatedly and successfully demanded that information be controlled or monitored, even when that information is hosted outside their borders. Forcing US companies to locate their servers outside IRCs [Internet Restricting Countries] would only make their services less reliable; it would not make them less regulable.

If the goal of GOFA is to discourage US companies from violating human rights, then it will probably be successful. But if the goal of the Act is to make the Internet more free and more safe, and not just push rights violations on foreign companies, then more must be done.

Then there is the problem of Internet Restricting Country designations themselves. I have long argued that it is problematic to divide the world into “internet restricting countries” and countries where we can assume everything is just fine, not to worry, no human rights concerns present. First of all, I think the list itself will quickly turn into a political and diplomatic football, subject to huge amounts of lobbying and politics, which will make it very difficult to add new countries to the list. Secondly, regimes can change fast: in between annual revisions of the list you can have a coup or a rigged election whose victors demand that companies hand over dissident account information and censor political information, but companies are off the hook – having “done nothing illegal.” Finally, while I am not drawing moral equivalence between Italy and Iran, I do believe there is no country on earth, including the United States, where companies are not under pressure by government agencies to do things that arguably violate users’ civil rights. Policy that acknowledges this honestly is less likely to hurt U.S. companies in many parts of the world where the last thing they need is for people to be able to provide “documentary proof” that they are extensions of the U.S. government’s geopolitical agendas.

Therefore a more effective, ethically consistent, and less hypocritical approach to the three problems I’ve described above would be to codify strict global privacy standards absolutely everywhere U.S. companies operate. Companies should be required by law to notify all users anywhere in the world, in a way that is clear and culturally and linguistically understandable (to normal people, not just trained lawyers), exactly how and where their personally identifying information is being stored and used and who has access to it under what circumstances. If users are better informed about how their data is being used, they can use better judgment about how or whether to use different commercial services – and seek more secure alternatives when necessary, perhaps even using some of the new tools and platforms run by non-profit activist organizations that Congress is hoping to fund. Congress could further bolster the privacy of global users of U.S. services by adopting something akin to the Council of Europe Privacy Convention.

Regarding censorship: again, as the Internet evolves further with semi-private social networking sites and mobile services, we need to make sure that the information companies are required to share with the U.S. government doesn’t end up violating user privacy. I am doubtful that government agencies in some of the democracies unlikely to be put on the “internet restricting countries” list can really be trusted not to abuse the systems of censorship and intermediary liability that a growing number of democracies are implementing in the name of legitimate law enforcement purposes. Thus on censorship I also prefer global standards. There is real value in making companies retain internal records of the censorship requests that they receive all around the world, in the event of a challenge in U.S. court regarding the lawfulness of a particular act of censorship – a private right of action in U.S. court which GOFA or its equivalent would potentially enable. It’s also good to make companies establish clear and uniform procedures for how they handle censorship requests, so that they can prove, if challenged in court, that they are only responding to requests made in writing through official legal channels, rather than responding to requests that have no basis even in local law while claiming vaguely to the public that “we are only following local law.” Companies should be required to exercise maximum transparency with users about what is being censored, at whose behest, and according to which law exactly. Congress could, for example, mandate that the Chilling Effects Clearinghouse mechanism or something similar be utilized globally for all content takedowns.
(Originally posted at my blog, RConversation.)

Netflix Cancels the Netflix Prize 2

Today, Netflix announced it is canceling its plans for a second Netflix Prize contest, one that reportedly would have involved the release of more information than the first. As I argued earlier, I feared that the new contest would have put the supposedly private movie viewing and rating habits of Netflix customers at great risk, and I applaud Netflix for making a very responsible decision. No doubt, pressure from the private lawsuit and FTC investigation helped Netflix make up its mind, and both are reportedly going away as a result of today’s action.

Best Practices for Government Datasets: Wrap-Up

[This is the fifth and final post in a series on best practices for government datasets by Harlan Yu and me. (previous posts: 1, 2, 3, 4)]

For our final post in this series, we’ll discuss several issues not touched on by earlier posts, including data signing and the use of certain non-text file formats. The relatively brief discussions of these topics should not be interpreted as an indicator of their importance. The topics simply did not fit cleanly into earlier posts.

One significant omission from earlier posts is the issue of data signing with digital signatures. Before discussing this issue, let’s briefly discuss what a digital signature is. Suppose that you want to email me an IOU for $100. Later, I may want to prove that the IOU came from you—it’s of little value if you can claim that I made it up. Conversely, you may want the ability to prove whether the document has been altered. Otherwise, I could claim that you owe me $100,000.

Digital signatures help in proving the origin and authenticity of data. These signatures require that you create two related big numbers, known as keys: a private signing key (known only by you) and a public verification key. To generate a digital signature, you plug the data and your signing key into a complicated formula. The formula spits out another big number known as a digital signature. Given the signature and your data, I can use the verification key to prove that the data came unmodified from you. Similarly, nobody can credibly sign modified data without your signing key—so you should be very careful to keep this key a secret.
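
To make the mechanics concrete, here is a minimal sketch in Python using the third-party cryptography library (the library choice and the IOU message are our own illustrative assumptions, not part of any agency's workflow):

  from cryptography.exceptions import InvalidSignature
  from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

  signing_key = Ed25519PrivateKey.generate()    # private signing key: known only to the signer
  verification_key = signing_key.public_key()   # public verification key: shared with everyone

  data = b"IOU $100"
  signature = signing_key.sign(data)            # the "big number" computed from the data and the key

  try:
      verification_key.verify(signature, data)  # raises an exception if the data or signature changed
      print("Signature checks out")
  except InvalidSignature:
      print("Data or signature was altered")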

Developers may want to ensure the authenticity of government data and to prove that authenticity to users. At first glance, the solution seems to be a simple application of digital signatures: agencies sign their data, and anyone can use the signatures to authenticate an agency’s data. In spite of their initially steep learning curve, tools like GnuPG provide straightforward file signing. In practice, the situation is more complicated. First, an agency must decide what data to sign. Perhaps a dataset contains numerous documents. Developers and other users may want signatures not only for the full dataset but also for individual documents in it.

Once an agency knows what to sign, it must decide who will perform the signing. Ideally, the employee producing the dataset would sign it immediately. Unfortunately, this solution requires all such employees to understand the signature tools and to know the agency’s signing key. Widespread distribution of the signing key increases the risk that it will be accidentally revealed. Therefore, a central party is likely to sign most data. Once data is signed, an agency must have a secure channel for delivering the verification key to consumers of the data—users cannot confirm the authenticity of signed data without this key. While signing a given file with a given key may not be hard, surrounding issues are more tricky. We offer no simple solution here, but further discussion of this topic between government agencies, developers, and the public could be useful for all parties.
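
As a rough sketch of what this could look like in practice (assuming GnuPG is installed; the file name and key ID are placeholders), a central signing party could produce a detached signature for each released file, and a developer could verify a download against the published verification key:

  import subprocess

  DATASET = "dataset-2010-03.xml"              # placeholder name of a released file
  SIGNING_KEY = "releases@agency.example.gov"  # placeholder ID of the agency's signing key

  # Agency side: create a detached, ASCII-armored signature next to the file.
  subprocess.run(["gpg", "--batch", "--yes", "--local-user", SIGNING_KEY,
                  "--armor", "--detach-sign", "--output", DATASET + ".asc", DATASET],
                 check=True)

  # Developer side: verify the download (requires the agency's verification key to be imported).
  result = subprocess.run(["gpg", "--verify", DATASET + ".asc", DATASET])
  print("Signature OK" if result.returncode == 0 else "Verification failed")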

Another issue that earlier posts did not address is the use of non-text spreadsheet formats, including Microsoft Excel’s XLS format. These formats can sometimes be useful because they allow the embedding of formulas and other rich information along with the data. Unfortunately, these formats are far more complex than raw text formats, so they present a greater challenge for automated processing tools. A comma-separated value (CSV) file is a straightforward text format that contains values separated by line breaks and commas. It provides an alternative to complicated spreadsheet formats. For example, the medal count from the 2010 Winter Olympics in CSV would be:

  Country,Gold,Silver,Bronze,Total
  USA,9,15,13,37
  Germany,10,13,7,30
  Canada,14,7,5,26
  Norway,9,8,6,23
  ...

Fortunately, the release of data in one format does not preclude its release in another format. Most spreadsheet programs provide an option to save data in CSV form. Agencies should release spreadsheet data in a textual format like CSV by default, but an agency should feel free to also release the data in XLS or other formats.
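
Part of CSV's appeal is that developers can process it with a few lines of standard-library code. A sketch in Python, assuming the medal count above is saved as medals.csv:

  import csv

  with open("medals.csv", newline="") as f:
      for row in csv.DictReader(f):    # the header row supplies the field names
          print(row["Country"], row["Gold"], row["Total"])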

Similarly, agencies will sometimes release large files or groups of files in a compressed or bundled format (for example, ZIP, TAR, GZ, BZ). In these cases, agencies should prominently specify where users can freely obtain software and instructions for extracting the data. Because so many means of compressing and bundling files exist, agencies should not presume that the necessary tools and steps are obvious from the data files themselves.
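
The extraction step itself is usually small once the format is known; Python's standard library, for example, handles ZIP and gzipped TAR bundles directly (a sketch with placeholder file names):

  import tarfile
  import zipfile

  with zipfile.ZipFile("dataset.zip") as zf:           # ZIP bundle
      zf.extractall("dataset_zip/")

  with tarfile.open("dataset.tar.gz", "r:gz") as tf:   # gzipped TAR bundle
      tf.extractall("dataset_tar/")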

The rules suggested throughout this series should be seen as best practices rather than hard-and-fast rules. We are still in the process of fleshing out several of these ideas ourselves, and exceptional cases sometimes justify exceptional treatment. In unusual cases, an agency may need to deviate from traditional best practices, but it should carefully consider (and perhaps document) its rationale for doing so. Rules are made to be broken, but they should not be broken for mere expedience.

Our hope is that this series will provide agencies with some points to consider prior to releasing data. Because of Data.gov and the increasing traction of openness and transparency initiatives, we expect to see many more datasets enter the public domain in the coming years. Some agencies will approach the release of bulk data with minimal previous experience. While this poses a challenge, it also presents an opportunity for committed agencies to institute good practices early, before bad habits and poor-quality legacy datasets can accumulate. When releasing new datasets, agencies will make numerous conscious and unconscious choices that impact developers. We hope to help agencies understand developers’ challenges when making these choices.

After gathering input from the community, we plan to create a technical report based on this series of posts. Thanks to numerous readers for insightful feedback; your comments have influenced and clarified our thoughts. If any FTT readers inside or outside of government have additional comments about this post or others, please do pass them along.

Correcting Errors and Making Changes

[This is the fourth post in a series on best practices for government datasets by Harlan Yu and me. (previous posts: 1, 2, 3)]

Even cautiously edited datasets sometimes contain errors, and even meticulously produced schemas require refinement as circumstances change. While errors or changes create inconvenience for developers, most developers appreciate and prepare for their inevitability. Agencies should strive to do the same. A well-developed strategy for fixes and changes can ease their burden on both developers and agencies.

When agencies release data, developers ideally will interact with it in creative new ways. Given datasets containing megabytes to gigabytes of data, novel uses will reveal previously unnoticed errors. Knowledge of these errors benefits the agency as well as other developers using the data, so agencies should take steps to encourage error reporting. Labels in a dataset allow developers to specify errors efficiently and unambiguously. An easy-to-find channel for reporting errors, such as a prominently provided email address or web form, is also critical. Tracking down the contact information of the person responsible for a dataset can be difficult, and a well-known channel reduces this barrier to feedback.

Upon learning of an issue in a dataset, an agency should correct the problem and release the corrected dataset in a timely manner. An important fact to keep in mind when correcting data is that numerous developers may have already downloaded and begun using the old flawed version. For these developers, even a minor modification can cause major issues if not done carefully. Agencies should think about two things: how they will make developers aware that the dataset has been modified and how they will change the dataset itself. The first point is sometimes ignored in spite of its importance. Not only should datasets contain version information, but agencies should also notify developers when the data that they rely on has changed. In particular, agencies should allow developers to subscribe to an email list or an RSS feed for specific datasets that details updates in a well-structured manner. These updates should clearly specify the dataset and version affected, a location where the updated dataset can be found, and a description of the changes to the dataset. When possible, these changes should be specified via a formal, structured description—for example, a diff output—as well as a brief prose explanation.
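
As one illustration of a formal change description, a diff between two versions of a text-based dataset can be generated mechanically; Python's standard difflib module produces a unified diff (the file names here are hypothetical):

  import difflib

  with open("dataset_v1.csv") as old, open("dataset_v2.csv") as new:
      changes = difflib.unified_diff(old.readlines(), new.readlines(),
                                     fromfile="dataset_v1.csv", tofile="dataset_v2.csv")

  print("".join(changes))   # publishable alongside a brief prose explanation of the change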

Correction of dataset contents should proceed cautiously. Suppose that an application allows users to comment on parts of a document. If labels in a dataset are not maintained consistently across versions, the developer may need to painstakingly map comments from the old data to the corresponding parts of the new dataset. Issues like this can be mitigated through several practices. First, an agency should seek to preserve labels across versions of a dataset when possible (alternatively, in some cases an agency might wish to change the labels but provide a mapping to assist developers). For example, a dataset might aggregate numerous documents, and a minor change in one document should not necessarily change the labels for the other documents. Recall the side note from our previous post that labels should be separate from ordering information. Corrections to a dataset may add, remove, or reorder items. Detaching order from labels can help agencies ensure label consistency across dataset versions. In addition, the last post and its comments discussed whether agencies should provide a label that is separate from its internally used agency label. This separation allows labels to remain consistent even when Subsection X becomes Section Y based on the internal agency labels. Note that these points about consistent labeling can be useful whenever a dataset could have multiple versions: for example, consistent labeling might be beneficial across various versions of a bill.

Similarly, the structure that agencies use for datasets, the locations where the datasets are hosted, and other details of a dataset sometimes must change. Suppose that an agency releases various statistics each month. When the agency is asked to provide a new statistic, the new data may necessitate changes to the XML schema. Alternatively, the agency may decide to host data at the address “http://www.agency.gov/YEAR/MONTH/data.xml” rather than “http://www.agency.gov/MONTH-YEAR/data.xml,” causing issues for automated tools that periodically check for and download new data. To reduce the adverse impact of these changes on developers, agencies should provide detailed notice of the changes as early as possible. Early notice gives developers time to modify their tools. These notifications can occur via an email list or RSS feed providing details of the changes in a clear, consistent format.

The possibility of changes and their impact on developers should be taken into account at all stages of the data production process. Suppose an agency adds an element to a schema that specifies a unique individual, but the schema may someday need to specify a corporation instead. Although the agency should not speculatively add unnecessary elements to the schema, it should be mindful of possible changes when designing the rest of the schema. Various design choices may minimize the impact of a change if necessary later. Agencies should also avoid the urge to alter a schema dramatically each time it requires a minor change. A major overhaul—even when done to clean up the schema—may require equally dramatic changes in tools utilizing the data. To ensure that developers notice changes to XML schemas, both schema files and datasets should contain a prominent schema version number. If an agency changes the location where data is hosted, it should consider temporarily using aliases so that requests using old addresses automatically take you to the correct data. Once the old addresses are phased out, agencies should use a standard HTTP 404 status code to indicate that the requested data was not found at the specified location. Simply supplying a “Not Found” page without this standard code could make life harder for developers whose automated tools must instead parse this page.
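
For example, an automated update checker can only tell "no new data yet" apart from other problems if the server reports status codes honestly. A sketch using the third-party requests library and the hypothetical address scheme above:

  import requests

  url = "http://www.agency.gov/2010/03/data.xml"   # hypothetical monthly dataset address
  r = requests.get(url, allow_redirects=True)      # follows redirects from any old address aliases

  if r.status_code == 200:
      open("data.xml", "wb").write(r.content)      # new data is available; save it
  elif r.status_code == 404:
      print("No new data published yet")           # a true 404 needs no HTML page parsing
  else:
      print("Unexpected response:", r.status_code)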

When making changes, agencies should consider soliciting input directly from developers. Because the preferences of developers might not be obvious, this input can lead to choices that help developers without increasing the burden on agencies. In fact, developers may even come up with ideas that make life easier for an agency.

Our next and final post in this series will discuss a handful of additional issues for agencies to consider.