April 20, 2014

Pseudonyms: The Natural State of Online Identity

I’ve been writing recently about the problems that arise when you try to use cryptography to verify who is at the other end of a network connection. The cryptographic math works, but that doesn’t mean you get the identity part right.

You might think, from this discussion, that crypto by itself does nothing — that cryptographic security can only be bootstrapped from some kind of real-world identity verification. That’s the way it works for website certificates, where a certificate authority has to check your bona fides before it will issue you a certificate.

But this intuition turns out to be wrong. There is one thing that crypto can do perfectly, without any real-world support: providing pseudonyms. Indeed, crypto is so good at supporting pseudonyms that we can practically say that pseudonyms are the natural state of identity online.

To explain why this is true, I need to offer a gentle introduction to a basic crypto operation: digital signatures. Suppose John Doe (“JD”) wants to use digital signatures. First, JD needs to create a private cryptographic key, which he does by generating some random numbers and combining them according to a special geeky recipe. The result is a unique private key that only JD knows. Next, JD uses a certain procedure to determine the public key that corresponds to his private key. He announces the public key to everyone. The math guarantees that (1) JD’s public key is unique and corresponds to JD’s private key, and (2) a person who knows JD’s public key can’t figure out JD’s private key.

Now JD can make digital signatures. If JD wants to “sign” a certain message M, he combines M with JD’s private key in a special way, and the result is JD’s “signature on M”. Now anybody can verify the signature, using JD’s public key. Only JD can make the signature, because only JD knows JD’s private key; but anybody can verify the signature.
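To make the recipe concrete, here is a minimal sketch in Python of one signature scheme that needs nothing beyond random numbers and a hash function: Lamport one-time signatures. This is an illustrative choice, not the scheme that web servers actually use, and a key pair made this way should sign only a single message. The function names (keygen, sign, verify) are invented for this example.

```python
import hashlib
import secrets


def keygen():
    """Return (private_key, public_key) for a Lamport one-time signature."""
    # Private key: 256 pairs of random 32-byte secrets, one pair per digest bit.
    private = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    # Public key: the SHA-256 hash of every secret. Hashes are hard to invert,
    # so publishing them does not reveal the private key.
    public = [(hashlib.sha256(a).digest(), hashlib.sha256(b).digest()) for a, b in private]
    return private, public


def message_bits(message: bytes):
    """Hash the message and return the 256 bits of its SHA-256 digest."""
    digest = hashlib.sha256(message).digest()
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(256)]


def sign(private, message: bytes):
    """Reveal one secret from each pair, chosen by the matching digest bit."""
    return [private[i][bit] for i, bit in enumerate(message_bits(message))]


def verify(public, message: bytes, signature) -> bool:
    """Check that each revealed secret hashes to the published value."""
    if len(signature) != 256:
        return False
    bits = message_bits(message)
    return all(
        hashlib.sha256(revealed).digest() == public[i][bit]
        for i, (revealed, bit) in enumerate(zip(signature, bits))
    )


# JD creates a pseudonym (the public key) and signs a message with it.
private_key, public_key = keygen()
challenge = b"squeamish ossifrage"
signature = sign(private_key, challenge)
print(verify(public_key, challenge, signature))              # True
print(verify(public_key, b"some other message", signature))  # False
```

Nothing in the sketch asks for a name, an address, or permission from anyone: generating the key pair is all it takes.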

At no point in this process does JD tell anybody who he is — I called him “John Doe” for a reason. Indeed, JD’s public key is a perfect pseudonym: it conveys nothing about JD’s actual identity, yet it has a distinct “owner” whose presence can be verified. (“You’re really the person who created this public key? Then you should be able to make a signature on the message ‘squeamish ossifrage’ for me….”)

Using this method, anybody can make up a fresh pseudonym whenever they want. If you can generate random numbers and do some math (or have your computer do those things for you), then you can make a fresh pseudonym. You can make as many as you want, without needing to coordinate with anybody. This is all easy to do.

These methods, pseudonyms and signatures, are used even in cases where we want to verify somebody’s real-world identity. When you connect to (say) https://mail.google.com, Google’s web server gives you its public key — a pseudonym — along with a digital certificate that attests that that public key — that pseudonym — belongs to Google Inc. Binding public keys — pseudonyms — to real-world identities is tedious and messy, but of course this is often necessary in practice.

Online, identities are hard to manage. Pseudonyms are easy.

China, the Internet and Google: what I planned to say

In the run-up to and aftermath of Google’s decision yesterday to remove its Chinese search engine from China, I wrote two posts on my personal blog: Chinese netizens’ open letter to the Chinese government and Google and “One Google, One World; One China, No Google”.

Today, the Congressional-Executive Commission on China conducted a hearing titled Google and Internet Control in China: A Nexus Between Human Rights and Trade? They had originally invited me to testify in a similarly titled hearing, “China, the Internet and Google,” which was postponed and rescheduled twice: the first attempt was foiled by the Great Snowcalypse; the second attempt, scheduled for March 1st, was postponed again at the last minute for some reason that isn’t entirely clear. Meanwhile I had already gone and written my testimony, improved by very helpful input from the CITP community. Unfortunately, when they rescheduled the hearing they said I was no longer invited. They wanted the hearing to have different witnesses from recent related hearings in both the House and Senate. Given that I appeared in both hearings, it seems reasonable that they’d want to hear from some other people.

Given the effort that went into my testimony, however, and since it drills down in a lot more detail on China than my testimony for the other hearings, I think there is some value in my sharing it with the world. Here is the PDF and here it is as a web page. Some highlights:

From the introduction:

China is pioneering a new kind of Internet-age authoritarianism. It is demonstrating how a non-democratic government can stay in power while simultaneously expanding domestic Internet and mobile phone use. In China today there is a lot more give-and-take between government and citizens than in the pre-Internet age, and this helps bolster the regime’s legitimacy with many Chinese Internet users who feel that they have a new channel for public discourse. Yet on the other hand, as this Commission’s 2009 Annual Report clearly outlined, Communist Party control over the bureaucracy and courts has strengthened over the past decade, while the regime’s institutional commitments to protect the universal rights and freedoms of all its citizens have weakened.

Google’s public complaint about Chinese cyber-attacks and censorship occurred against this backdrop. It reflects a recognition that China’s status quo – at least when it comes to censorship, regulation, and manipulation of the Internet – is unlikely to improve any time soon, and may in fact continue to get worse.

Overview of Chinese Internet controls

Chinese government attempts to control online speech began in the late 1990’s with a focus on the filtering or “blocking” of Internet content. Today, the government deploys an expanding repertoire of tactics.

In other words, filtering is just one of many ways that the Chinese government limits and controls speech on the Internet. The full text then gives descriptions and explanations of the other tactics, but in brief they include:

  • deletion or removal of content at the source
  • device and local-level controls
  • domain name controls
  • localized disconnection or restriction
  • self-censorship due to surveillance
  • cyber-attacks
  • government “astro-turfing” and “outreach”
  • targeted police intimidation

I then describe a number of efforts by Chinese netizens to push back against these tactics, which include (see the full text for further explanation):

  • informal anti-censorship support networks
  • distributed web-hosting assistance networks
  • crowdsourced “opposition research”
  • preservation and redistribution of censored content
  • humorous “viral” protests
  • public persuasion efforts

I end with a set of recommendations. Once again, see the full text for explanations, but here is the basic list:

  • anti-censorship tools – including outreach and education in their use
  • anonymity and security tools – to help people better defend against cyber-attacks, spyware, and surveillance
  • platforms and networks for the capture, storage, and redistribution of content that gets deleted from domestic social networking and publishing services
  • support for “opposition research” – remember the Chinese netizens who deconstructed Green Dam?
  • corporate responsibility – see Global Network Initiative, but also appropriate legislation if American and other Western Internet companies fail to accept the idea that they have some obligations as far as free expression and privacy are concerned
  • private right of action – so that Chinese victims can sue U.S. companies in U.S. courts
  • incentives for innovation by the private sector that helps Chinese Internet users access blocked sites as well as protect themselves from attacks and surveillance.

My conclusion:

Many of China’s 384 million Internet users are engaged in passionate debates about their communities’ problems, public policy concerns, and their nation’s future. Unfortunately these public discussions are skewed, blinkered, and manipulated – thanks to political censorship and surveillance. The Chinese people are proud of their nation’s achievements and generally reject critiques by outsiders even if they agree with some of them. A democratic alternative to China’s Internet-age authoritarianism will only be viable if it is conceived and built by the Chinese people from within. In helping Chinese “netizens” conduct an un-manipulated and un-censored discourse about their future, the United States will not be imposing its will on the Chinese people, but rather helping the Chinese people to take ownership over their own future.

CITP is a Google Summer of Code 2010 Mentoring Organization

The Google Summer of Code program provides student stipends for summer work on open source projects. CITP is thrilled to have been chosen as a mentoring organization for 2010, meaning that students will be working on some CITP projects this summer. We think that these projects are very interesting, and potential participants now have the opportunity to propose their ideas for what they’d like to work on. Applications are accepted from March 29 to April 9.

You can browse our list of project ideas, read our overall description, and apply here.

Side-Channel Leaks in Web Applications

Popular online applications may leak your private data to a network eavesdropper, even if you’re using secure web connections, according to a new paper by Shuo Chen, Rui Wang, XiaoFeng Wang, and Kehuan Zhang. (Chen is at Microsoft Research; the others are at Indiana.) It’s a sobering result — yet another illustration of how much information can be leaked by ordinary web technologies. It’s also really clever.

Here’s the background: Secure web connections encrypt traffic so that only your browser and the web server you’re visiting can see the contents of your communication. Although a network eavesdropper can’t understand the requests your browser sends, nor the replies from the server, it has long been known that an eavesdropper can see the size of the request and reply messages, and that these sizes sometimes leak information about which page you’re viewing, if the request size (i.e., the size of the URL) or the reply size (i.e., the size of the HTML page you’re viewing) is distinctive.

The new paper shows that this inference-from-size problem gets much, much worse when pages are using the now-standard AJAX programming methods, in which a web “page” is really a computer program that makes frequent requests to the server for information. With more requests to the server, there are many more opportunities for an eavesdropper to make inferences about what you’re doing — to the point that common applications leak a great deal of private information.

Consider a search engine that autocompletes search queries: when you start to type a query, the search engine gives you a list of suggested queries that start with whatever characters you have typed so far. When you type the first letter of your search query, the search engine page will send that character to the server, and the server will send back a list of suggested completions. Unfortunately, the size of that suggested completion list will depend on which character you typed, so an eavesdropper can use the size of the encrypted response to deduce which letter you typed. When you type the second letter of your query, another request will go to the server, and another encrypted reply will come back, which will again have a distinctive size, allowing the eavesdropper (who already knows the first character you typed) to deduce the second character; and so on. In the end the eavesdropper will know exactly which search query you typed. This attack worked against the Google, Yahoo, and Microsoft Bing search engines.
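The letter-by-letter inference can be sketched in a few lines of Python. Everything in it is a toy assumption: the QUERIES corpus, the suggestions function standing in for the server, and the use of JSON length as a stand-in for the size of the encrypted reply. The sketch also assumes the eavesdropper can send the same prefixes to the service and record how large each reply is. The real attack in the paper works from measurements of actual encrypted traffic.

```python
import json
import string

# Hypothetical suggestion corpus held by the toy "search engine".
QUERIES = [
    "aspirin dosage",
    "asthma symptoms",
    "back pain relief",
    "bankruptcy lawyer",
    "bad credit loan",
    "cancer screening",
    "car insurance quotes",
]
ALPHABET = string.ascii_lowercase + " "


def suggestions(prefix):
    """Toy server: return every stored query that starts with the prefix."""
    return [q for q in QUERIES if q.startswith(prefix)]


def reply_size(prefix):
    """Size of the reply in bytes. Encryption hides the content of the
    reply, but in this sketch we assume it leaves the length visible."""
    return len(json.dumps(suggestions(prefix)).encode("utf-8"))


def recover_prefixes(observed_sizes):
    """Extend each candidate prefix by one character per observed reply,
    keeping only the extensions whose reply size matches what was seen."""
    candidates = [""]
    for size in observed_sizes:
        candidates = [
            prefix + ch
            for prefix in candidates
            for ch in ALPHABET
            if reply_size(prefix + ch) == size
        ]
    return candidates


# The victim types "back" one letter at a time; the eavesdropper sees only sizes.
victim_query = "back"
observed = [reply_size(victim_query[:n]) for n in range(1, len(victim_query) + 1)]
print(observed)                    # [60, 60, 20, 20]
print(recover_prefixes(observed))  # ['back']
```

With a corpus this small the sizes happen to be unique at every step; when two prefixes produce replies of the same size, the function simply carries both candidates forward until a later keystroke separates them.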

Many web apps that handle sensitive information seem to be susceptible to similar attacks. The researchers studied a major online tax preparation site (which they don’t name) and found that it leaks a fairly accurate estimate of your Adjusted Gross Income (AGI). This happens because the exact set of questions you have to answer, and the exact data tables used in tax preparation, will vary based on your AGI. To give one example, there is a particular interaction relating to a possible student loan interest calculation that only happens if your AGI is between $115,000 and $145,000, so the presence or absence of the distinctively-sized message exchange relating to that calculation tells an eavesdropper whether your AGI is in that range. By assembling a set of clues like this, an eavesdropper can get a good fix on your AGI, plus information about your family status, and so on.
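Here is a rough sketch of how such clues combine. Only the $115,000 to $145,000 range comes from the description above; the second clue is hypothetical and invented purely to show the narrowing, and the function names are made up for this example.

```python
# Each clue is (low, high, observed). observed=True means the eavesdropper saw
# the message exchange that only occurs when low <= AGI <= high.
CLUES = [
    (115_000, 145_000, True),   # the student loan interest exchange was seen
    (130_000, 200_000, False),  # hypothetical second exchange, not seen
]


def consistent(agi, clues):
    """True if this AGI would have produced exactly the observed exchanges."""
    return all((low <= agi <= high) == observed for low, high, observed in clues)


def narrow_agi(clues, max_agi=300_000, step=1_000):
    """Return the lowest and highest candidate AGI that fit every clue,
    checking candidates in $1,000 steps, or None if nothing fits."""
    matches = [agi for agi in range(0, max_agi + 1, step) if consistent(agi, clues)]
    return (matches[0], matches[-1]) if matches else None


print(narrow_agi(CLUES))  # (115000, 129000)
```

Each additional clue can only shrink the set of candidates, which is why a handful of distinctively sized exchanges can add up to a fairly tight estimate.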

For similar reasons, a major online health site leaks information about which medications you are taking, and a major investment site leaks information about your investments.

The paper goes on to consider possible mitigations. The most obvious mitigation is to add padding to messages so that their sizes are not so distinctive — for example, every message might be padded to make its size a multiple of 256 bytes. This turns out to be less effective than you might expect — significant information can still leak even if messages are generously padded — and the padded messages are slower and more expensive to transmit.
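A small sketch of the padding idea, assuming the simplest possible rule (round every message up to the next multiple of 256 bytes). The sample sizes are arbitrary numbers chosen for illustration, not measurements from the paper.

```python
def padded_size(size, block=256):
    """Round size up to the next multiple of block bytes."""
    return -(-size // block) * block


for raw in (37, 44, 60, 250, 300, 900):
    print(raw, "->", padded_size(raw))
# 37 -> 256
# 44 -> 256
# 60 -> 256
# 250 -> 256
# 300 -> 512
# 900 -> 1024
```

The small replies now look identical, but the larger ones still fall into different buckets, and every padded message costs extra bytes on the wire. That is at least consistent with the paper's finding that padding helps less than one might hope.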

We don’t know which sites the researchers studied, but it seems like a safe bet that most, if not all, of the sites in these product categories have similar problems. It’s important to keep these attacks in perspective — bear in mind that they can only be carried out by someone who can eavesdrop on the network between you and the site you’re visiting.

It’s becoming increasingly clear that securing web-based applications is very difficult, and that the basic tools for developing web apps don’t do much to help. The industry, and researchers, will be struggling with web app security issues for years to come.

Domain Names Can't Defend Themselves

Today, the Kentucky Supreme Court handed down an opinion in the saga of Kentucky vs. 141 Domain Names (described a while back here on this blog). Here’s the opinion.

This case is fascinating. A quick recap: Kentucky attempted a property seizure of 141 domain names allegedly involved in gambling on the theory that the domain names themselves constituted “gambling devices” under Kentucky law and were therefore illegal. The state held a forfeiture hearing where anyone with an interest in the “property” could show up to defend their interest in the property; otherwise, the State would order the registrars to transfer “ownership” of the domain names to Kentucky. No individual claiming that they own one of the domain names showed up. Litigation began when two industry associations (iMEGA and IGC) claimed to represent unnamed persons who owned these domain names (and another lawyer showed up during litigation claiming representation of one specific domain name).

The subsequent litigation gets a bit complicated; suffice it to say that the issue of standing was what got to the KY Supreme Court: could an association that claimed it represented an owner of a domain name affected in this action properly represent this owner in court without identifying that owner and showing that the owner did indeed own an affected domain name?

The Kentucky Supreme Court said no: there needs to be at least one identified individual owner who will suffer harm before the association can stand in that owner’s stead. The court ruled:

Due to the incapacity of domain names to contest their own seizure and the inability of iMEGA and IGC to litigate on behalf of anonymous registrants, the Court of Appeals is reversed and its writ is vacated.

And on the issue of whether a piece of property can represent itself:

“An Internet domain name does not have an interest in itself any more than a piece of land is interested in its own use.”

Anyway, it would seem that the options for next steps include 1) identifying at least one owner that would suffer harm, then moving back up to the Supreme Court (given that merits had been argued at the Appeals level), or 2) deciding that the anonymity of domain name ownership in this case is more important than the fight over this very weird seizure of domain names.

As a non-lawyer, I wonder if it’s possible to represent an owner as a John Doe with an affidavit of ownership of an affected domain name submitted.

UPDATE (2010-03-19T00:07:07 EDT): Check the comments for why a John Doe strategy won’t work when the interest in anonymity is to avoid personal liability rather than free expression.

A weird bonus for people who have read this far: if I open the PDF of the opinion on my Mac in Preview.app or Skim.app (two PDF readers), the “SPORTSBOOK.COM” entry in the listing of the parties on the first page is hyperlinked. However, I don’t see this in Adobe Acrobat Pro or Reader. Seems like the KY Supreme Court is, likely inadvertently, linking to one of the 141 domain names. Of course, Preview.app and Skim.app might be sharing the same library that causes this one URL to be linked… I’m not a good enough PDF sleuth to figure it out.

Round 2 of the PACER Debate: What to Expect

The past year has seen an explosion of interest in free access to the law. Indeed, something of a movement appears to be coalescing around the issue, due in no small part to the growing Law.gov effort (see the latest list of events). One subset of this effort is our work on PACER, the online document access system for the federal courts. We contend that access to electronic court records should be free (see posts from me, Tim, and Harlan). Our RECAP project helps make some of these documents more accessible, and has gained adoption far above our expectations. That being said, RECAP doesn’t solve the fundamental problem: the federal government needs to publish the full public record for free online. Today, this argument came from an unlikely source, the FCC’s National Broadband Plan.

RECOMMENDATION 15.1: the primary legal documents of the federal government should be free and accessible to the public on digital platforms. [...]

- For the Judicial branch, this should apply to all judicial opinions.

[...] Finally, all federal judicial decisions should be accessible for free and made publicly available to the people of the United States. Currently, the Public Access to Court Electronic Records system charges for access to federal appellate, district and bankruptcy court records.[7] As a result, U.S. federal courts pay private contractors approximately $150 million per year for electronic access to judicial documents.[8] [Steve note: The correct figure is $150m over 10 years. However it is quite possible that the federal government as a whole spends $150m or more per year for access to case materials.] While the E-Government Act has mandated that this system change so that this information is as freely available as possible, little progress has been made.[9] Congress should consider providing sufficient funds to publish all federal judicial opinions, orders and decisions online in an easily accessible, machine-readable format.

[7] See Public Access To Court Electronic Records—Overview, http://pacer.psc.uscourts.gov/pacerdesc.html (last visited Jan. 7, 2010).
[8] Carl Malamud, President and CEO, Public.Resource.Org, By the People, Address at the Gov 2.0 Summit, Washington, D.C. 25 (Sept. 10, 2009), available at http://resource.org/people/3waves_cover.pdf
[9] See Letter from Sen. Joseph I. Lieberman to Carl Malamud, President and CEO, Public.Resource.Org (Oct. 13, 2009), available at http://bulk.resource.org/courts.gov/foia/gov.senate.lieberman_20091013_from.pdf

This issue is outside of the Commission’s direct jurisdiction, but the Broadband Plan is intended as a blueprint for the federal government as a whole. In that context, the notion of ensuring that primary legal materials are available for free online fits perfectly with a broader effort to make government digitally accessible. In a similar vein, a bill was introduced today by Rep. Israel. The Public Online Information Act, backed by the Sunlight Foundation, creates a new federal advisory committee to advise all three branches of government on how to make government information available online for free.

To establish an advisory committee to issue nonbinding government-wide guidelines on making public information available on the Internet, to require publicly available Government information held by the executive branch to be made available on the Internet, to express the sense of Congress that publicly available information held by the legislative and judicial branches should be available on the Internet, and for other purposes.

These two developments are the first of what I expect to be many announcements in the coming months, coming from places like the transparency caucus. These announcements will share a theme — there is a growing mandate for universal free access to government information, and judicial information is a key component of that mandate. These requirements will increasingly go to the heart of full free access to the public record, and will reveal the discrepancies between different branches in this regard.

The FCC’s language doesn’t quite get everything right. Most notably, the language focuses on opinions even though there are other components of the record that are key to the public’s understanding of the law. Opinions on PACER are already theoretically free, but the kludgy system for accessing them doesn’t include all of the opinions, isn’t indexable by search engines, and only gives a minimal amount of information about the case that each is a part of. Furthermore, the docket text required to understand the context, and the search functionality required to find the opinions both require a fee. Subsequent calls for free access to case materials will have to be more holistic than the opinions-only language of the Broadband Report.

The POIA language is also a step forward. A federal advisory committee is a good thing in the context of a branch that is more accustomed to the adversarial process than notice-and-comment. However, we will need much more concrete requirements before we will have achieved our goals.

In the context of these announcements, the Administrative Office of the Courts made its own announcement today. The Judicial Conference has voted in favor of two measures that make incremental improvements on the current pay-wall model of access to PACER.

  • Adjust the Electronic Public Access fee schedule so that users are not billed unless they accrue charges of more than $10 of PACER usage in a quarterly billing cycle, in effect quadrupling the amount of data available without charge. Currently, users are not billed until their accounts total at least $10 in a one-year period.
  • Approve a pilot in up to 12 courts to publish federal district and bankruptcy court opinions via the Government Printing Office’s Federal Digital System (FDsys) so members of the public can more easily search across opinions and across courts.

These are minor tweaks on a fundamentally limited system. Don’t get me wrong — a world with these changes is better than a world without. It is slightly easier to avoid spending more than $10 in a given quarter than in a given year, but it’s nevertheless likely that you will do so unless you know exactly what you are looking for and retrieve only a few documents. It’s also good to establish precedent for GPO publishing case materials, but that doesn’t require a limited trial that could end in bureaucratic quagmire. The GPO can handle publishing many documents, and any reasonably qualified software engineer could figure out how to deliver them in short order. What’s more, the courts could provide universal free public access today, with zero engineering work: offer a single PACER login that is never billed or, better yet, just stop billing all accounts.
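The arithmetic behind “quadrupling” is simply four quarterly allowances of $10 instead of one yearly allowance of $10. A quick sketch of the two billing rules as described above, under the assumption (mine, not stated in the announcement) that once a threshold is crossed the full accrued amount is billed, and with made-up usage figures in whole dollars:

```python
def billed_current(quarterly_charges):
    """Current rule: nothing is billed until the yearly total reaches $10."""
    total = sum(quarterly_charges)
    return total if total >= 10 else 0


def billed_proposed(quarterly_charges):
    """Proposed rule: a quarter is billed only if its charges exceed $10."""
    return sum(q for q in quarterly_charges if q > 10)


light_user = [9, 9, 9, 9]         # hypothetical: $9 of usage every quarter
one_busy_quarter = [12, 0, 0, 0]  # hypothetical: one short burst of research

print(billed_current(light_user), billed_proposed(light_user))              # 36 0
print(billed_current(one_busy_quarter), billed_proposed(one_busy_quarter))  # 12 12
```

The steady light user comes out ahead, but anyone who pulls more than a few documents in a single quarter pays exactly what they paid before.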

The next round of the PACER debate will be over whether or not we make a fundamental change in access to federal court records, or if we concede minor tweaks and call it a day.

Global Internet Freedom and the U.S. Government

Over the past two weeks I’ve testified in both the Senate and the House on how the U.S. should advance “Internet freedom.” I submitted written testimony for both hearings, which can be downloaded in PDF form here and here. Full transcripts will become available eventually, but meanwhile you can click here to watch the Senate video and here to watch the House video. In both hearings I advocated a combination of corporate responsibility through the Global Network Initiative, backed up by appropriate legislation given that some companies seem reluctant to hold themselves accountable voluntarily; revision of export controls and sanctions; and finally, funding and support for tools, technologies, and activism platforms that will counteract suppression of online speech.

Lawmakers are moving forward to support research and technical development. On February 4th, Rep. David Wu [D-OR] and Rep. Frank Wolf [R-VA] introduced the Internet Freedom Act of 2010, which would establish an Internet Freedom Foundation. The bill’s core section reads:

(a) ESTABLISHMENT OF THE INTERNET FREEDOM FOUNDATION. – The National Science Foundation shall establish the Internet Freedom Foundation. The Internet Freedom Foundation shall –
(1) award competitive, merit-reviewed grants, cooperative agreements, or contracts to private industry, universities, and other research and development organizations to develop deployable technologies to defeat Internet suppression and censorship; and
(2) award incentive prizes to private industry, universities, and other research and development organizations to develop deployable technologies to defeat Internet suppression and censorship.

(b) LIMITATION ON AUTHORITY. – Nothing in this Act shall be interpreted to authorize any action by the United States to interfere with foreign national censorship in furtherance of law enforcement aims that are consistent with the International Covenant on Civil and Political Rights.

Whoever runs this foundation will have their work cut out for them in sorting out its strategies, goals, and priorities – and dealing with a great deal of thorny politics. The Falun Gong-affiliated Global Internet Freedom Consortium have been arguing that they were unfairly passed over for recent State Department grants, which were given to other groups working on different tools that help you get around Internet blocking – “circumvention tools,” as the technical community call them. For the past year they’ve been engaged in an aggressive campaign to lobby Congress and the media to ensure they’ll get a slice of future funds. (For examples of the fruits of their media lobbying effort, see here, here, and here.)

But the unfortunate bickering over who deserves government funding more than whom has distracted attention from the larger question of whether circumvention on its own is sufficient to defeat Internet censorship and suppression of online speech. In his recent blog post, Internet Freedom: Beyond Circumvention, my friend and former colleague Ethan Zuckerman warns against an over-focus on circumvention: “We can’t circumvent our way around internet censorship.” He summarizes his main points:

- Internet circumvention is hard. It’s expensive. It can make it easier for people to send spam and steal identities.
- Circumventing censorship through proxies just gives people access to international content – it doesn’t address domestic censorship, which likely affects the majority of people’s internet behavior.
- Circumventing censorship doesn’t offer a defense against DDoS or other attacks that target a publisher.

While circumvention tools remain worthy of support as part of a basket of strategies, I agree with Ethan that circumvention is never going to be the silver bullet that some people make it out to be, for all the reasons he outlines in his blog post, which deserves to be read in full. As Ethan points out, as I pointed out in my own testimony, and as my research on Chinese blog censorship published last year has demonstrated, circumvention does nothing to help you access content that has been removed from the Internet completely – which is the main way that the Chinese government now censors the Chinese-language Internet. In my testimony I suggested several other tools and activities that require an equal amount of focus:

  • Tools and training to help people evade surveillance, detect spyware, and guard against cyber-attacks.
  • Mechanisms to preserve and re-distribute censored content in various languages.
  • Platforms through which citizens around the world can share “opposition research” about what different governments are doing to suppress online speech, and collaborate across borders to defeat censorship, surveillance, and attacks in ad-hoc, flexible ways as new problems arise during times of crisis.

As Ethan puts it:

- We need to shift our thinking from helping users in closed societies access blocked content to helping publishers reach all audiences. In doing so, we may gain those publishers as a valuable new set of allies as well as opening a new class of technical solutions.

- If our goal is to allow people in closed societies to access an online public sphere, or to use online tools to organize protests, we need to bring the administrators of these tools into the dialog. Secretary Clinton suggests that we make free speech part of the American brand identity – let’s find ways to challenge companies to build blocking resistance into their platforms and to consider internet freedom to be a central part of their business mission. We need to address the fact that making their platforms unblockable has a cost for content hosts and that their business models currently don’t reward them for providing service to these users.

Which brings us to the issue of corporate responsibility for free expression and privacy on the Internet. I’ve been working with the Global Network Initiative for the past several years to develop a voluntary code of conduct centered on a set of basic principles for free expression and privacy based on U.N. documents like the Universal Declaration of Human Rights, the International Covenant on Civil and Political Rights, and other international legal conventions. It is bolstered by a set of implementation guidelines and evaluation and accountability mechanisms, supported by a multi-stakeholder community of human rights groups, investors, and academics all dedicated to helping companies do the right thing and avoid making mistakes that restrict free expression and privacy on the Internet.

So far, however, only Google, Microsoft, and Yahoo have joined. Senator Durbin’s March 2nd Senate hearing focused heavily on the question of why other companies have so far failed to join, what it would take to persuade them to join, and, if they don’t join, whether laws should be passed that induce greater public accountability by companies on free expression and privacy. He has written letters to 30 U.S. companies in the information and communications technology (ICT) sector. He expressed great displeasure in the hearing with most of their responses, and further disappointment that no company (other than Google, which is already in the GNI) even had the guts to send a representative to testify in the hearing. Durbin announced that he will “introduce legislation that would require Internet companies to take reasonable steps to protect human rights or face civil or criminal liability.” It is my understanding that his bill is still under construction, and it’s not clear when he will introduce it (he’s been rather preoccupied with healthcare and other domestic issues, after all). Congressman Howard Berman (D-CA), who convened Wednesday’s House Foreign Affairs Committee hearing, is also reported to be considering his own bill. Rep. Chris Smith (R-NJ), the ranking Republican at that hearing, made a plug for the Global Online Freedom Act of 2009, a somewhat revised version of a bill that he first introduced in 2006.

I said at the hearing that the GNI probably wouldn’t exist if it hadn’t been for the threat of Smith’s legislation. I was not, however, asked my opinion on GOFA’s specific content. Since GOFA’s 2006 introduction I have critiqued it a number of times (see for example here, here, and here). As the years have passed – especially in the past year as the GNI got up and running yet most companies have still failed to engage meaningfully with it  – I have come to see the important role legislation could play in setting industry-wide standards and requirements, which companies can then tell governments they have no choice but to follow. That said, I continue to have concerns about parts of GOFA’s approach. Here is a summary of the current bill written by the Congressional Research Service (I have bolded the parts of concern):

5/6/2009–Introduced.
Global Online Freedom Act of 2009 – Makes it U.S. policy to: (1) promote the freedom to seek, receive, and impart information and ideas through any media; (2) use all appropriate instruments of U.S. influence to support the free flow of information without interference or discrimination; and (3) deter U.S. businesses from cooperating with Internet-restricting countries in effecting online censorship. Expresses the sense of Congress that: (1) the President should seek international agreements to protect Internet freedom; and (2) some U.S. businesses, in assisting foreign governments to restrict online access to U.S.-supported websites and government reports and to identify individual Internet users, are working contrary to U.S. foreign policy interests. Amends the Foreign Assistance Act of 1961 to require assessments of electronic information freedom in each foreign country. Establishes in the Department of State the Office of Global Internet Freedom (OGIF). Directs the Secretary of State to annually designate Internet-restricting countries. Prohibits, subject to waiver, U.S. businesses that provide to the public a commercial Internet search engine, communications services, or hosting services from locating, in such countries, any personally identifiable information used to establish or maintain an Internet services account. Requires U.S. businesses that collect or obtain personally identifiable information through the Internet to notify the OGIF and the Attorney General before responding to a disclosure request from an Internet-restricting country. Authorizes the Attorney General to prohibit a business from complying with the request, except for legitimate foreign law enforcement purposes. Requires U.S. businesses to report to the OGIF certain Internet censorship information involving Internet-restricting countries. Prohibits U.S. businesses that maintain Internet content hosting services from jamming U.S.-supported websites or U.S.-supported content in Internet-restricting countries. Authorizes the President to waive provisions of this Act: (1) to further the purposes of this Act; (2) if a country ceases restrictive activity; or (3) if it is the national interest of the United States.

My biggest concern has to do with the relationship GOFA would create between U.S. companies and the U.S. Attorney General. If the AG is made arbiter of whether content or account information requested by local law enforcement is for “legitimate law enforcement purposes” or not, that means the company has to share the information – which in the case of certain social networking services may include a great deal of non-public information about the user, who his or her friends are, and what they’re saying to each other in casual conversation. Letting the U.S. AG review the insides of this person’s account would certainly violate that user’s privacy. It also puts companies at a competitive disadvantage in markets where users – even those who don’t particularly like their own government – would consider an overly close relationship between a U.S. service provider and the U.S. government not to be in their interest. Take this hypothetical situation for example: An Egyptian college student decides to use a social networking site to set up a group protesting the arrest and torture of his brother. The Egyptian government demands the group be shut down and all account information associated with it handed over. In order to comply with GOFA, the company shares this student’s account information and all content associated with that protest group with the U.S. Attorney General. What is the oversight to ensure that this information is not retained and shared with other U.S. government agencies interested in going on a fishing expedition to explore friendships among members of different Egyptian opposition groups? Why should we expect that user to be ok with such a possibility?

Another difficult issue to get right – which gets even harder with the advent of cloud computing – is the question of where user data is physically housed. The Center for Democracy and Technology (PDF), Jonathan Zittrain, and others have discussed some of the regulatory difficulties of personally identifiable information and its location. In 2008 Zittrain wrote:

As Internet law rapidly evolves, countries have repeatedly and successfully demanded that information be controlled or monitored, even when that information is hosted outside their borders. Forcing US companies to locate their servers outside IRCs [Internet Restricting Countries] would only make their services less reliable; it would not make them less regulable.

If the goal of GOFA is to discourage US companies from violating human rights, then it will probably be successful. But if the goal of the Act is to make the Internet more free and more safe, and not just push rights violations on foreign companies, then more must be done.

Then there is the problem of Internet Restricting Country designations themselves. I have long argued that it is problematic to divide the world into “internet restricting countries” and countries where we can assume everything is just fine, not to worry, no human rights concerns present. First of all, I think that the list itself is going to quickly turn into a political and diplomatic football subject to huge amounts of lobbying and politics, which will make it very difficult to add new countries to the list. Secondly, regimes can change fast: in between annual revisions of the list you can have a coup or a rigged election whose victors demand that companies hand over dissident account information and censor political information, but companies are off the hook – having “done nothing illegal.” Finally, while I am not drawing moral equivalence between Italy and Iran, I do believe there is no country on earth, including the United States, where companies are not under pressure by government agencies to do things that arguably violate users’ civil rights. Policy that acknowledges this honestly is less likely to hurt U.S. companies in many parts of the world where the last thing they need is for people to be able to provide “documentary proof” that they are extensions of the U.S. government’s geopolitical agendas.

Therefore a more effective, ethically consistent and less hypocritical approach to the three problems I’ve described above would be to codify strict global privacy standards absolutely everywhere U.S. companies operate. Companies should be required by law to notify all users anywhere in the world in a clear, culturally and linguistically understandable way (not by trained lawyers but by normal people), exactly how and where their personally-identifying information is being stored and used and who has access to it under what circumstances. If users are better informed about how their data is being used, they can use better judgment about how or whether to use different commercial services – and seek more secure alternatives when necessary, perhaps even using some of the new tools and platforms run by non-profit activist organizations that Congress is hoping to fund. Congress could further bolster the privacy of global users of U.S. services by adopting something akin to the Council of Europe Privacy Convention.

Regarding censorship: again, as the Internet evolves further with semi-private social networking sites and mobile services we need to make sure that the information companies are required to share with the U.S. government doesn’t end up violating user privacy. I am doubtful that government agencies in some of the democracies unlikely to be put on the “internet restricting countries” list can really be trusted not to abuse the systems of censorship and intermediary liability that a growing number of democracies are implementing in the name of legitimate law enforcement purposes. Thus on censorship I also prefer global standards. There is real value in making companies retain internal records of the censorship requests that they receive all around the world in the event of a challenge in U.S. court regarding the lawfulness of a particular act of censorship – a private right of action in U.S. court which GOFA or its equivalent would potentially enable. It’s also good to make companies establish clear and uniform procedures for how they handle censorship requests, so that they can prove if challenged in court that they are only responding to requests made in writing through official legal channels, rather than responding to requests that have no basis even in local law, despite claiming vaguely to the public that “we are only following local law.” Companies should be required to exercise maximum transparency with users about what is being censored, at whose behest, and according to which law exactly. Congress could, for example, mandate that the Chilling Effects Clearinghouse mechanism or something similar should be utilized globally for all content takedowns.
(Originally posted at my blog, RConversation.)


Netflix Cancels the Netflix Prize 2

Today, Netflix announced it is canceling its plans for a second Netflix Prize contest, one that reportedly would have involved the release of more information than the first. As I argued earlier, I feared that the new contest would have put the supposedly private movie viewing and rating habits of Netflix customers at great risk, and I applaud Netflix for making a very responsible decision. No doubt, pressure from the private lawsuit and FTC investigation helped Netflix make up its mind, and both are reportedly going away as a result of today’s action.


Best Practices for Government Datasets: Wrap-Up

[This is the fifth and final post in a series on best practices for government datasets by Harlan Yu and me. (previous posts: 1, 2, 3, 4)]

For our final post in this series, we’ll discuss several issues not touched on by earlier posts, including data signing and the use of certain non-text file formats. The relatively brief discussions of these topics should not be interpreted as an indicator of their importance. The topics simply did not fit cleanly into earlier posts.

One significant omission from earlier posts is the issue of data signing with digital signatures. Before discussing this issue, let’s briefly discuss what a digital signature is. Suppose that you want to email me an IOU for $100. Later, I may want to prove that the IOU came from you—it’s of little value if you can claim that I made it up. Conversely, you may want the ability to prove whether the document has been altered. Otherwise, I could claim that you owe me $100,000.

Digital signatures help in proving the origin and authenticity of data. These signatures require that you create two related big numbers, known as keys: a private signing key (known only by you) and a public verification key. To generate a digital signature, you plug the data and your signing key into a complicated formula. The formula spits out another big number known as a digital signature. Given the signature and your data, I can use the verification key to prove that the data came unmodified from you. Similarly, nobody can credibly sign modified data without your signing key, so you should be very careful to keep this key a secret.
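
To make the "two related big numbers" concrete, here is a toy sketch of the sign-and-verify cycle using textbook RSA in Python. Everything in it is an assumption made for illustration: the fixed primes, the unpadded formula, and the names sign and verify are ours, and the code is far too weak for real use. Real tools such as GnuPG use much larger random keys and hardened schemes.

```python
import hashlib

# Toy parameters for illustration only: two well-known Mersenne primes.
# Real keys are built from much larger, randomly generated primes.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                          # public modulus, shared with everyone
E = 65537                          # public verification key
D = pow(E, -1, (P - 1) * (Q - 1))  # private signing key (Python 3.8+)


def digest(data: bytes) -> int:
    """Reduce the data to a number smaller than N."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def sign(data: bytes, signing_key: int = D) -> int:
    """Plug the data and the signing key into the formula."""
    return pow(digest(data), signing_key, N)


def verify(data: bytes, signature: int, verification_key: int = E) -> bool:
    """Check a signature using only public information."""
    return pow(signature, verification_key, N) == digest(data)


iou = b"I owe you $100"
signature = sign(iou)
```

With these definitions, verify(iou, signature) succeeds, while the same signature presented alongside an altered IOU for $100,000 is rejected.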

Developers may want to ensure the authenticity of government data and to prove that authenticity to users. At first glance, the solution seems to be a simple application of digital signatures: agencies sign their data, and anyone can use the signatures to authenticate an agency’s data. In spite of their initially steep learning curve, tools like GnuPG provide straightforward file signing. In practice, the situation is more complicated. First, an agency must decide what data to sign. Perhaps a dataset contains numerous documents. Developers and other users may want signatures not only for the full dataset but also for individual documents in it.
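
One possible way to cover both the full dataset and its individual documents is to sign a single manifest that lists a hash of every document. This is our assumption for the sake of illustration, not an established agency practice, and the function names below are hypothetical. The manifest itself would then be signed with a tool such as GnuPG.

```python
import hashlib
import json


def build_manifest(documents: dict) -> bytes:
    """Map each document name to its SHA-256 digest, in a stable order."""
    digests = {
        name: hashlib.sha256(body).hexdigest()
        for name, body in documents.items()
    }
    # sort_keys gives the same bytes every time, so one signature
    # over the manifest stays valid for as long as the data does.
    return json.dumps(digests, sort_keys=True, indent=2).encode("utf-8")


def document_matches(manifest: bytes, name: str, body: bytes) -> bool:
    """Check one document against an (already authenticated) manifest."""
    digests = json.loads(manifest)
    return digests.get(name) == hashlib.sha256(body).hexdigest()


dataset = {
    "report-1.txt": b"First document in the dataset.",
    "report-2.txt": b"Second document in the dataset.",
}
manifest = build_manifest(dataset)
```

A developer who trusts the signature on the manifest can check any single document with document_matches, without downloading the rest of the dataset.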

Once an agency knows what to sign, it must decide who will perform the signing. Ideally, the employee producing the dataset would sign it immediately. Unfortunately, this solution requires all such employees to understand the signature tools and to know the agency’s signing key. Widespread distribution of the signing key increases the risk that it will be accidentally revealed. Therefore, a central party is likely to sign most data. Once data is signed, an agency must have a secure channel for delivering the verification key to consumers of the data, because users cannot confirm the authenticity of signed data without this key. While signing a given file with a given key may not be hard, the surrounding issues are trickier. We offer no simple solution here, but further discussion of this topic between government agencies, developers, and the public could be useful for all parties.

Another issue that earlier posts did not address is the use of non-text spreadsheet formats, including Microsoft Excel’s XLS format. These formats can sometimes be useful because they allow the embedding of formulas and other rich information along with the data. Unfortunately, these formats are far more complex than raw text formats, so they present a greater challenge for automated processing tools. A comma-separated value (CSV) file is a straightforward text format that contains values separated by line breaks and commas. It provides an alternative to complicated spreadsheet formats. For example, the medal count from the 2010 Winter Olympics in CSV would be:

  Country,Gold,Silver,Bronze,Total
  USA,9,15,13,37
  Germany,10,13,7,30
  Canada,14,7,5,26
  Norway,9,8,6,23
  ...

Fortunately, the release of data in one format does not preclude its release in another format. Most spreadsheet programs provide an option to save data in CSV form. Agencies should release spreadsheet data in a textual format like CSV by default, but an agency should feel free to also release the data in XLS or other formats.
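
To show how little code a developer needs to process CSV, here is a short sketch that reads the medal table above with Python's standard csv module. The variable and function names are ours, and the rows are the four shown in the example.

```python
import csv
import io

MEDALS_CSV = """Country,Gold,Silver,Bronze,Total
USA,9,15,13,37
Germany,10,13,7,30
Canada,14,7,5,26
Norway,9,8,6,23
"""


def read_medals(text):
    """Parse the CSV text into a list of dicts with integer medal counts."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (value if key == "Country" else int(value))
         for key, value in row.items()}
        for row in reader
    ]


medals = read_medals(MEDALS_CSV)
most_gold = max(medals, key=lambda row: row["Gold"])
```

Here most_gold is the row for Canada. Doing the same with an XLS file would require a third-party library that understands the spreadsheet format.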

Similarly, agencies will sometimes release large files or groups of files in a compressed or bundled format (for example, ZIP, TAR, GZ, BZ). In these cases, agencies should prominently specify where users can freely obtain software and instructions for extracting the data. Because so many means of compressing and bundling files exist, agencies should not presume that the necessary tools and steps are obvious from the data files themselves.
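
As one example of what such instructions might point to, the sketch below unpacks ZIP and TAR bundles using only Python's standard library. The function name extract_dataset and the file names are hypothetical. A tool handling archives from untrusted sources would also need to check member paths before extracting.

```python
import tarfile
import tempfile
import zipfile
from pathlib import Path


def extract_dataset(archive_path, destination):
    """Unpack a ZIP or TAR archive (TAR may be gzip- or bzip2-compressed)."""
    archive = str(archive_path)
    target = Path(destination)
    target.mkdir(parents=True, exist_ok=True)
    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as bundle:
            bundle.extractall(target)
    elif tarfile.is_tarfile(archive):
        # Mode "r" lets tarfile detect the compression on its own.
        with tarfile.open(archive, "r") as bundle:
            bundle.extractall(target)
    else:
        raise ValueError(f"Unrecognized archive format: {archive}")
    # Report what was extracted, relative to the destination folder.
    return sorted(
        p.relative_to(target).as_posix()
        for p in target.rglob("*") if p.is_file()
    )


# Offline demonstration: bundle one small file, then unpack it again.
with tempfile.TemporaryDirectory() as workdir:
    bundle_path = Path(workdir) / "medals.zip"
    with zipfile.ZipFile(bundle_path, "w") as bundle:
        bundle.writestr("medals.csv", "Country,Gold\nCanada,14\n")
    extracted = extract_dataset(bundle_path, Path(workdir) / "unpacked")
```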

The rules suggested throughout this series should be seen as best practices rather than hard-and-fast rules. We are still in the process of fleshing out several of these ideas ourselves, and exceptional cases sometimes justify exceptional treatment. In unusual cases, an agency may need to deviate from traditional best practices, but it should carefully consider (and perhaps document) its rationale for doing so. Rules are made to be broken, but they should not be broken for mere expedience.

Our hope is that this series will provide agencies with some points to consider prior to releasing data. Because of Data.gov and the increasing traction of openness and transparency initiatives, we expect to see many more datasets enter the public domain in the coming years. Some agencies will approach the release of bulk data with minimal previous experience. While this poses a challenge, it also presents an opportunity for committed agencies to institute good practices early, before bad habits and poor-quality legacy datasets can accumulate. When releasing new datasets, agencies will make numerous conscious and unconscious choices that impact developers. We hope to help agencies understand developers’ challenges when making these choices.

After gathering input from the community, we plan to create a technical report based on this series of posts. Thanks to numerous readers for insightful feedback; your comments have influenced and clarified our thoughts. If any FTT readers inside or outside of government have additional comments about this post or others, please do pass them along.


Correcting Errors and Making Changes

[This is the fourth post in a series on best practices for government datasets by Harlan Yu and me. (previous posts: 1, 2, 3)]

Even cautiously edited datasets sometimes contain errors, and even meticulously produced schemas require refinement as circumstances change. While errors or changes create inconvenience for developers, most developers appreciate and prepare for their inevitability. Agencies should strive to do the same. A well-developed strategy for fixes and changes can ease their burden on both developers and agencies.

When agencies release data, developers ideally will interact with it in creative new ways. Given datasets containing megabytes to gigabytes of data, novel uses will reveal previously unnoticed errors. Knowledge of these errors benefits the agency as well as other developers using the data, so agencies should take steps to encourage error reporting. Labels in a dataset allow developers to specify errors efficiently and unambiguously. An easy-to-find channel for reporting errors, such as a prominently provided email address or web form, is also critical. Tracking down the contact information of the person responsible for a dataset can be difficult, and a well-known channel reduces this barrier to feedback.

Upon learning of an issue in a dataset, an agency should correct the problem and release the corrected dataset in a timely manner. An important fact to keep in mind when correcting data is that numerous developers may have already downloaded and begun using the old flawed version. For these developers, even a minor modification can cause major issues if not done carefully. Agencies should think about two things: how they will make developers aware that the dataset has been modified and how they will change the dataset itself. The first point is sometimes ignored in spite of its importance. Not only should datasets contain version information, but agencies should also notify developers when the data that they rely on has changed. In particular, agencies should allow developers to subscribe to an email list or an RSS feed for specific datasets that details updates in a well-structured manner. These updates should clearly specify the dataset and version affected, a location where the updated dataset can be found, and a description of the changes to the dataset. When possible, these changes should be specified via a formal, structured description—for example, a diff output—as well as a brief prose explanation.
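
A structured update of this kind could look roughly like the sketch below, which uses Python's difflib to produce a unified diff between two versions of a small file. The dataset name, version numbers, file names, URL, and the fields of update_notice are all hypothetical and chosen only for illustration.

```python
import difflib

# Version 1 contains an error in Canada's total; version 2 corrects it.
old_version = """Country,Gold,Silver,Bronze,Total
USA,9,15,13,37
Germany,10,13,7,30
Canada,14,7,5,27
""".splitlines(keepends=True)

new_version = """Country,Gold,Silver,Bronze,Total
USA,9,15,13,37
Germany,10,13,7,30
Canada,14,7,5,26
""".splitlines(keepends=True)

# Formal, machine-readable description of the change.
diff_text = "".join(difflib.unified_diff(
    old_version, new_version,
    fromfile="medals-v1.csv", tofile="medals-v2.csv",
))

# The notice an email list or RSS feed might carry (hypothetical fields).
update_notice = {
    "dataset": "medals",
    "version": "2",
    "location": "http://www.agency.gov/data/medals-v2.csv",
    "summary": "Corrected the total medal count for Canada.",
    "changes": diff_text,
}
```

The diff marks the old Canada row with a leading "-" and the corrected row with a leading "+", so a developer's tools can apply or review the change without comparing the files by hand.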

Correction of dataset contents should proceed cautiously. Suppose that an application allows users to comment on parts of a document. If labels in a dataset are not maintained consistently across versions, the developer may need to painstakingly map comments from the old data to the corresponding parts of the new dataset. Issues like this can be mitigated through several practices. First, an agency should seek to preserve labels across versions of a dataset when possible (alternatively, in some cases an agency might wish to change the labels but provide a mapping to assist developers). For example, a dataset might aggregate numerous documents, and a minor change in one document should not necessarily change the labels for the other documents. Recall the side note from our previous post that labels should be separate from ordering information. Corrections to a dataset may add, remove, or reorder items. Detaching order from labels can help agencies ensure label consistency across dataset versions. In addition, the last post and its comments discussed whether agencies should provide a label that is separate from its internally used agency label. This separation allows labels to remain consistent even when Subsection X becomes Section Y based on the internal agency labels. Note that these points about consistent labeling can be useful whenever a dataset could have multiple versions: for example, consistent labeling might be beneficial across various versions of a bill.
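
The commenting scenario can be sketched in a few lines of Python. The labels, document text, and the carry_over function are invented for this example. The point is only that comments keyed by a stable label survive insertion and reordering, while comments keyed by position would not.

```python
# Version 1 of a hypothetical dataset: each item carries a stable label
# that is independent of its position in the list.
version_1 = [
    {"label": "doc-a", "text": "Definitions"},
    {"label": "doc-b", "text": "Reporting requirements"},
    {"label": "doc-c", "text": "Penalties"},
]

# Version 2 adds an item, reorders the list, and corrects one document.
version_2 = [
    {"label": "doc-a", "text": "Definitions"},
    {"label": "doc-d", "text": "Exemptions"},
    {"label": "doc-c", "text": "Penalties (corrected)"},
    {"label": "doc-b", "text": "Reporting requirements"},
]

# User comments stored by label rather than by position.
comments = {
    "doc-b": ["Clarify the deadline."],
    "doc-c": ["Typo in paragraph 2."],
}


def carry_over(comments, new_items, label_map=None):
    """Re-attach comments to a new version, using an optional label mapping."""
    label_map = label_map or {}
    new_labels = {item["label"] for item in new_items}
    kept, orphaned = {}, {}
    for old_label, notes in comments.items():
        new_label = label_map.get(old_label, old_label)
        target = kept if new_label in new_labels else orphaned
        target.setdefault(new_label, []).extend(notes)
    return kept, orphaned


kept, orphaned = carry_over(comments, version_2)
```

Because the labels were preserved, every comment lands on the right document and nothing is orphaned. If an agency had renamed a label, the developer could pass the agency's published mapping as label_map instead of matching documents by hand.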

Similarly, the structure that agencies use for datasets, the locations where the datasets are hosted, and other details of a dataset sometimes must change. Suppose that an agency releases various statistics each month. When the agency is asked to provide a new statistic, the new data may necessitate changes to the XML schema. Alternatively, the agency may decide to host data at the address “http://www.agency.gov/YEAR/MONTH/data.xml” rather than “http://www.agency.gov/MONTH-YEAR/data.xml,” causing issues for automated tools that periodically check for and download new data. To reduce the adverse impact of these changes on developers, agencies should provide detailed notice of the changes as early as possible. Early notice gives developers time to modify their tools. These notifications can occur via an email list or RSS feed providing details of the changes in a clear, consistent format.

The possibility of changes and their impact on developers should be taken into account at all stages of the data production process. Suppose an agency adds an element to a schema that specifies a unique individual, but the schema may someday need to specify a corporation instead. Although the agency should not speculatively add unnecessary elements to the schema, it should be mindful of possible changes when designing the rest of the schema. Various design choices may minimize the impact of a change if necessary later. Agencies should also avoid the urge to alter a schema dramatically each time it requires a minor change. A major overhaul—even when done to clean up the schema—may require equally dramatic changes in tools utilizing the data. To ensure that developers notice changes to XML schemas, both schema files and datasets should contain a prominent schema version number. If an agency changes the location where data is hosted, it should consider temporarily using aliases so that requests using old addresses automatically take you to the correct data. Once the old addresses are phased out, agencies should use a standard HTTP 404 status code to indicate that the requested data was not found at the specified location. Simply supplying a “Not Found” page without this standard code could make life harder for developers whose automated tools must instead parse this page.
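
To see why the status code matters, consider a minimal sketch of the kind of automated tool described above, written with Python's urllib. The function fetch_dataset and the stand-in openers are hypothetical, and the demonstration runs offline. By default urllib follows standard HTTP redirects on its own, which is one way the temporary aliases could be implemented, and a proper 404 surfaces as an HTTPError that the tool can handle in one line.

```python
import io
import urllib.error
import urllib.request


def fetch_dataset(url, opener=urllib.request.urlopen):
    """Return the dataset bytes, or None if the server answers 404."""
    try:
        with opener(url) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return None  # The data is not at this address; no page parsing needed.
        raise  # Other errors (for example 500) are left for the caller.


# Stand-ins for a real server, so the example runs without a network.
def fake_found(url):
    return io.BytesIO(b"<data/>")


def fake_missing(url):
    raise urllib.error.HTTPError(url, 404, "Not Found", None, None)


DATA_URL = "http://www.agency.gov/2010/03/data.xml"  # hypothetical address
found = fetch_dataset(DATA_URL, opener=fake_found)
missing = fetch_dataset(DATA_URL, opener=fake_missing)
```

If the server instead returned a friendly "Not Found" page with a success code, fetch_dataset would hand that page back as if it were data, and the tool would have to inspect the contents to notice the problem.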

When making changes, agencies should consider soliciting input directly from developers. Because the preferences of developers might not be obvious, this input can lead to choices that help developers without increasing the burden on agencies. In fact, developers may even come up with ideas that make life easier for an agency.

Our next and final post in this series will discuss a handful of additional issues for agencies to consider.