January 10, 2025

The End of Gnutella?

Almost exactly 2 years ago, I wrote an essay that examined the case of Arista Records et al v. Lime Group et al. It was presented on Freedom-to-Tinker in a series of three posts (1, 2, 3). These articles presented an analysis which showed that any open filesharing network, such as Gnutella, is vulnerable to spamming. Lime Wire, without advertising as much, was acting as a spam cop for Gnutella, keeping the network safe for infringers. It was my view that the decision in the case could be made to turn on the actions that Lime Wire was taking to control spammers on the Gnutella network, and if the case were examined in that light, Lime Wire could be found liable for contributory infringement while still respecting the First Amendment rights of software publishers.

Since that time, a great deal has occurred in the world of filesharing. It is worthwhile to examine the the current state of affairs, which is predictable in some ways and yet quite surprising in others.

continue reading…

Retiring FedThread

Nearly two years ago, the Federal Register was published in a structured XML format for the first time. This was a big deal in the open government world: the Federal Register, often called the daily newspaper of our federal government, is one of our government’s most widely read publications. And while it could previously be read in paper and PDF forms, it wasn’t easy to digitally manipulate. The XML release changed all this.

When we heard this was happening, four of us here at CITP—Ari Feldman, Bill Zeller, Joe Calandrino, and myself—decided to see how we might be able to improve how citizens could interact with the Federal Register. Our big idea was to make it easy for anyone to comment paragraph-by-paragraph on any of its documents, like a proposed regulation. The site, which we called FedThread, would provide an informal public forum for annotating these documents, and we hoped it would lead to useful online discussions about the merits and weaknesses of all kinds of federal regulatory activity. We also added other useful features, like a full-text search engine and custom RSS feeds. Building these features for the Federal Register only became a straightforward task because of the new XML version. We built the site in just eight days from conception to release.

Another trio of developers in SF also saw opportunities in this free machine-readable resource and developed their own project called GovPulse, which had already won the Sunlight Foundation’s Apps for America 2 contest. They were then approached by the staff of the Federal Register last summer to expand their site to create what would become the new online face of the publication, Federal Register 2.0. Their approach to user comments actually guided users into participating in the formal regulatory comment process—a great idea. Federal Register 2.0 included several features present in FedThread, and many more. Everything was done using open source tools, and made available to the public as open source.

This has left little reason for us to continue operating FedThread. It has continued to reliably provide the features we developed two years ago, but our regular users will find it straightforward to transition to the similar (and often superior) search and subscription features on Federal Register 2.0. So, we’re retiring FedThread. However, the code that we developed will continue to be available, and we hope that enterprising developers will find components to re-use in their own projects that benefit society. For instance, the general purpose paragraph-commenting code that we developed can be useful in a variety of projects. Of course, that code itself was an adaptation of the code supporting another open source project—the Django Book, a free set of documentation about the web framework that we were using to build FedThread (but this is what developers would call a “meta” observation).

Ideally, this is how hacking open government should work. Free machine readable data sets beget useful new ways for citizens to explore those data and make it useful to other citizens. Along the way, they experiment with different ideas, some of which catch on and others of which serve as fodder for the next great idea. This happens faster than standard government contracting, and often produces more innovative results.

Finally, a big thanks to the GPO, NARA and the White House Open Government Initiative for making FedThread possible and for helping to demonstrate that this approach can work, and congratulations on the fantastic Federal Register 2.0.

What Gets Redacted in Pacer?

In my research on privacy problems in PACER, I spent a lot of time examining PACER documents. In addition to researching the problem of “bad” redactions, I was also interested in learning about the pattern of redactions generally. To this end, my software looked for two redaction styles. One is the “black rectangle” redaction method I described in my previous post. This method sometimes fails, but most of these redactions were done successfully. The more common method (around two-thirds of all redactions) involves replacing sensitive information with strings of XXs.

Out of the 1.8 million documents it scanned, my software identified around 11,000 documents that appeared to have redactions. Many of them could be classified automatically (for example “123-45-xxxx” is clearly a redacted Social Security number, and “Exxon” is a false positive) but I examined several thousand by hand.

Here is the distribution of the redacted documents I found.

Type of Sensitive Information No. of Documents
Social Security number 4315
Bank or other account number 675
Address 449
Trade secret 419
Date of birth 290
Unique identifier other than SSN 216
Name of person 129
Phone, email, IP address 60
National security related 26
Health information 24
Miscellaneous 68
Total 6208

To reiterate the point I made in my last post, I didn’t have access to a random sample of the PACER corpus, so we should be cautious about drawing any precise conclusions about the distribution of redacted information in the entire PACER corpus.

Still, I think we can draw some interesting conclusions from these statistics. It’s reasonable to assume that the distribution of redacted sensitive information is similar to the distribution of sensitive information in general. That is, assuming that parties who redact documents do a decent job, this list gives us a (very rough) idea of what kinds of sensitive information can be found in PACER documents.

The most obvious lesson from these statistics is that Social Security numbers are by far the most common type of redacted information in PACER. This is good news, since it’s relatively easy to build software to automatically detect and redact Social Security numbers.

Another interesting case is the “address” category. Almost all of the redacted items in this category—393 out of 449—appear in the District of Columbia District. Many of the documents relate to search warrants and police reports, often in connection with drug cases. I don’t know if the high rate of redaction reflects the different mix of cases in the DC District, or an idiosyncratic redaction policy voluntarily pursued by the courts and/or the DC police but not by officials in other districts. It’s worth noting that the redaction of addresses doesn’t appear to be required by the federal redaction rules.

Finally, there’s the category of “trade secrets,” which is a catch-all term I used for documents whose redactions appear to be confidential business information. Private businesses may have a strong interest in keeping this information confidential, but the public interest in such secrecy here is less clear.

To summarize, out of 6208 redacted documents, there are 4315 Social Security that can be redacted automatically by machine, 449 addresses whose redaction doesn’t seem to be required by the rules of procedure, and 419 “trade secrets” whose release will typically only harm the party who fails to redact it.

That leaves around 1000 documents that would expose risky confidential information if not properly redacted, or about 0.05 percent of the 1.8 million documents I started with. A thousand documents is worth taking seriously (especially given that there are likely to be tens of thousands in the full PACER corpus). The courts should take additional steps to monitor compliance with the redaction rules and sanction parties who fail to comply with them, and they should explore techniques to automate the detection of redaction failures in these categories.

But at the same time, a sense of perspective is important. This tiny fraction of PACER documents with confidential information in them is a cause for concern, but it probably isn’t a good reason to limit public access to the roughly 99.9 percent of documents that contain no sensitive information and may be of significant benefit to the public.

Thanks again to Carl Malamud and Public.Resource.Org for their support of my research.

Universities in Brazil are too closed to the world, and that's bad for innovation

When Brazilian president Dilma Roussef visited China in the beginning of May, she came back with some good news (maybe too good to be entirely true). Among them, the announcement that Foxconn, the largest maker of electronic components, will invest US$12 billion to open a large industrial plant in the country. The goal is to produce iPads and other key electronic components locally.

The announcement was praised, and made it quickly to the headlines of all major newspapers. There is certainly reason for excitement. Brazil lost important waves of economic development, including industrialization (which only really happened in the 1940´s), or the semiconductor wave, an industry that has shown but a few signs of development in the country until now. (continue reading)

New Research Result: Bubble Forms Not So Anonymous

Today, Joe Calandrino, Ed Felten and I are releasing a new result regarding the anonymity of fill-in-the-bubble forms. These forms, popular for their use with standardized tests, require respondents to select answer choices by filling in a corresponding bubble. Contradicting a widespread implicit assumption, we show that individuals create distinctive marks on these forms, allowing use of the marks as a biometric. Using a sample of 92 surveys, we show that an individual’s markings enable unique re-identification within the sample set more than half of the time. The potential impact of this work is as diverse as use of the forms themselves, ranging from cheating detection on standardized tests to identifying the individuals behind “anonymous” surveys or election ballots.

If you’ve taken a standardized test or voted in a recent election, you’ve likely used a bubble form. Filling in a bubble doesn’t provide much room for inadvertent variation. As a result, the marks on these forms superficially appear to be largely identical, and minor differences may look random and not replicable. Nevertheless, our work suggests that individuals may complete bubbles in a sufficiently distinctive and consistent manner to allow re-identification. Consider the following bubbles from two different individuals:

These individuals have visibly different stroke directions, suggesting a means of distinguishing between both individuals. While variation between bubbles may be limited, stroke direction and other subtle features permit differentiation between respondents. If we can learn an individual’s characteristic features, we may use those features to identify that individual’s forms in the future.

To test the limits of our analysis approach, we obtained a set of 92 surveys and extracted 20 bubbles from each of those surveys. We set aside 8 bubbles per survey to test our identification accuracy and trained our model on the remaining 12 bubbles per survey. Using image processing techniques, we identified the unique characteristics of each training bubble and trained a classifier to distinguish between the surveys’ respondents. We applied this classifier to the remaining test bubbles from a respondent. The classifier orders the candidate respondents based on the perceived likelihood that they created the test markings. We repeated this test for each of the 92 respondents, recording where the correct respondent fell in the classifier’s ordered list of candidate respondents.

If bubble marking patterns were completely random, a classifier could do no better than randomly guessing a test set’s creator, with an expected accuracy of 1/92 ? 1%. Our classifier achieves over 51% accuracy. The classifier is rarely far off: the correct answer falls in the classifier’s top three guesses 75% of the time (vs. 3% for random guessing) and its top ten guesses more than 92% of the time (vs. 11% for random guessing). We conducted a number of additional experiments exploring the information available from marked bubbles and potential uses of that information. See our paper for details.

Additional testing—particularly using forms completed at different times—is necessary to assess the real-world impact of this work. Nevertheless, the strength of these preliminary results suggests both positive and negative implications depending on the application. For standardized tests, the potential impact is largely positive. Imagine that a student takes a standardized test, performs poorly, and pays someone to repeat the test on his behalf. Comparing the bubble marks on both answer sheets could provide evidence of such cheating. A similar approach could detect third-party modification of certain answers on a single test.

The possible impact on elections using optical scan ballots is more mixed. One positive use is to detect ballot box stuffing—our methods could help identify whether someone replaced a subset of the legitimate ballots with a set of fraudulent ballots completed by herself. On the other hand, our approach could help an adversary with access to the physical ballots or scans of them to undermine ballot secrecy. Suppose an unscrupulous employer uses a bubble form employment application. That employer could test the markings against ballots from an employee’s jurisdiction to locate the employee’s ballot. This threat is more realistic in jurisdictions that release scans of ballots.

Appropriate mitigation of this issue is somewhat application specific. One option is to treat surveys and ballots as if they contain identifying information and avoid releasing them more widely than necessary. Alternatively, modifying the forms to mask marked bubbles can remove identifying information but, among other risks, may remove evidence of respondent intent. Any application demanding anonymity requires careful consideration of options for preventing creation or disclosure of identifying information. Election officials in particular should carefully examine trade-offs and mitigation techniques if releasing ballot scans.

This work provides another example in which implicit assumptions resulted in a failure to recognize a link between the output of a system (in this case, bubble forms or their scans) and potentially sensitive input (the choices made by individuals completing the forms). Joe discussed a similar link between recommendations and underlying user transactions two weeks ago. As technologies advance or new functionality is added to systems, we must explicitly re-evaluate these connections. The release of scanned forms combined with advances in image analysis raises the possibility that individuals may inadvertently tie themselves to their choices merely by how they complete bubbles. Identifying such connections is a critical first step in exploiting their positive uses and mitigating negative ones.

This work will be presented at the 2011 USENIX Security Symposium in August.