November 28, 2024

Deceptive Assurances of Privacy?

Earlier this week, Facebook expanded the roll-out of its facial recognition software to tag people in photos uploaded to the social networking site. Many observers and regulators responded with privacy concerns; EFF offered a video showing users how to opt-out.

Tim O’Reilly, however, takes a different tack:

Face recognition is here to stay. My question is whether to pretend that it doesn’t exist, and leave its use to government agencies, repressive regimes, marketing data mining firms, insurance companies, and other monolithic entities, or whether to come to grips with it as a society by making it commonplace and useful, figuring out the downsides, and regulating those downsides.

…We need to move away from a Maginot-line like approach where we try to put up walls to keep information from leaking out, and instead assume that most things that used to be private are now knowable via various forms of data mining. Once we do that, we start to engage in a question of what uses are permitted, and what uses are not.

O’Reilly’s point –and face-recognition technology — is bigger than Facebook. Even if Facebook swore off the technology tomorrow, it would be out there, and likely used against us unless regulated. Yet we can’t decide on the proper scope of regulation without understanding the technology and its social implications.

By taking these latent capabilities (Riya was demonstrating them years ago; the NSA probably had them decades earlier) and making them visible, Facebook gives us more feedback on the privacy consequences of the tech. If part of that feedback is “ick, creepy” or worse, we should feed that into regulation for the technology’s use everywhere, not just in Facebook’s interface. Merely hiding the feature in the interface, while leaving it active in the background would be deceptive: it would give us a false assurance of privacy. For all its blundering, Facebook seems to be blundering in the right direction now.

Compare the furor around Dropbox’s disclosure “clarification”. Dropbox had claimed that “All files stored on Dropbox servers are encrypted (AES-256) and are inaccessible without your account password,” but recently updated that to the weaker assertion: “Like most online services, we have a small number of employees who must be able to access user data for the reasons stated in our privacy policy (e.g., when legally required to do so).” Dropbox had signaled “encrypted”: absolutely private, when it meant only relatively private. Users who acted on the assurance of complete secrecy were deceived; now those who know the true level of relative secrecy can update their assumptions and adapt behavior more appropriately.

Privacy-invasive technology and the limits of privacy-protection should be visible. Visibility feeds more and better-controlled experiments to help us understand the scope of privacy, publicity, and the space in between (which Woody Hartzog and Fred Stutzman call “obscurity” in a very helpful draft). Then, we should implement privacy rules uniformly to reinforce our social choices.

New Research Result: Bubble Forms Not So Anonymous

Today, Joe Calandrino, Ed Felten and I are releasing a new result regarding the anonymity of fill-in-the-bubble forms. These forms, popular for their use with standardized tests, require respondents to select answer choices by filling in a corresponding bubble. Contradicting a widespread implicit assumption, we show that individuals create distinctive marks on these forms, allowing use of the marks as a biometric. Using a sample of 92 surveys, we show that an individual’s markings enable unique re-identification within the sample set more than half of the time. The potential impact of this work is as diverse as use of the forms themselves, ranging from cheating detection on standardized tests to identifying the individuals behind “anonymous” surveys or election ballots.

If you’ve taken a standardized test or voted in a recent election, you’ve likely used a bubble form. Filling in a bubble doesn’t provide much room for inadvertent variation. As a result, the marks on these forms superficially appear to be largely identical, and minor differences may look random and not replicable. Nevertheless, our work suggests that individuals may complete bubbles in a sufficiently distinctive and consistent manner to allow re-identification. Consider the following bubbles from two different individuals:

These individuals have visibly different stroke directions, suggesting a means of distinguishing between both individuals. While variation between bubbles may be limited, stroke direction and other subtle features permit differentiation between respondents. If we can learn an individual’s characteristic features, we may use those features to identify that individual’s forms in the future.

To test the limits of our analysis approach, we obtained a set of 92 surveys and extracted 20 bubbles from each of those surveys. We set aside 8 bubbles per survey to test our identification accuracy and trained our model on the remaining 12 bubbles per survey. Using image processing techniques, we identified the unique characteristics of each training bubble and trained a classifier to distinguish between the surveys’ respondents. We applied this classifier to the remaining test bubbles from a respondent. The classifier orders the candidate respondents based on the perceived likelihood that they created the test markings. We repeated this test for each of the 92 respondents, recording where the correct respondent fell in the classifier’s ordered list of candidate respondents.

If bubble marking patterns were completely random, a classifier could do no better than randomly guessing a test set’s creator, with an expected accuracy of 1/92 ? 1%. Our classifier achieves over 51% accuracy. The classifier is rarely far off: the correct answer falls in the classifier’s top three guesses 75% of the time (vs. 3% for random guessing) and its top ten guesses more than 92% of the time (vs. 11% for random guessing). We conducted a number of additional experiments exploring the information available from marked bubbles and potential uses of that information. See our paper for details.

Additional testing—particularly using forms completed at different times—is necessary to assess the real-world impact of this work. Nevertheless, the strength of these preliminary results suggests both positive and negative implications depending on the application. For standardized tests, the potential impact is largely positive. Imagine that a student takes a standardized test, performs poorly, and pays someone to repeat the test on his behalf. Comparing the bubble marks on both answer sheets could provide evidence of such cheating. A similar approach could detect third-party modification of certain answers on a single test.

The possible impact on elections using optical scan ballots is more mixed. One positive use is to detect ballot box stuffing—our methods could help identify whether someone replaced a subset of the legitimate ballots with a set of fraudulent ballots completed by herself. On the other hand, our approach could help an adversary with access to the physical ballots or scans of them to undermine ballot secrecy. Suppose an unscrupulous employer uses a bubble form employment application. That employer could test the markings against ballots from an employee’s jurisdiction to locate the employee’s ballot. This threat is more realistic in jurisdictions that release scans of ballots.

Appropriate mitigation of this issue is somewhat application specific. One option is to treat surveys and ballots as if they contain identifying information and avoid releasing them more widely than necessary. Alternatively, modifying the forms to mask marked bubbles can remove identifying information but, among other risks, may remove evidence of respondent intent. Any application demanding anonymity requires careful consideration of options for preventing creation or disclosure of identifying information. Election officials in particular should carefully examine trade-offs and mitigation techniques if releasing ballot scans.

This work provides another example in which implicit assumptions resulted in a failure to recognize a link between the output of a system (in this case, bubble forms or their scans) and potentially sensitive input (the choices made by individuals completing the forms). Joe discussed a similar link between recommendations and underlying user transactions two weeks ago. As technologies advance or new functionality is added to systems, we must explicitly re-evaluate these connections. The release of scanned forms combined with advances in image analysis raises the possibility that individuals may inadvertently tie themselves to their choices merely by how they complete bubbles. Identifying such connections is a critical first step in exploiting their positive uses and mitigating negative ones.

This work will be presented at the 2011 USENIX Security Symposium in August.

Tinkering with the IEEE and ACM copyright policies

It’s historically been the case that papers published in an IEEE or ACM conference or journal must have their copyrights assigned to the IEEE or ACM, respectively. Most of us were happy with this sort of arrangement, but the new IEEE policy seems to apply more restrictions on this process. Matt Blaze blogged about this issue in particular detail.

The IEEE policy and the comparable ACM policy appear to be focused on creating revenue opportunities for these professional societies. Hypothetically, that income should result in cost savings elsewhere (e.g., lower conference registration fees) or in higher quality member services (e.g., paying the expenses of conference program committee members to attend meetings). In practice, neither of these are true. Regardless, our professional societies work hard to keep a paywall between our papers and their readership. Is this sort of behavior in our best interests? Not really.

What benefits the author of an academic paper? In a word, impact. Papers that are more widely read are more widely influential. Furthermore, widely read papers are more widely cited; citation counts are explicitly considered in hiring, promotion, and tenure cases. Anything that gets in the way of a paper’s impact is something that damages our careers and it’s something we need to fix.

There are three common solutions. First, we ignore the rules and post copies of our work on our personal, laboratory, and/or departmental web pages. Virtually any paper written in the past ten years can be found online, without cost, and conveniently cataloged by sites like Google Scholar. Second, some authors I’ve spoken to will significantly edit the copyright assignment forms before submitting them. Nobody apparently ever notices this. Third, some professional societies, notably the USENIX Association, have changed their rules. The USENIX policy completely inverts the relationship between author and publisher. Authors grant USENIX certain limited and reasonable rights, while the authors retain copyright over their work. USENIX then posts all the papers on its web site, free of charge; authors are free to do the same on their own web sites.

(USENIX ensures that every conference proceedings has a proper ISBN number. Every USENIX paper is just as “published” as a paper in any other conference, even though printed proceedings are long gone.)

Somehow, the sky hasn’t fallen. So far as I know, the USENIX Association’s finances still work just fine. Perhaps it’s marginally more expensive to attend a USENIX conference, but then the service level is also much higher. The USENIX professional staff do things that are normally handled by volunteer labor at other conferences.

This brings me to the vote we had last week at the IEEE Symposium on Security and Privacy (the “Oakland” conference) during the business meeting. We had an unusually high attendance (perhaps 150 out of 400 attendees) as there were a variety of important topics under discussion. We spent maybe 15 minutes talking about the IEEE’s copyright policy and the resolution before the room was should we reject the IEEE copyright policy and adopt the USENIX policy? Ultimately, there were two “no” votes and everybody else voted “yes.” That’s an overwhelming statement.

The question is what happens next. I’m planning to attend ACM CCS this October in Chicago and I expect we can have a similar vote there. I hope similar votes can happen at other IEEE and ACM conferences. Get it on the agenda of your business meetings. Vote early and vote often! I certainly hope the IEEE and ACM agree to follow the will of their membership. If the leadership don’t follow the membership, then we’ve got some more interesting problems that we’ll need to solve.

Sidebar: ACM and IEEE make money by reselling our work, particularly with institutional subscriptions to university libraries and large companies. As an ACM or IEEE member, you also get access to some, but not all, of the online library contents. If you make everything free (as in free beer), removing that revenue source, then you’ve got a budget hole to fill. While I’m no budget wizard, it would make sense for our conference registration fees to support the archival online storage of our papers. Add in some online advertising (example: startup companies, hungry to hire engineers with specialized talents, would pay serious fees for advertisements adjacent to research papers in the relevant areas), and I’ll bet everything would work out just fine.

Studying the Frequency of Redaction Failures in PACER

Since we launched RECAP a couple of years ago, one of our top concerns has been privacy. The federal judiciary’s PACER system offers the public online access to hundreds of millions of court records. The judiciary’s rules require each party in a case to redact certain types of information from documents they submit, but unfortunately litigants and their counsel don’t always comply with these rules. Three years ago, Carl Malamud did a groundbreaking audit of PACER documents and found more than 1600 cases in which litigants submitted documents with unredacted Social Security numbers. My recent research has focused on a different problem: cases where parties tried to redact sensitive information but the redactions failed for technical reasons. This problem occasionally pops up in news stories, but as far as I know, no one has conducted a systematic study.

To understand the problem, it helps to know a little bit about how computers represent graphics. The simplest image formats are bitmap or raster formats. These represent an image as an array of pixels, with each pixel having a color represented by a numeric value. The PDF format uses a different approach, known as vector graphics, that represent an image as a series of drawing commands: lines, rectangles, lines of text, and so forth.

Vector graphics have important advantages. Vector-based formats “scale up” gracefully, in contrast to the raster images that look “blocky” at high resolutions. Vector graphics also do a better job of preserving a document’s structure. For example, text in a PDF is represented by a sequence of explicit text-drawing commands, which is why you can cut and paste text from a PDF document, but not from a raster format like PNG.

But vector-based formats also have an important disadvantage: they may contain more information than is visible to the naked eye. Raster images have a “what you see is what you get” quality—changing all the pixels in a particular region to black destroys the information that was previously in that part of the image. But a vector-based image can have multiple “layers.” There might be a command to draw some text followed by a command to draw a black rectangle over the text. The image might look like it’s been redacted, but the text is still “under” the box. And often extracting that information is a simple matter of cutting and pasting.

So how many PACER documents have this problem? We’re in a good position to study this question because we have a large collection of PACER documents—1.8 million of them when I started my research last year. I wrote software to detect redaction rectangles—it turns out these are relatively easy to recognize based on their color, shape, and the specific commands used to draw them. Out of 1.8 million PACER documents, there were approximately 2000 documents with redaction rectangles. (There were also about 3500 documents that were redacted by replacing text by strings of Xes, I also excluded documents that were redacted by Carl Malamud before he donated them to our archive.)

Next, my software checked to see if these redaction rectangles overlapped with text. My software identified a few hundred documents that appeared to have text under redaction rectangles, and examining them by hand revealed 194 documents with failed redactions. The majority of the documents (about 130) appear be from commercial litigation, in which parties have unsuccessfully attempted to redact trade secrets such as sales figures and confidential product information. Other improperly redacted documents contain sensitive medical information, addresses, and dates of birth. Still others contain the names of witnesses, jurors, plaintiffs, and one minor.

Implications

PACER reportedly contains about 500 million documents. We don’t have a random sample of PACER documents, so we should be careful about trying to extrapolate to the entire PACER corpus. Still, it’s safe to say there are thousands, and probably tens of thousands, of documents in PACER whose authors made unsuccessful attempts to conceal information.

It’s also important to note that my software may not be detecting every instance of redaction failures. If a PDF was created by scanning in a paper document (as opposed to generated directly from a word processor), then it probably won’t have a “text layer.” My software doesn’t detect redaction failures in this type of document. This means that there may be more than 194 failed redactions among the 1.8 million documents I studied.

A few weeks ago I wrote a letter to Judge Lee Rosenthal, chair of the federal judiciary’s Committee on Rules of Practice and Procedure, explaining this problem. In that letter I recommend that the courts themselves use software like mine to automatically scan PACER documents for this type of problem. In addition to scanning the documents they already have, the courts should make it a standard part of the process for filing new documents with the courts. This would allow the courts to catch these problems before the documents are made available to the public on the PACER website.

My code is available here. It’s experimental research code, not a finished product. We’re releasing it into the public domain using the CC0 license; this should make it easy for federal and state officials to adapt it for their own use. Court administrators who are interested in adapting the code for their own use are especially encouraged to contact me for advice and assistance. The code relies heavily on the CAM::PDF Perl library, and I’m indebted to Chris Dolan for his patient answers to my many dumb questions.

Getting Redaction Right

So what should litigants do to avoid this problem? The National Security Agency has a good primer on secure redaction. The approach they recommend—completely deleting sensitive information in the original word processing document, replacing it with innocuous filler (such as strings of XXes) as needed, and then converting it to a PDF document, is the safest approach. The NSA primer also explains how to check for other potentially sensitive information that might be hidden in a document’s metadata.

Of course, there may be cases where this approach isn’t feasible because a litigant doesn’t have the original word processing document or doesn’t want the document’s layout to be changed by the redaction process. Adobe Acrobat’s redaction tool has worked correctly when we’ve used it, and Adobe probably has the expertise to do it correctly. There may be other tools that work correctly, but we haven’t had an opportunity to experiment with them so we can’t say which ones they might be.

Regardless of the tool used, it’s a good idea to take the redacted document and double-check that the information was removed. An easy way to do this is to simply cut and paste the “redacted” content into another document. If the redaction succeeded, no text should be transferred. This method will catch most, but not all, redaction failures. A more rigorous check is to remove the redaction rectangles from the document and manually observe what’s underneath them. One of the scripts I’m releasing today, called remove_rectangles.pl, does just that. In its current form, it’s probably not user-friendly enough for non-programmers to use, but it would be relatively straightforward for someone (perhaps Adobe or the courts) to build a user-friendly version that ordinary users could use to verify that the document they just attempted to redact actually got redacted.

One approach we don’t endorse is printing the document out, redacting it with a black marker, and then re-scanning it to PDF format. Although this may succeed in removing the sensitive information, we don’t recommend this approach because it effectively converts the document into a raster-based image, destroying useful information in the process. For example, it will no longer be possible to cut and paste (non-redacted) text from a document that has been redacted in this way.

Bad redactions are not a new problem, but they are taking on a new urgency as PACER documents become increasingly available on the web. Correct redaction is not difficult, but it does require both knowledge and care by those who are submitting the documents. The courts have several important roles they should play: educating attorneys about their redaction responsibilities, providing them with software tools that make it easy for them to comply, and monitoring submitted documents to verify that the rules are being followed.

This research was made possible with the financial support of Carl Malamud’s organization, Public.Resource.Org.

"You Might Also Like:" Privacy Risks of Collaborative Filtering

Ann Kilzer, Arvind Narayanan, Ed Felten, Vitaly Shmatikov, and I have released a new research paper detailing the privacy risks posed by collaborative filtering recommender systems. To examine the risk, we use public data available from Hunch, LibraryThing, Last.fm, and Amazon in addition to evaluating a synthetic system using data from the Netflix Prize dataset. The results demonstrate that temporal changes in recommendations can reveal purchases or other transactions of individual users.

To help users find items of interest, sites routinely recommend items similar to a given item. For example, product pages on Amazon contain a “Customers Who Bought This Item Also Bought” list. These recommendations are typically public, and they are the product of patterns learned from all users of the system. If customers often purchase both item A and item B, a collaborative filtering system will judge them to be highly similar. Most sites generate ordered lists of similar items for any given item, but some also provide numeric similarity scores.

Although item similarity is only indirectly related to individual transactions, we determined that temporal changes in item similarity lists or scores can reveal details of those transactions. If you’re a Mozart fan and you listen to a Justin Bieber song, this choice increases the perceived similarity between Justin Bieber and Mozart. Because similarity lists and scores are based on perceived similarity, your action may result in changes to these scores or lists.

Suppose that an attacker knows some of your past purchases on a site: for example, past item reviews, social networking profiles, or real-world interactions are a rich source of information. New purchases will affect the perceived similarity between the new items and your past purchases, possibility causing visible changes to the recommendations provided for your previously purchased items. We demonstrate that an attacker can leverage these observable changes to infer your purchases. Among other things, these attacks are complicated by the fact that multiple users simultaneously interact with a system and updates are not immediate following a transaction.

To evaluate our attacks, we use data from Hunch, LibraryThing, Last.fm, and Amazon. Our goal is not to claim privacy flaws in these specific sites (in fact, we often use data voluntarily disclosed by their users to verify our inferences), but to demonstrate the general feasibility of inferring individual transactions from the outputs of collaborative filtering systems. Among their many differences, these sites vary dramatically in the information that they reveal. For example, Hunch reveals raw item-to-item correlation scores, but Amazon reveals only lists of similar items. In addition, we examine a simulated system created using the Netflix Prize dataset. Our paper outlines the experimental results.

While inference of a Justin Bieber interest may be innocuous, inferences could expose anything from dissatisfaction with a job to health issues. Our attacks assume that a victim reveals certain past transactions, but users may publicly reveal certain transactions while preferring to keep others private. Ultimately, users are best equipped to determine which transactions would be embarrassing or otherwise problematic. We demonstrate that the public outputs of recommender systems can reveal transactions without user knowledge or consent.

Unfortunately, existing privacy technologies appear inadequate here, failing to simultaneously guarantee acceptable recommendation quality and user privacy. Mitigation strategies are a rich area for future work, and we hope to work towards solutions with others in the community.

Worth noting is that this work suggests a risk posed by any feature that adapts in response to potentially sensitive user actions. Unless sites explicitly consider the data exposed, such features may inadvertently leak details of these underlying actions.

Our paper contains additional details. This work was presented earlier today at the 2011 IEEE Symposium on Security and Privacy. Arvind has also blogged about this work.