November 24, 2024

New Research Result: Bubble Forms Not So Anonymous

Today, Joe Calandrino, Ed Felten and I are releasing a new result regarding the anonymity of fill-in-the-bubble forms. These forms, popular for their use with standardized tests, require respondents to select answer choices by filling in a corresponding bubble. Contradicting a widespread implicit assumption, we show that individuals create distinctive marks on these forms, allowing use of the marks as a biometric. Using a sample of 92 surveys, we show that an individual’s markings enable unique re-identification within the sample set more than half of the time. The potential impact of this work is as diverse as use of the forms themselves, ranging from cheating detection on standardized tests to identifying the individuals behind “anonymous” surveys or election ballots.

If you’ve taken a standardized test or voted in a recent election, you’ve likely used a bubble form. Filling in a bubble doesn’t provide much room for inadvertent variation. As a result, the marks on these forms superficially appear to be largely identical, and minor differences may look random and not replicable. Nevertheless, our work suggests that individuals may complete bubbles in a sufficiently distinctive and consistent manner to allow re-identification. Consider the following bubbles from two different individuals:

These individuals have visibly different stroke directions, suggesting a means of distinguishing between them. While variation between bubbles may be limited, stroke direction and other subtle features permit differentiation between respondents. If we can learn an individual’s characteristic features, we may use those features to identify that individual’s forms in the future.

To test the limits of our analysis approach, we obtained a set of 92 surveys and extracted 20 bubbles from each of those surveys. We set aside 8 bubbles per survey to test our identification accuracy and trained our model on the remaining 12 bubbles per survey. Using image processing techniques, we identified the unique characteristics of each training bubble and trained a classifier to distinguish between the surveys’ respondents. We applied this classifier to the remaining test bubbles from a respondent. The classifier orders the candidate respondents based on the perceived likelihood that they created the test markings. We repeated this test for each of the 92 respondents, recording where the correct respondent fell in the classifier’s ordered list of candidate respondents.

If bubble marking patterns were completely random, a classifier could do no better than randomly guessing a test set’s creator, with an expected accuracy of 1/92 ≈ 1%. Our classifier achieves over 51% accuracy. The classifier is rarely far off: the correct answer falls in the classifier’s top three guesses 75% of the time (vs. 3% for random guessing) and its top ten guesses more than 92% of the time (vs. 11% for random guessing). We conducted a number of additional experiments exploring the information available from marked bubbles and potential uses of that information. See our paper for details.
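
To make the evaluation concrete, here is a minimal sketch of the rank-based scoring described above. It is an illustration under assumptions, not our actual pipeline: the feature vectors, the random-forest model, and the function names are stand-ins, and the real image features and classifier are described in the paper.

    # Minimal sketch of rank-based re-identification scoring (illustrative only).
    # Assumes each bubble has already been reduced to a numeric feature vector.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def rank_accuracy(train_X, train_y, test_X, test_y, top_k=(1, 3, 10)):
        """Train on labeled bubbles, then rank candidate respondents for each
        held-out respondent and report how often the right one lands in the top k."""
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(train_X, train_y)                    # train_y: respondent id per bubble

        hits = {k: 0 for k in top_k}
        respondents = np.unique(test_y)
        for r in respondents:
            # Average class probabilities over this respondent's test bubbles.
            probs = clf.predict_proba(test_X[test_y == r]).mean(axis=0)
            ranked = clf.classes_[np.argsort(probs)[::-1]]   # best guess first
            rank = int(np.where(ranked == r)[0][0])
            for k in top_k:
                hits[k] += int(rank < k)
        return {k: hits[k] / len(respondents) for k in top_k}

With 92 respondents, random guessing would put the correct respondent in the top one, three, and ten roughly 1%, 3%, and 11% of the time, which is the baseline against which the numbers above should be read.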

Additional testing—particularly using forms completed at different times—is necessary to assess the real-world impact of this work. Nevertheless, the strength of these preliminary results suggests both positive and negative implications depending on the application. For standardized tests, the potential impact is largely positive. Imagine that a student takes a standardized test, performs poorly, and pays someone to repeat the test on his behalf. Comparing the bubble marks on both answer sheets could provide evidence of such cheating. A similar approach could detect third-party modification of certain answers on a single test.

The possible impact on elections using optical scan ballots is more mixed. One positive use is to detect ballot box stuffing—our methods could help identify whether someone replaced a subset of the legitimate ballots with a set of fraudulent ballots completed by herself. On the other hand, our approach could help an adversary with access to the physical ballots or scans of them to undermine ballot secrecy. Suppose an unscrupulous employer uses a bubble form employment application. That employer could test the markings against ballots from an employee’s jurisdiction to locate the employee’s ballot. This threat is more realistic in jurisdictions that release scans of ballots.

Appropriate mitigation of this issue is somewhat application specific. One option is to treat surveys and ballots as if they contain identifying information and avoid releasing them more widely than necessary. Alternatively, modifying the forms to mask marked bubbles can remove identifying information but, among other risks, may remove evidence of respondent intent. Any application demanding anonymity requires careful consideration of options for preventing creation or disclosure of identifying information. Election officials in particular should carefully examine trade-offs and mitigation techniques if releasing ballot scans.

This work provides another example in which implicit assumptions resulted in a failure to recognize a link between the output of a system (in this case, bubble forms or their scans) and potentially sensitive input (the choices made by individuals completing the forms). Joe discussed a similar link between recommendations and underlying user transactions two weeks ago. As technologies advance or new functionality is added to systems, we must explicitly re-evaluate these connections. The release of scanned forms combined with advances in image analysis raises the possibility that individuals may inadvertently tie themselves to their choices merely by how they complete bubbles. Identifying such connections is a critical first step in exploiting their positive uses and mitigating negative ones.

This work will be presented at the 2011 USENIX Security Symposium in August.

Tinkering with the IEEE and ACM copyright policies

It’s historically been the case that papers published in an IEEE or ACM conference or journal must have their copyrights assigned to the IEEE or ACM, respectively. Most of us were happy with this sort of arrangement, but the new IEEE policy seems to impose more restrictions on this process. Matt Blaze blogged about this issue in particular detail.

The IEEE policy and the comparable ACM policy appear to be focused on creating revenue opportunities for these professional societies. Hypothetically, that income should result in cost savings elsewhere (e.g., lower conference registration fees) or in higher quality member services (e.g., paying the expenses of conference program committee members to attend meetings). In practice, neither of these is true. Regardless, our professional societies work hard to keep a paywall between our papers and their readership. Is this sort of behavior in our best interests? Not really.

What benefits the author of an academic paper? In a word, impact. Papers that are more widely read are more influential. Furthermore, widely read papers are more widely cited; citation counts are explicitly considered in hiring, promotion, and tenure cases. Anything that gets in the way of a paper’s impact damages our careers, and it’s something we need to fix.

There are three common solutions. First, we ignore the rules and post copies of our work on our personal, laboratory, and/or departmental web pages. Virtually any paper written in the past ten years can be found online, without cost, and conveniently cataloged by sites like Google Scholar. Second, some authors I’ve spoken to will significantly edit the copyright assignment forms before submitting them. Nobody apparently ever notices this. Third, some professional societies, notably the USENIX Association, have changed their rules. The USENIX policy completely inverts the relationship between author and publisher. Authors grant USENIX certain limited and reasonable rights, while the authors retain copyright over their work. USENIX then posts all the papers on its web site, free of charge; authors are free to do the same on their own web sites.

(USENIX ensures that every conference proceedings volume has a proper ISBN. Every USENIX paper is just as “published” as a paper in any other conference, even though printed proceedings are long gone.)

Somehow, the sky hasn’t fallen. So far as I know, the USENIX Association’s finances still work just fine. Perhaps it’s marginally more expensive to attend a USENIX conference, but then the service level is also much higher. The USENIX professional staff do things that are normally handled by volunteer labor at other conferences.

This brings me to the vote we had last week at the IEEE Symposium on Security and Privacy (the “Oakland” conference) during the business meeting. We had an unusually high attendance (perhaps 150 out of 400 attendees) as there were a variety of important topics under discussion. We spent maybe 15 minutes talking about the IEEE’s copyright policy, and the resolution before the room was whether we should reject the IEEE copyright policy and adopt the USENIX policy instead. Ultimately, there were two “no” votes and everybody else voted “yes.” That’s an overwhelming statement.

The question is what happens next. I’m planning to attend ACM CCS this October in Chicago and I expect we can have a similar vote there. I hope similar votes can happen at other IEEE and ACM conferences. Get it on the agenda of your business meetings. Vote early and vote often! I certainly hope the IEEE and ACM agree to follow the will of their membership. If the leadership doesn’t follow the membership, then we’ve got some more interesting problems that we’ll need to solve.

Sidebar: ACM and IEEE make money by reselling our work, particularly with institutional subscriptions to university libraries and large companies. As an ACM or IEEE member, you also get access to some, but not all, of the online library contents. If you make everything free (as in free beer), removing that revenue source, then you’ve got a budget hole to fill. While I’m no budget wizard, it would make sense for our conference registration fees to support the archival online storage of our papers. Add in some online advertising (example: startup companies, hungry to hire engineers with specialized talents, would pay serious fees for advertisements adjacent to research papers in the relevant areas), and I’ll bet everything would work out just fine.

Studying the Frequency of Redaction Failures in PACER

Since we launched RECAP a couple of years ago, one of our top concerns has been privacy. The federal judiciary’s PACER system offers the public online access to hundreds of millions of court records. The judiciary’s rules require each party in a case to redact certain types of information from documents they submit, but unfortunately litigants and their counsel don’t always comply with these rules. Three years ago, Carl Malamud did a groundbreaking audit of PACER documents and found more than 1600 cases in which litigants submitted documents with unredacted Social Security numbers. My recent research has focused on a different problem: cases where parties tried to redact sensitive information but the redactions failed for technical reasons. This problem occasionally pops up in news stories, but as far as I know, no one has conducted a systematic study.

To understand the problem, it helps to know a little bit about how computers represent graphics. The simplest image formats are bitmap or raster formats. These represent an image as an array of pixels, with each pixel having a color represented by a numeric value. The PDF format uses a different approach, known as vector graphics, that represents an image as a series of drawing commands: lines, rectangles, lines of text, and so forth.

Vector graphics have important advantages. Vector-based formats “scale up” gracefully, in contrast to raster images, which look “blocky” at high resolutions. Vector graphics also do a better job of preserving a document’s structure. For example, text in a PDF is represented by a sequence of explicit text-drawing commands, which is why you can cut and paste text from a PDF document, but not from a raster format like PNG.

But vector-based formats also have an important disadvantage: they may contain more information than is visible to the naked eye. Raster images have a “what you see is what you get” quality—changing all the pixels in a particular region to black destroys the information that was previously in that part of the image. But a vector-based image can have multiple “layers.” There might be a command to draw some text followed by a command to draw a black rectangle over the text. The image might look like it’s been redacted, but the text is still “under” the box. And often extracting that information is a simple matter of cutting and pasting.
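
To see how little effort the “cut and paste” recovery takes, here is a rough sketch using the PyMuPDF library (an illustrative choice; any PDF library that extracts text would do). Given the coordinates of a black box, it prints whatever text is still sitting underneath; the filename and rectangle are hypothetical.

    # Sketch only: recover text hidden "under" a drawn rectangle.
    import fitz  # PyMuPDF

    doc = fitz.open("filing.pdf")        # hypothetical document
    page = doc[0]
    box = fitz.Rect(72, 400, 300, 420)   # hypothetical redaction rectangle

    # The text-drawing commands are unaffected by the rectangle drawn over them,
    # so extraction clipped to the box returns the "redacted" content.
    print(page.get_text("text", clip=box))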

So how many PACER documents have this problem? We’re in a good position to study this question because we have a large collection of PACER documents—1.8 million of them when I started my research last year. I wrote software to detect redaction rectangles—it turns out these are relatively easy to recognize based on their color, shape, and the specific commands used to draw them. Out of 1.8 million PACER documents, there were approximately 2000 documents with redaction rectangles. (There were also about 3500 documents that were redacted by replacing text with strings of Xes; I also excluded documents that were redacted by Carl Malamud before he donated them to our archive.)

Next, my software checked to see if these redaction rectangles overlapped with text. My software identified a few hundred documents that appeared to have text under redaction rectangles, and examining them by hand revealed 194 documents with failed redactions. The majority of the documents (about 130) appear to be from commercial litigation, in which parties have unsuccessfully attempted to redact trade secrets such as sales figures and confidential product information. Other improperly redacted documents contain sensitive medical information, addresses, and dates of birth. Still others contain the names of witnesses, jurors, plaintiffs, and one minor.
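
The released code is written in Perl (see below), but the core idea fits in a few lines. Here is a simplified sketch in Python using PyMuPDF; the black-fill test and overlap check are stand-ins for the real heuristics, which also look at the specific drawing commands used.

    # Simplified sketch of redaction-failure detection (not the released Perl code).
    # Flags pages where a solid black, box-shaped fill overlaps extractable text.
    import fitz  # PyMuPDF

    def find_suspect_pages(path):
        suspects = []
        doc = fitz.open(path)
        for page in doc:
            words = page.get_text("words")      # tuples: (x0, y0, x1, y1, word, ...)
            for drawing in page.get_drawings():
                rect = drawing["rect"]
                # Heuristic stand-in: a solid black fill of plausible size.
                if drawing.get("fill") != (0.0, 0.0, 0.0):
                    continue
                if rect.width < 5 or rect.height < 5:
                    continue
                covered = [w[4] for w in words if rect.intersects(fitz.Rect(*w[:4]))]
                if covered:
                    suspects.append((page.number, rect, covered))
        return suspects

    for page_no, rect, words in find_suspect_pages("filing.pdf"):
        print(f"page {page_no}: text under {rect}: {' '.join(words)}")

Documents flagged this way still need to be examined by hand, which is how the few hundred automated hits were narrowed down to 194 genuine failures.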

Implications

PACER reportedly contains about 500 million documents. We don’t have a random sample of PACER documents, so we should be careful about trying to extrapolate to the entire PACER corpus. Still, it’s safe to say there are thousands, and probably tens of thousands, of documents in PACER whose authors made unsuccessful attempts to conceal information.

It’s also important to note that my software may not detect every redaction failure. If a PDF was created by scanning in a paper document (as opposed to being generated directly from a word processor), then it probably won’t have a “text layer.” My software doesn’t detect redaction failures in this type of document. This means that there may be more than 194 failed redactions among the 1.8 million documents I studied.

A few weeks ago I wrote a letter to Judge Lee Rosenthal, chair of the federal judiciary’s Committee on Rules of Practice and Procedure, explaining this problem. In that letter I recommend that the courts themselves use software like mine to automatically scan PACER documents for this type of problem. In addition to scanning the documents they already have, the courts should make such scanning a standard part of the process for filing new documents. This would allow the courts to catch these problems before the documents are made available to the public on the PACER website.

My code is available here. It’s experimental research code, not a finished product. We’re releasing it into the public domain using the CC0 license; this should make it easy for federal and state officials to adapt it for their own use. Court administrators who are interested in adapting the code for their own use are especially encouraged to contact me for advice and assistance. The code relies heavily on the CAM::PDF Perl library, and I’m indebted to Chris Dolan for his patient answers to my many dumb questions.

Getting Redaction Right

So what should litigants do to avoid this problem? The National Security Agency has a good primer on secure redaction. The approach they recommend is the safest: completely delete the sensitive information in the original word processing document, replace it with innocuous filler (such as strings of XXes) as needed, and then convert it to a PDF document. The NSA primer also explains how to check for other potentially sensitive information that might be hidden in a document’s metadata.

Of course, there may be cases where this approach isn’t feasible because a litigant doesn’t have the original word processing document or doesn’t want the document’s layout to be changed by the redaction process. Adobe Acrobat’s redaction tool has worked correctly when we’ve used it, and Adobe probably has the expertise to do it correctly. There may be other tools that work correctly, but we haven’t had an opportunity to experiment with them so we can’t say which ones they might be.

Regardless of the tool used, it’s a good idea to take the redacted document and double-check that the information was removed. An easy way to do this is to simply cut and paste the “redacted” content into another document. If the redaction succeeded, no text should be transferred. This method will catch most, but not all, redaction failures. A more rigorous check is to remove the redaction rectangles from the document and manually observe what’s underneath them. One of the scripts I’m releasing today, called remove_rectangles.pl, does just that. In its current form, it’s probably not user-friendly enough for non-programmers to use, but it would be relatively straightforward for someone (perhaps Adobe or the courts) to build a user-friendly version that ordinary users could use to verify that the document they just attempted to redact actually got redacted.
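
The cut-and-paste check can also be automated in a few lines. This sketch (again using PyMuPDF as an illustrative choice) dumps all extractable text from a supposedly redacted PDF and searches it for strings the filer meant to remove; like manual cut and paste, it catches most but not all failures, so the rectangle-removal check remains the more rigorous option.

    # Sketch: a basic post-redaction check, roughly automating "cut and paste".
    # `sensitive_terms` is whatever the filer intended to remove (hypothetical here).
    import fitz  # PyMuPDF

    def check_redaction(path, sensitive_terms):
        doc = fitz.open(path)
        extracted = " ".join(page.get_text() for page in doc)
        return [t for t in sensitive_terms if t in extracted]

    leaks = check_redaction("redacted_filing.pdf", ["123-45-6789", "Jane Doe"])
    if leaks:
        print("Redaction failed; still extractable:", leaks)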

One approach we don’t endorse is printing the document out, redacting it with a black marker, and then re-scanning it to PDF format. Although this may succeed in removing the sensitive information, we don’t recommend this approach because it effectively converts the document into a raster-based image, destroying useful information in the process. For example, it will no longer be possible to cut and paste (non-redacted) text from a document that has been redacted in this way.

Bad redactions are not a new problem, but they are taking on a new urgency as PACER documents become increasingly available on the web. Correct redaction is not difficult, but it does require both knowledge and care by those who are submitting the documents. The courts have several important roles they should play: educating attorneys about their redaction responsibilities, providing them with software tools that make it easy for them to comply, and monitoring submitted documents to verify that the rules are being followed.

This research was made possible with the financial support of Carl Malamud’s organization, Public.Resource.Org.

Don't love the cyber bomb, but don't ignore it either

Cybersecurity is overblown – or not

A recent report by Jerry Brito and Tate Watkins of George Mason University titled “Loving The Cyber Bomb? The Dangers Of Threat Inflation In Cybersecurity Policy” has gotten a bit of press. This is an important topic worthy of debate, but I believe their conclusions are incorrect. In this posting, I’ll summarize their report and explain why I think they’re wrong.

Brito & Watkins (henceforth B&W) argue that the cyber threat is exaggerated, and that it’s being driven by private industry anxious to feed at the public trough, in a manner similar to the creation of the military-industrial complex in the second half of the 20th century as an outgrowth of the Cold War.

The paper starts by describing how deliberate misinformation in the run-up to the Iraq war is an example of how public opinion can be manipulated by policy makers and private industry trying to sell a threat. My opinion of the Iraq war is not relevant to this discussion, but I believe they’re using it to create a strawman, which they then knock down.

Next, B&W use the CSIS Commission Report on Cybersecurity for the 44th Presidency and Richard Clarke’s “Cyber War” to argue that the threat of cyber conflict has been overblown. With regard to the former, they criticize the confusion of probes (port scans) with real attacks, and argue that probes are not evidence of an attack or breach but more akin to doorknob rattling. While that’s certainly true (and an analogy that’s been made for years), if your doorknob is rattled thousands of times a day it’s a strong indication that you’re living in a bad neighborhood! They then note that there’s little unclassified proof of real threats, and hence the call for regulation by CSIS (and others) is inappropriate. Unfortunately, quantitative proof is hard to come by, but there are enough incidents that there can be little doubt as to the severity of the threat. Requiring quantitative data before we move to protection would be akin to demanding an open and accurate assessment of the number of foreign spies and the damage they do before we fund the CIA! Instead, we rely on experts in spycraft to assess the threat, and help define appropriate defenses. In the same way, we should rely on cybersecurity experts to provide an assessment of the risks and appropriate actions. I certainly agree with both CSIS and B&W that overclassification of the threats works to our detriment – if the public is unable to see the threat, it becomes hard to justify spending to defend against it. I’ve personally seen this in the commercial software industry, where the inability to provide hard data about cyber threats to senior management results in that threat being discounted, with consequent risk to businesses. But again, the problems with overclassification do not mean the problem doesn’t exist.

Regarding Clarke’s book, there’s been plenty of criticism of both technical inaccuracies and the somewhat hysterical tone. Those notwithstanding, Clarke generally has a good understanding of the types of threats and the risks. B&W’s claim that the only verifiable attacks are DDOS is simply untrue – there have been verified attacks against infrastructure like water systems, although some of the claimed attacks are other types of failures that could have been cyber-related, but aren’t. As an example, while Clarke claims that the northeast power blackout of 2003 was cyber-related, there’s adequate evidence that it was not – but there’s also adequate evidence that such an accidental failure could be caused by a deliberate attack. Similarly, the NYSE “flash crash” was not caused by a cyber attack, but demonstrates the fragility of modern highly computerized systems, and shows that a cyber attack could cause similar symptoms. That which can happen by accident can also happen intentionally, if an adversary desires.

As for B&W’s analogy to the military industrial complex that President Eisenhower so famously feared, and the increasing influence of cyberpork, I must reluctantly agree. Large defense contractors have, in recent years, flocked to cyber as it has become trendy and large budgets have become attractive, frequently more concerned with revenue than with solving problems. However, the problems existed (and were being discussed) by researchers and practitioners long before the influx of government contractors. The fact that they’re trying to make money off the problem doesn’t mean the problem doesn’t exist.

The final section of the paper, covering regulatory issues, has some good points, but it is so poisoned by the assumptions in the earlier sections of the paper that it’s hard to take seriously.

To summarize, we should distinguish between the existence of the problem (which is real and growing) versus the desire of some government contractors to cash in – the fact that the latter is occurring does not deny the reality of the former.

Summary of W3C DNT Workshop Submissions

Last week, we hosted the W3C “Web Tracking and User Privacy” Workshop here at CITP (sponsored by Adobe, Yahoo!, Google, Mozilla and Microsoft). If you were not able to join us for this event, I hope to summarize some of the discussion embodied in the roughly 60 position papers submitted.

The workshop attracted a wide range of participants; the agenda included advocates, academics, government, start-ups and established industry players from various sectors. Despite the broad name of the workshop, the discussion centered around “Do Not Track” (DNT) technologies and policy, essentially ways of ensuring that people have control, to some degree, over web profiling and tracking.

Unfortunately, I’m going to have to assume that you are familiar with the various proposals before going much further, as the workshop position papers are necessarily short and assume familiarity. (If you are new to this area, the CDT’s Alissa Cooper has a brief blog post from this past March, “Digging in on ‘Do Not Track'”, that mentions many of the discussion points. Technically, much of the discussion involved the mechanisms of the Mayer, Narayanan and Stamm IETF Internet-Draft from March and the Microsoft W3C member submission from February.)
