January 22, 2025

New Research Result: Bubble Forms Not So Anonymous

Today, Joe Calandrino, Ed Felten and I are releasing a new result regarding the anonymity of fill-in-the-bubble forms. These forms, popular for their use with standardized tests, require respondents to select answer choices by filling in a corresponding bubble. Contradicting a widespread implicit assumption, we show that individuals create distinctive marks on these forms, allowing use of the marks as a biometric. Using a sample of 92 surveys, we show that an individual’s markings enable unique re-identification within the sample set more than half of the time. The potential impact of this work is as diverse as use of the forms themselves, ranging from cheating detection on standardized tests to identifying the individuals behind “anonymous” surveys or election ballots.

If you’ve taken a standardized test or voted in a recent election, you’ve likely used a bubble form. Filling in a bubble doesn’t provide much room for inadvertent variation. As a result, the marks on these forms superficially appear to be largely identical, and minor differences may look random and not replicable. Nevertheless, our work suggests that individuals may complete bubbles in a sufficiently distinctive and consistent manner to allow re-identification. Consider the following bubbles from two different individuals:

These individuals have visibly different stroke directions, suggesting a means of distinguishing between both individuals. While variation between bubbles may be limited, stroke direction and other subtle features permit differentiation between respondents. If we can learn an individual’s characteristic features, we may use those features to identify that individual’s forms in the future.

To test the limits of our analysis approach, we obtained a set of 92 surveys and extracted 20 bubbles from each of those surveys. We set aside 8 bubbles per survey to test our identification accuracy and trained our model on the remaining 12 bubbles per survey. Using image processing techniques, we identified the unique characteristics of each training bubble and trained a classifier to distinguish between the surveys’ respondents. We applied this classifier to the remaining test bubbles from a respondent. The classifier orders the candidate respondents based on the perceived likelihood that they created the test markings. We repeated this test for each of the 92 respondents, recording where the correct respondent fell in the classifier’s ordered list of candidate respondents.

If bubble marking patterns were completely random, a classifier could do no better than randomly guessing a test set’s creator, with an expected accuracy of 1/92 ? 1%. Our classifier achieves over 51% accuracy. The classifier is rarely far off: the correct answer falls in the classifier’s top three guesses 75% of the time (vs. 3% for random guessing) and its top ten guesses more than 92% of the time (vs. 11% for random guessing). We conducted a number of additional experiments exploring the information available from marked bubbles and potential uses of that information. See our paper for details.

Additional testing—particularly using forms completed at different times—is necessary to assess the real-world impact of this work. Nevertheless, the strength of these preliminary results suggests both positive and negative implications depending on the application. For standardized tests, the potential impact is largely positive. Imagine that a student takes a standardized test, performs poorly, and pays someone to repeat the test on his behalf. Comparing the bubble marks on both answer sheets could provide evidence of such cheating. A similar approach could detect third-party modification of certain answers on a single test.

The possible impact on elections using optical scan ballots is more mixed. One positive use is to detect ballot box stuffing—our methods could help identify whether someone replaced a subset of the legitimate ballots with a set of fraudulent ballots completed by herself. On the other hand, our approach could help an adversary with access to the physical ballots or scans of them to undermine ballot secrecy. Suppose an unscrupulous employer uses a bubble form employment application. That employer could test the markings against ballots from an employee’s jurisdiction to locate the employee’s ballot. This threat is more realistic in jurisdictions that release scans of ballots.

Appropriate mitigation of this issue is somewhat application specific. One option is to treat surveys and ballots as if they contain identifying information and avoid releasing them more widely than necessary. Alternatively, modifying the forms to mask marked bubbles can remove identifying information but, among other risks, may remove evidence of respondent intent. Any application demanding anonymity requires careful consideration of options for preventing creation or disclosure of identifying information. Election officials in particular should carefully examine trade-offs and mitigation techniques if releasing ballot scans.

This work provides another example in which implicit assumptions resulted in a failure to recognize a link between the output of a system (in this case, bubble forms or their scans) and potentially sensitive input (the choices made by individuals completing the forms). Joe discussed a similar link between recommendations and underlying user transactions two weeks ago. As technologies advance or new functionality is added to systems, we must explicitly re-evaluate these connections. The release of scanned forms combined with advances in image analysis raises the possibility that individuals may inadvertently tie themselves to their choices merely by how they complete bubbles. Identifying such connections is a critical first step in exploiting their positive uses and mitigating negative ones.

This work will be presented at the 2011 USENIX Security Symposium in August.

Tracking Your Every Move: iPhone Retains Extensive Location History

Today, Pete Warden and Alasdair Allan revealed that Apple’s iPhone maintains an apparently indefinite log of its location history. To show the data available, they produced and demoed an application called iPhone Tracker for plotting these locations on a map. The application allows you to replay your movements, displaying your precise location at any point in time when you had your phone. Their open-source application works with the GSM (AT&T) version of the iPhone, but I added changes to their code that allow it to work with the CDMA (Verizon) version of the phone as well.

When you sync your iPhone with your computer, iTunes automatically creates a complete backup of the phone to your machine. This backup contains any new content, contacts, and applications that were modified or downloaded since your last sync. Beginning with iOS 4, this backup also included is a SQLite database containing tables named ‘CellLocation’, ‘CdmaCellLocaton’ and ‘WifiLocation’. These correspond to the GSM, CDMA and WiFi variants of location information. Each of these tables contains latitude and longitude data along with timestamps. These tables also contain additional fields that appear largely unused on the CDMA iPhone that I used for testing — including altitude, speed, confidence, “HorizontalAccuracy,” and “VerticalAccuracy.”

Interestingly, the WifiLocation table contains the MAC address of each WiFi network node you have connected to, along with an estimated latitude/longitude. The WifiLocation table in our two-month old CDMA iPhone contains over 53,000 distinct MAC addresses, suggesting that this data is stored not just for networks your device connects to but for every network your phone was aware of (i.e. the network at the Starbucks you walked by — but didn’t connect to).

Location information persists across devices, including upgrades from the iPhone 3GS to iPhone 4, which appears to be a function of the migration process. It is important to note that you must have physical access to the synced machine (i.e. your laptop) in order to access the synced location logs. Malicious code running on the iPhone presumably could also access this data.

Not only was it unclear that the iPhone is storing this data, but the rationale behind storing it remains a mystery. To the best of my knowledge, Apple has not disclosed that this type or quantity of information is being stored. Although Apple does not appear to be currently using this information, we’re curious about the rationale for storing it. In theory, Apple could combine WiFi MAC addresses and GPS locations, creating a highly accurate geolocation service.

The exact implications for mobile security (along with forensics and law enforcement) will be important to watch. What is most surprising is that this granularity of information is being stored at such a large scale on such a mainstream device.