This morning I’m testifying at a hearing of the Privacy and Civil Liberties Oversight Board, on the topic of “Defining Privacy”. Here is the text of my oral testimony. (This is the text as prepared; there might be minor deviations when I deliver it.) [Update (Nov. 16): video stream of my panel is now available.]
Thank you for the opportunity to testify today.
My name is Ed Felten. I am the Robert E. Kahn Professor of Computer Science and Public Affairs at Princeton University. Today I will offer my perspective as a computer scientist on how changing data practices have affected how we think about privacy. I will talk about both commercial and government data practices because the two are closely connected.
We can think of today’s data practices in terms of a three-stage pipeline: first, collect data; second, merge data items; and third, analyze the data to infer facts about people.
The first stage collects information. In our daily lives, we disclose information directly to people and organizations. Even when we are not disclosing information explicitly, more and more of what we do, online and offline, is recorded. Online services often attach unique identifiers to us and our devices, and the records of what we do are tagged with those identifiers.
The second stage of the pipeline merges the data. Information might be collected in larger or smaller units. But if two data files can be determined to correspond to the same person—for example, because they both contain the same unique identifier—the two files can be merged. Merging can create an avalanche effect—merged files convey more precise knowledge about a user’s identity and unique behaviors, and this precision helps to enable further merging. One file might contain detailed information about behavior; another might precisely identify a person; merging the two will link behavior to identity.
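To make the mechanics concrete, here is a small illustrative sketch, using invented data rather than records from any real system, of how two files that share a unique identifier can be joined so that behavior is linked to identity:

```python
# Illustrative sketch with invented data: two files that share a unique
# identifier can be joined, tying behavioral records to an identity.

behavior_log = [
    {"tracker_id": "abc123", "site": "healthforum.example", "page": "/symptoms"},
    {"tracker_id": "abc123", "site": "shop.example", "page": "/skin-lotion"},
    {"tracker_id": "zzz999", "site": "news.example", "page": "/front"},
]

identity_file = [
    {"tracker_id": "abc123", "name": "Jane Doe", "email": "jane@example.com"},
]

# Index the identity records by the shared identifier, then join.
identities = {rec["tracker_id"]: rec for rec in identity_file}

merged = [
    {**event, **identities[event["tracker_id"]]}
    for event in behavior_log
    if event["tracker_id"] in identities
]

for record in merged:
    print(record)  # each behavioral event is now tied to a named person
```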
The third stage of the pipeline uses big data methods such as predictive analytics to infer facts about people. One famous example is when the retailer Target used purchases of products such as skin lotion to infer pregnancy. Today’s machine learning methods often enable sensitive information to be inferred from seemingly less sensitive data. Inferences also have an avalanche effect—each inference becomes another data point to be used in making further inferences.
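As a rough illustration of this kind of inference, and using synthetic data and hypothetical purchase features rather than Target's actual model, a standard classifier can learn to predict a sensitive attribute from seemingly innocuous signals:

```python
# Illustrative sketch with synthetic data (not Target's actual model): a
# standard classifier can learn to predict a sensitive attribute from
# seemingly innocuous purchase counts. Requires numpy and scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_shoppers = 1000

# Hypothetical label: 1 in 10 shoppers is pregnant.
pregnant = rng.random(n_shoppers) < 0.10

# Hypothetical features: counts of purchases of unscented lotion, supplements,
# and cotton balls, drawn so pregnant shoppers buy them a bit more often.
purchase_counts = rng.poisson(
    lam=np.where(pregnant[:, None], 3.0, 1.0), size=(n_shoppers, 3)
)

model = LogisticRegression().fit(purchase_counts, pregnant)

# A heavy buyer of these "innocuous" items gets a high predicted probability.
print(model.predict_proba([[6, 5, 4]])[0, 1])
```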
Predictive analytics are most effective at inferring an attribute when many positive and negative examples are available for training. For example, Target used many examples of both pregnant and non-pregnant women to build its predictive model. By contrast, a predictive model that tried to identify terrorists from everyday behavioral data could expect far less success, because there are very few examples of known terrorists in the U.S. population.
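A small worked example, with numbers chosen purely for illustration, shows why rarity matters so much:

```python
# A worked example with illustrative numbers (not figures from the testimony):
# when true cases are extremely rare, even an accurate model's alerts are
# overwhelmingly false positives.

population = 300_000_000        # roughly the U.S. population
true_cases = 3_000              # hypothetical number of people who should be flagged
true_positive_rate = 0.99       # the model catches 99% of real cases
false_positive_rate = 0.01      # and wrongly flags 1% of everyone else

correctly_flagged = true_cases * true_positive_rate
wrongly_flagged = (population - true_cases) * false_positive_rate

precision = correctly_flagged / (correctly_flagged + wrongly_flagged)
print(f"people flagged: {correctly_flagged + wrongly_flagged:,.0f}")
print(f"flagged people who are real cases: {precision:.2%}")
```

With these assumed numbers, roughly three million people are flagged, and only about one in a thousand of them is a real case, because the innocent vastly outnumber the guilty.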
With that technical background, let me discuss a few implications.
First, the consequences of collecting a data item can be difficult to predict. Even if an item, on its face, does not seem to convey identifying information, and even if its contents seem harmless in isolation, its collection could have significant downstream effects. We must account for the mosaic effect—in which isolated, seemingly unremarkable data items combine to paint a vivid, specific picture. One of the main lessons of recent technical scholarship on privacy is the power of the mosaic effect.
To understand what follows from collecting an item, we have to think about how that item can be merged with other available data, and how the merged data can in turn be used to infer information about people. We have to take into account the avalanche effects that can occur in both the merging and the inference stages. For example, the information that the holder of a certain loyalty card account number purchased skin lotion on a certain date might turn out to be the key fact that leads the retailer to an inference that a specific identifiable woman is pregnant. Similarly, phone call metadata, when collected in large volume, has been shown to enable predictions about social status, affiliation, employment, health, and personality traits.
The second implication is that data handling systems have gotten much more complicated, especially in the merging and analysis stages—that is, the stages that come after collection. The sheer complexity of these systems makes it very difficult to understand, predict and control their use. Even the people who build and run these systems often fail to understand fully how they work—and this leads to unpleasant surprises such as compliance failures or data breaches.
Complexity frustrates oversight and makes compliance more difficult, and it makes failure more likely: despite the best of intentions, organizations often find themselves out of compliance with their own policies and obligations. Policymaking should acknowledge that complex systems will often fail to perform as desired.
Complex rules also make compliance more difficult. It is sometimes argued that we should abandon controls on collection and focus only on regulating data use. Limits on use offer more flexibility and precision—in theory, and sometimes in practice. But collection limits have advantages too. For example, it is easier to comply with a rule that limits collection than one that allows collection and puts elaborate limits on post-collection use; and collection limits make oversight and enforcement easier. Limiting collection can also nudge agencies to develop innovative approaches that meet their analytic needs while collecting less.
The third implication is the synergy between commercial and government data practices. As an example, commercial entities put unique identifiers into most website accesses. A government eavesdropper collecting traffic can use these identifiers to link a user’s activities across different times and different online sites, and the eavesdropper can connect those activities to identifying information. Our research shows that, even if a user switches locations and devices, as most users do, an eavesdropper exploiting commercially placed identifiers can reconstruct 60–75% of what the user does online and can usually link that data to the user’s identity.
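To illustrate the mechanism, with invented traffic records rather than data from our study, an eavesdropper who observes third-party tracking identifiers can cluster requests across sites and devices, and a single identifying request deanonymizes the whole cluster:

```python
# Illustrative sketch with invented traffic (not data from our study): an
# eavesdropper who sees third-party tracking identifiers in web requests can
# group a user's browsing across sites, devices, and locations, and attach an
# identity if any single request reveals one.
from collections import defaultdict

intercepted = [
    {"tracker_id": "t-42", "site": "news.example", "device": "laptop", "ip": "1.2.3.4"},
    {"tracker_id": "t-42", "site": "clinic.example", "device": "phone", "ip": "5.6.7.8"},
    {"tracker_id": "t-42", "site": "webmail.example", "device": "laptop",
     "ip": "1.2.3.4", "logged_in_as": "jane@example.com"},
]

profiles = defaultdict(lambda: {"visits": [], "identity": None})
for request in intercepted:
    profile = profiles[request["tracker_id"]]
    profile["visits"].append((request["site"], request["device"], request["ip"]))
    if "logged_in_as" in request:                     # one identifying request
        profile["identity"] = request["logged_in_as"]  # names the whole cluster

for tracker, profile in profiles.items():
    print(tracker, profile["identity"], profile["visits"])
```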
Users can engage in technical self-help against commercial data collection—and this works, up to a point. However, the people most likely to use these self-help tools are intelligence targets, while ordinary users rarely do, so the commercial-government synergy is likely to sweep up more information about ordinary Americans than about intelligence targets.
My final point is that technology offers many options beyond the most obvious technological approach of collecting all of the data, aggregating it in a single large data center, and then running canned analysis scripts on it. Advanced technical methods exist that can support necessary types of inference and analysis while collecting less data and more aggressively minimizing or preprocessing the data that is collected. It is often possible to allow limited uses of a data set without turning over the entire data set, or to keep data sets separate while allowing their contents to be combined in controlled ways. For example, cryptographic methods allow two parties who have separate data sets to find people who appear in both data sets, without disclosing their data to each other. There is a large and growing literature on privacy-preserving data analysis methods.
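As one concrete illustration of this class of techniques, here is a minimal sketch of a Diffie-Hellman-style private set intersection. This is my own illustration of the general idea, not a specific protocol named in the testimony; it assumes honest-but-curious parties and omits practical details such as shuffling and hiding set sizes, so it is not production code:

```python
# A minimal educational sketch of one cryptographic approach to comparing two
# data sets without disclosing them: Diffie-Hellman-style private set
# intersection. Simplified for exposition; not production code.
import hashlib
import secrets

# The 1536-bit safe prime from RFC 3526; P = 2*Q + 1 with Q prime.
P = int(
    "FFFFFFFFFFFFFFFFC90FDAA22168C234C4C6628B80DC1CD129024E088A67CC74"
    "020BBEA63B139B22514A08798E3404DDEF9519B3CD3A431B302B0A6DF25F1437"
    "4FE1356D6D51C245E485B576625E7EC6F44C42E9A637ED6B0BFF5CB6F406B7ED"
    "EE386BFB5A899FA5AE9F24117C4B1FE649286651ECE45B3DC2007CB8A163BF05"
    "98DA48361C55D39A69163FA8FD24CF5F83655D23DCA3AD961C62F356208552BB"
    "9ED529077096966D670C354E4ABC9804F1746C08CA237327FFFFFFFFFFFFFFFF",
    16,
)
Q = (P - 1) // 2

def hash_to_group(item: str) -> int:
    """Map an identifier into the prime-order subgroup of Z_p* (by squaring)."""
    digest = int.from_bytes(hashlib.sha256(item.encode()).digest(), "big")
    return pow(digest, 2, P)

def blind(values, secret):
    """Raise each group element to a party's secret exponent."""
    return [pow(v, secret, P) for v in values]

# Each party holds its own identifiers and picks a private exponent.
party_a_ids = ["alice@example.com", "bob@example.com", "carol@example.com"]
party_b_ids = ["carol@example.com", "dave@example.com"]
secret_a = secrets.randbelow(Q - 1) + 1
secret_b = secrets.randbelow(Q - 1) + 1

# Round 1: each party sends only hashed, blinded identifiers to the other.
a_blinded = blind([hash_to_group(x) for x in party_a_ids], secret_a)  # H(x)^a
b_blinded = blind([hash_to_group(y) for y in party_b_ids], secret_b)  # H(y)^b

# Round 2: B raises A's values to its secret and returns them; A does the
# same with B's values. Exponentiation commutes, so shared identifiers yield
# identical double-blinded values, and nothing else is revealed.
a_double = blind(a_blinded, secret_b)       # computed by B: H(x)^(ab)
b_double = set(blind(b_blinded, secret_a))  # computed by A: H(y)^(ba)

shared = [item for item, val in zip(party_a_ids, a_double) if val in b_double]
print(shared)  # ['carol@example.com']
```

Real deployments add protections this sketch omits, but the core idea is the same: two data sets can be compared in a controlled way without being pooled.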
Determining whether collection of particular data is truly necessary, whether data retention is truly needed, and what can be inferred from a particular analysis—these involve deeply technical questions. An oversight body should engage with these questions, using independent technical experts as needed.
In the same way that the Board asks probing legal and policy questions of the agencies you oversee, I hope you will build a capacity to ask equally probing technical questions. Legal and policy oversight are most effective when combined with sophisticated and accurate technical analysis. Many independent technical experts and groups are able and willing to help you build this capacity.
Thank you for your time. I look forward to your questions.