August 20, 2018

What should we do about re-identification? A precautionary approach to big data privacy

Computer science research on re-identification has repeatedly demonstrated that sensitive information can be inferred even from de-identified data in a wide variety of domains. This has posed a vexing problem for practitioners and policy makers. If the absence of “personally identifying information” cannot be relied on for privacy protection, what are the alternatives? Joanna Huey, Ed Felten, and I tackle this question in a new paper, “A Precautionary Approach to Big Data Privacy”. Joanna presented the paper at the Computers, Privacy & Data Protection conference earlier this year.

Here are some of the key recommendations we make.

  1. When data is released after applying current ad hoc de-identification methods, the privacy risks of re-identification are not just unknown but unknowable. This is in contrast to provable privacy techniques like differential privacy. We therefore call for a weak version of the precautionary approach in which the burden of proof falls on data releasers. We recommend that they be incentivized not to default to full, public releases of datasets using ad hoc de-identification methods.
  2. Policy makers have several levers to influence data releases: research funding choices that incentivize collaboration between privacy theorists and practitioners, mandated transparency of re-identification risks, and innovation procurement—using government demand to drive the development and diffusion of advanced privacy technologies.
  3. Meanwhile, practitioners and policy makers have numerous pragmatic options for narrower releases of data. We present advice for six of the most common use cases for sharing data. Our thesis is that the problem of “what to do about re-identification” unravels once we stop looking for a one-size-fits-all solution, and in each of the six cases we propose a solution that is tailored, yet principled.
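
To make the contrast in the first recommendation concrete, here is a minimal sketch of what a “provable” technique looks like in practice: the standard Laplace mechanism for a counting query under differential privacy. It is an illustration only and is not taken from our paper. The function names, the toy data, and the choice of epsilon are all hypothetical, and a real deployment would need more care (for example, naive floating-point noise sampling is known to be a weak point).

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two i.i.d. exponential variables with mean `scale`
    is Laplace-distributed with that scale. expovariate takes a rate,
    hence 1.0 / scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`.

    A counting query has sensitivity 1: adding or removing one person's
    record changes the true count by at most 1. Adding Laplace noise with
    scale 1/epsilon therefore gives epsilon-differential privacy for this
    single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical toy data: ages of seven people in a dataset.
    ages = [34, 51, 29, 62, 47, 38, 55]
    noisy = private_count(ages, lambda age: age >= 50, epsilon=0.5)
    print(f"Noisy count of people aged 50 or over: {noisy:.2f}")
```

The point of the sketch is that the privacy guarantee follows from the noise scale and epsilon, whatever auxiliary data an adversary holds. Ad hoc de-identification offers no analogous parameter, which is why its risks cannot be bounded in advance.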

Our work draws on a piece that Ed Felten and I wrote last year, titled “No silver bullet: De-identification still doesn’t work”. Building on the arguments we made there, we point out two nuances of re-identification that are often missed in the policy discussion. First, we explain why privacy risks exist even if re-identification doesn’t succeed in the stereotypical sense. Second, we draw a distinction between “broad” and “targeted” attacks; in our experience, conflating the two is a frequent source of confusion.

As datasets get larger, more connected, and capture ever more intimate details of our lives, getting data privacy right continues to gain in importance. While de-identification is often useful as an additional privacy measure, by itself it offers insufficient and unreliable privacy protection. It is crucial for data custodians, policy makers, and regulators to embrace a more technologically sound approach to big data privacy, and we hope that our recommendations will contribute to this shift.