May 26, 2024

Are genomes "anonymous data"?

Recently, researchers showed that an unknown person’s genome (i.e., the genetic information stored in their DNA) can often be linked to their identity. The researchers made the link using the genome plus some publicly available information. Just as interesting as the result itself is the way that people talked about it. As an example, here’s the opening paragraph of Gina Kolata’s New York Times story:

The genetic data posted online seemed perfectly anonymous — strings of billions of DNA letters from more than 1,000 people. But all it took was some clever sleuthing on the Web for a genetics researcher to identify five people he randomly selected from the study group. Not only that, he found their entire families, even though the relatives had no part in the study — identifying nearly 50 people.

Why would a genome “seem[] perfectly anonymous”? The genome is almost certainly unique to one person. So at the very least, the genome is a pseudonym. But of course the genome is also correlated with all sorts of visible physical characteristics of the person. And police use DNA evidence (parts of a genome) to identify people all the time. That’s hardly anonymous.

So why then would the genome seem perfectly anonymous? That statement can only mean that although people knew the linkage was possible, they thought it would be difficult to link in practice–which turned out to be wrong.
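To make the idea of linking concrete, here is a toy sketch of how a pseudonymous record can be joined to a name using auxiliary public data. It is not the researchers' actual method. Every record, field name (`marker`, `birth_year`, `state`), and the `link` helper below is invented for illustration.

```python
# Toy linkage sketch. All records, field names, and values are hypothetical;
# this is not the method used in the study discussed above.
from collections import defaultdict

# "Anonymous" study records: no names, only a pseudonym plus attributes.
study_records = [
    {"pseudonym": "S-001", "marker": "A7", "birth_year": 1950, "state": "UT"},
    {"pseudonym": "S-002", "marker": "C3", "birth_year": 1962, "state": "CA"},
    {"pseudonym": "S-003", "marker": "A7", "birth_year": 1981, "state": "UT"},
]

# Publicly available auxiliary data that does carry names.
public_records = [
    {"name": "Alice Example", "marker": "A7", "birth_year": 1950, "state": "UT"},
    {"name": "Bob Example", "marker": "C3", "birth_year": 1962, "state": "CA"},
    {"name": "Carol Example", "marker": "C3", "birth_year": 1962, "state": "CA"},
    {"name": "Dan Example", "marker": "B9", "birth_year": 1981, "state": "UT"},
]

QUASI_IDENTIFIERS = ("marker", "birth_year", "state")


def link(study, public, keys=QUASI_IDENTIFIERS):
    """Return {pseudonym: name} for study records matching exactly one public record."""
    # Index the public data by the combination of quasi-identifier values.
    index = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person["name"])

    linked = {}
    for record in study:
        candidates = index.get(tuple(record[k] for k in keys), [])
        # Only a unique match counts as a re-identification.
        if len(candidates) == 1:
            linked[record["pseudonym"]] = candidates[0]
    return linked


matches = link(study_records, public_records)
print(matches)
```

In this made-up data, only S-001 is linked: S-002 matches two public records, and S-003 matches none. The point of the sketch is that no single attribute needs to be a name; a combination that happens to be unique does the work of one.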

In my experience, true experts tend to talk about these issues differently. The usual expert viewpoint is that if information is linkable in principle, then it will probably turn out to be linkable in practice. If you want to argue otherwise–that information that is linkable in principle cannot be linked in practice–then you are expected to give a technically sound reason for the gap, to identify a specific technical barrier that will stand in the way of every possible linkage method. “It seems really hard” or “I can’t think of a way” are not convincing arguments.

In the language of lawyers, in expert discourse there is a presumption that linkable-in-principle implies linkable-in-fact. This presumption can be rebutted by a technically sound argument to the contrary, but without a sound rebuttal, data is presumed to be linkable. The growing body of research on “re-identification” shows that the presumption of linkability is well justified.

Of course, there are good arguments for non-linkability in particular cases. The presumption is rebuttable, and not a hard-and-fast rule, because important exceptions do exist. The point is simply that experts put the burden of argument on those who would assert non-linkability.

If you’re familiar with privacy law, you know that the law often takes the opposite approach. To be blunt, people make unsubstantiated claims that data are “anonymous” or “not identifiable” all the time. Many of these claims are wrong.

How to reconcile privacy law and policy with the findings of 21st-century privacy research is a big question, one that I couldn’t hope to answer, or even really get started answering, in this post. It’s a topic to which I expect to return in future posts and papers.


  1. A recent direction to protect the privacy of human genomes has been taken by the security/cryptography community. I encourage readers to take a look at the position paper “Whole Genome Sequencing: Innovation Dream or Privacy Nightmare?”, as well as some papers presenting a few privacy-enhancing technologies in this space.