Today’s New York Times has an interesting article by Katie Hafner on AOL’s now-infamous release of customers’ search data.
AOL’s goal in releasing the data was to help researchers by giving them realistic data to study. Today’s technologies, such as search engines, have generated huge volumes of information about what people want online and why. But most of this data is locked up in the data centers of companies like AOL, Google, and eBay, where researchers can’t use it. So researchers have been making do with a few old datasets. The lack of good data is certainly holding back progress in this important area. AOL wanted to help out by giving researchers a better dataset to work with.
Somebody at AOL apparently thought they had “anonymized” the data by replacing the usernames with meaningless numbers. That was a terrible misjudgement – if there is one thing we have learned from the AOL data, it is that people reveal a lot about themselves in their search queries. Reporters have identified at least two of the affected AOL users by name, and finding and publishing embarrassing search sequences has become a popular sport.
The article quotes some prominent researchers, including Jon Kleinberg, saying they’ll refuse to work with this data on ethical grounds. I don’t quite buy that there is an ethical duty to avoid research uses of the data. If I had a valid research use for it, I’m pretty sure I could develop my own guidelines for using it without exacerbating the privacy problem. If I had had something to do with inducing the ill-fated release of the data, I might have an obligation to avoid profiting from my participation in the release. But if the data is out there due to no fault of mine, and the abuses that occur are no fault of mine, why shouldn’t I be able to use the data responsibly, for the public good?
Researchers know that this incident will make companies even more reluctant to release data, even after anonymizing it. If you’re a search-behavior expert, this AOL data may be the last useful data you see for a long time – which is all the more reason to use it.
Most of all, the AOL search data incident reminds us of the complexity of identity and anonymity online. It should have been obvious that removing usernames wasn’t enough to anonymize the data. But this is actually a common kind of mistake – simplistic distinctions between “personally identifiable information” and other information pervade the policy discussion about privacy. The same error is common in debates about big government data mining programs – it’s not as easy as you might think to enable data analysis without also compromising privacy.
In principle, it might have been possible to transform the search data further to make it safe for release. In practice we’re nowhere near understanding how to usefully depersonalize this kind of data. That’s an important research problem in itself, which needs its own datasets to work on. If only somebody had released a huge mass of poorly depersonalized data …
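To make that concrete, here is a minimal sketch – the numeric IDs and queries below are invented for illustration, not taken from the released files – of why swapping usernames for numbers isn't anonymization: group the log by the number and you get each person's entire search history back, and the history itself is often identifying.

import java.util.*;

// Minimal sketch: even with usernames replaced by numeric IDs, grouping the
// log by that ID reassembles each person's full query history. The IDs and
// queries here are invented for illustration only.
public class PseudonymDemo {
    public static void main(String[] args) {
        String[][] log = {
            {"711391", "numb fingers"},
            {"711391", "landscapers in my town"},      // hypothetical, self-locating query
            {"711391", "homes sold in my subdivision"},
            {"35428",  "best pizza near campus"},
        };

        Map<String, List<String>> byUser = new TreeMap<>();
        for (String[] row : log) {
            byUser.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }

        // Each "anonymous" ID now carries a profile that may identify its owner.
        byUser.forEach((id, queries) ->
            System.out.println("user " + id + " -> " + queries));
    }
}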
We are living in a data minefield
[S]houldn’t we also have a debate about the ownership of those tracks?
Dennis,
Speaking generally, and normatively, short words and phrases, such as those used in a typical search-engine query, should not be copyrightable. Their selection, coordination or arrangement (see 17 USC § 101) might be properly subject to copyright, but only to the extent that “the resulting work as a whole constitutes an original work of authorship.” To the extent that the selection, coordination or arrangement of short words and phrases is essentially random, or merely ordered by sequence of time, the work as a whole should not be copyrightable.
“Suppose for a moment we grant that consent is not a problem.
Scott,
No. Let’s not.”
Clearly, my point is that even without considering consent we can rule out the use of this data.
After posting that comment, I had a feeling someone would read that sentence this way, but I don’t mean that consent isn’t a problem. I am just separating consent from the other factors, to examine them individually.
The real issue here isn’t the release of the data; it is the collection of the data.
If it is ethically wrong for AOL to have released it, and ethically wrong for researchers to work on this data, then it was also ethically wrong for AOL to collect and hold this data, and ethically wrong for AOL’s internal researchers and engineers to work on the data.
This whole AOL situation has reminded me that, as we all leave permanent tracks while traversing the Web, others are here to make use of our tracks – for good or for ill.
Some benefit from the buying and selling of information about the tracks that we leave. If we have a debate about the “privacy” of the tracks that we leave (such as our search queries) shouldn’t we also have a debate about the ownership of those tracks?
And by “ownership” I am explicitly referring to the ability to exclude some from use of the tracks, and to the ability to profit from the use of the tracks if I desire.
Suppose for a moment we grant that consent is not a problem.
Scott,
No. Let’s not.
No matter what language is used in the AOL agreement, people have a right to rely on Time Warner’s well-publicized representations regarding the meaning of that language. And, as CNET’s Dawn Kawamoto reported on August 7, 2006, under the headline, “AOL apologizes for release of user search data”, Time Warner’s position is:
Given this public representation by Time Warner about the AOL service, no AOL user has consented to use of their search data.
“I think the only real privacy-preserving solution is to keep the actual dataset on a restricted server inaccessible to the researchers or adversaries; and they can submit programs or queries to compute whatever results they are looking for.”

“That really wouldn’t help researchers if the search programs and query tools haven’t even been written or defined yet. Since we are still in the early stages of this kind of research, no one really knows what the research programs or query tools should look like or how they should work. So I’m not sure that your solution would actually be workable at this stage. Not that I can come up with a better solution, but I thought I would point this out.”
The solution to that is obvious: use dummy data in the same format to develop research tools that can later be used on the real data.
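As a sketch of that dummy-data idea, the snippet below generates synthetic rows in a tab-separated layout with AnonID, Query, QueryTime, ItemRank, and ClickURL columns; that field order is an assumption on my part, so check it against whatever the real files actually use before building tools around it.

import java.io.PrintWriter;
import java.time.LocalDateTime;
import java.util.Random;

// Rough sketch of the dummy-data idea: generate synthetic rows in a
// tab-separated shape (AnonID, Query, QueryTime, ItemRank, ClickURL --
// assumed layout, check against the real schema), so analysis tools can be
// written and debugged without ever touching the real queries.
public class DummyLog {
    private static final String[] WORDS = {
        "weather", "movie", "recipe", "flight", "news", "hotel", "java", "privacy"
    };

    public static void main(String[] args) throws Exception {
        Random rnd = new Random(42);
        try (PrintWriter out = new PrintWriter("dummy-log.tsv")) {
            out.println("AnonID\tQuery\tQueryTime\tItemRank\tClickURL");
            for (int i = 0; i < 10_000; i++) {
                int user = 1000 + rnd.nextInt(500);                 // fake numeric IDs
                String query = WORDS[rnd.nextInt(WORDS.length)] + " "
                             + WORDS[rnd.nextInt(WORDS.length)];
                LocalDateTime t = LocalDateTime.of(2006, 3, 1, 0, 0)
                                               .plusMinutes(rnd.nextInt(60 * 24 * 60));
                int rank = 1 + rnd.nextInt(10);
                out.println(user + "\t" + query + "\t" + t + "\t"
                          + rank + "\thttp://example.com/" + rank);
            }
        }
    }
}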
Okay, I just asked our own Human Subjects Research Review Committee.
The person on the phone identified the major ethical factors in having the data as (a) risk to subjects and (b) personally identifying information, weighed against (c) how widely available the data already is.
Of course there is (d) consent, but this is a trickier matter because people who use AOL are technically consenting to all sorts of use of their data, although not overtly. Suppose for a moment we grant that consent is not a problem.
Risk can be high, if you can determine and publicize the identity of people who perform certain searches, e.g. someone searching for information about getting an abortion. Identifiability is pretty obvious.
Does it matter that the data is already all over the place? The HSR office did say that it helps if info is already publicly available—but I suspect that availability due to a leak is a different matter ethically. I guess this also affects risk—what is the risk to subjects of being the nth person with a copy?
In the end the HSR person said it would be bad on the basis of identifying information. The bottom line was that having the data might be OK depending on circumstances, but publishing any research based on it would technically require HSR approval, which is not likely. I never thought about this before: Human Subjects Review is not just a factor in collecting data, but also in publishing results that use it.
I also asked in general: if some other institution collects human subjects data unethically, what is University policy on using it? I got the same basic answer: “boo to that” (I paraphrase).
Now, the HSR office is probably thinking defensively. Asking them might be like asking a legal department. I’d also be interested in what answers other folk get from their own HSR committees.
If you were to submit a proposal to your university IRB (ethics review board for experiments) you would find that use of the data, as is, would be highly restricted or disallowed because it provides means to identify the participants without their informed consent.
Rules for the use of personally identifiable information (PII) in experiments and studies are fairly well established in biological and sociological research, and defined in international conventions and Federal law. Few computer scientists seem to be aware of them … but they should be.
Hi,
Research involving people usually does require a Human Subjects review, and it is easy to forget how universally this applies. For example, if you have a project to write educational software for your circuits class, any assessment of how well it works will probably require Human Subjects review.
There is a parallel to AOL here: we tend to think that Human Subjects review is for obvious experiments on people, like psychology experiments; thus we might break the rules because we do not realize they apply to our own area. Likewise, the people who divulged the AOL data probably knew that there was a privacy policy somewhere regarding the data, but probably did not think that it applied to R&D.
But I do not know whether Human Subjects rules apply to data that has already been collected elsewhere. The picture would be clearer if the data from elsewhere had been collected with proper consent.
Actually, let me ask.
Just curious – how many of you who say you could “find a way” to use the AOL dataset for research have to pass a formal Human Subjects review? Although I’m not an academic, I’ve been around academia enough to know that the lack of consent seems to preclude any research use of the data by accredited researchers.
I think this is an abomination, guys.
“That really wouldn’t help researchers if the search programs and query tools haven’t even been written or defined yet.”
Search programs and query tools? We don’t need any of that.
Just tell me the file format of the data. I write a Java program that processes the data to determine what I want to know. You run it on an isolated machine, where the search data is accessible on a read-only volume.
Or, if the data is stored in a relational database, I submit SQL queries instead of a Java program.
Either way, we agree on some limit on the output size that allows me to get the summary data I need, but impedes the wholesale harvesting of private data. This doesn’t guarantee security against a privacy breach, but at least it makes for a thin straw.
We don’t need to wait for “research programs” or “query tools” before this can be done. This is something we could set up with a typical Unix installation out of the box.
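Here is roughly what such a submitted program might look like – a sketch only, assuming a tab-separated log with the query text in the second column and a file path that the operator of the isolated machine would supply. It computes nothing but an aggregate histogram and caps its own output.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.*;

// Sketch of the submit-a-program idea: runs on the isolated machine, reads
// the tab-separated log from a read-only volume, and emits only aggregate
// counts, refusing to print more than a fixed number of rows.
// The file path and field positions are assumptions, not the real layout.
public class QueryLengthHistogram {
    private static final int MAX_OUTPUT_ROWS = 50;       // the agreed output cap

    public static void main(String[] args) throws Exception {
        Map<Integer, Long> histogram = new TreeMap<>();  // words per query -> count
        try (BufferedReader in = new BufferedReader(new FileReader("/data/aol/log.tsv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length < 2) continue;         // skip malformed rows
                int words = fields[1].trim().split("\\s+").length;
                histogram.merge(words, 1L, Long::sum);
            }
        }
        int printed = 0;
        for (Map.Entry<Integer, Long> e : histogram.entrySet()) {
            if (printed++ >= MAX_OUTPUT_ROWS) break;     // enforce the "thin straw"
            System.out.println(e.getKey() + " words: " + e.getValue() + " queries");
        }
    }
}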
I have a copy of the data that I pulled off BitTorrent two weeks after it hit the news, evil grin.
“I think the only real privacy-preserving solution is to keep the actual dataset on a restricted server inaccessible to the researchers or adversaries; and they can submit programs or queries to compute whatever results they are looking for.”
That really wouldn’t help researchers if the search programs and query tools haven’t even been written or defined yet. Since we are still in the early stages of this kind of research, no one really knows what the research programs or query tools should look like or how they should work. So I’m not sure that your solution would actually be workable at this stage. Not that I can come up with a better solution, but I thought I would point this out.
I think the only real privacy-preserving solution is to keep the actual dataset on a restricted server inaccessible to the researchers or adversaries; and they can submit programs or queries to compute whatever results they are looking for.
This makes the researcher’s job a bit harder, because sometimes we don’t know exactly what we’re looking for. And it doesn’t guarantee security either, because with the right set of queries I could pull out a lot of the raw data. Some guidelines for allowable output would be needed to at least make privacy leakage a slow and inconvenient process.
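One candidate guideline, sketched below with an arbitrary threshold: a result row leaves the restricted server only if it is supported by some minimum number of distinct users, so individual histories can’t be reconstructed one narrow query at a time. The threshold and the shape of the results are assumptions chosen for illustration.

import java.util.Map;
import java.util.TreeMap;

// One possible "allowable output" rule, sketched: never release a result row
// supported by fewer than K distinct users. K and the result format are
// illustrative assumptions, not a vetted privacy standard.
public class OutputFilter {
    private static final int K = 25;   // minimum distinct users per released row

    // results: some label (e.g. a query category) -> number of distinct users
    public static Map<String, Long> releasable(Map<String, Long> results) {
        Map<String, Long> out = new TreeMap<>();
        results.forEach((label, users) -> {
            if (users >= K) out.put(label, users);   // small groups are suppressed
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> results = new TreeMap<>();
        results.put("travel", 930L);
        results.put("rare-disease-name", 3L);        // would be suppressed
        System.out.println(releasable(results));
    }
}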
Perhaps something akin to the Nuremberg Code needs to be developed for privacy?
Yup. See Eszter Hargittai’s post on Crooked Timber from two weeks ago.