Before the holidays, Yahoo got a flurry of good press for the announcement that it would (as the LA Times puts it) “purge user data after 90 days.” My eagle-eyed friend Julian Sanchez noticed that the “purge” was less complete than privacy advocates might have hoped. It turns out that Yahoo won’t be deleting the contents of its search logs. Rather, it will merely be zeroing out the last 8 bits of users’ IP addresses. Julian is not impressed:
dropping the last byte of an IP address just means you’ve narrowed your search space down to (at most) 256 possibilities rather than a unique machine. By that standard, this post is anonymous, because I guarantee there are more than 255 other guys out there with the name “Julian Sanchez.”
The first three bytes, in the majority of cases, are still going to be enough to give you a service provider and a rough location. Assuming every address in the range is in use, dropping the least-significant byte just obscures which of the 256 users at that particular provider is behind each query. In practice, though, the search space is going to be smaller than that, because people are creatures of habit: You’re really working with the pool of users in that range who perform searches on Yahoo. If your not-yet-anonymized logs show, say, 45 IP addresses that match those first three bytes making routine searches on Yahoo (17.6% of the search market x 256 = 45) you can probably safely assume that an “anonymized” IP with the same three leading bytes is one of those 45. If different users tend to exhibit different usage patterns in search time, clustering of queries, expertise with Boolean operators, or preferred natural language, you can narrow it down further.
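To make the arithmetic in the quoted passage concrete, here is a minimal Python sketch. It assumes IPv4 addresses and that "zeroing out the last 8 bits" means clearing the low octet; the function names are hypothetical illustrations, not anything Yahoo has described, and the 17.6% market share is simply the figure from the quote.

```python
import ipaddress


def zero_last_octet(ip: str) -> str:
    """Clear the low 8 bits of an IPv4 address (hypothetical helper)."""
    addr = ipaddress.IPv4Address(ip)
    masked = int(addr) & 0xFFFFFF00  # keep the top 24 bits, zero the rest
    return str(ipaddress.IPv4Address(masked))


def candidate_addresses(masked_ip: str) -> int:
    """Count the addresses that collapse to the same masked value."""
    return ipaddress.IPv4Network(f"{masked_ip}/24").num_addresses


def expected_pool(market_share: float, block_size: int = 256) -> int:
    """Rough size of the pool of Yahoo searchers in one block,
    assuming every address is in use and users match the market share."""
    return round(market_share * block_size)


masked = zero_last_octet("192.0.2.57")   # documentation-range example address
pool = expected_pool(0.176)              # share figure taken from the quote
```

Under these assumptions, every address in a block of 256 maps to the same masked value, and the realistic pool of candidates is closer to a few dozen than to 256.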
I think this isn’t quite fair to Yahoo. Dropping the last eight bits of the IP address certainly doesn’t protect privacy as much as deleting log entries entirely, but it’s far from useless. To start with, there’s often not a one-to-one correspondence between IP addresses and Internet users. Often a single user has multiple IPs. For example, when I connect to the Princeton wireless network, I’m dynamically assigned an IP address that may not be the same as the IP address I used the last time I logged on. I also access the web from my iPhone and from hotels and coffee shops when I travel. Conversely, several users on a given network may be sharing a single IP address using a technology called network address translation. So even if you know the IP address of the user who performed a particular search, that may simply tell you that the user works for a particular company or connected from a particular coffee shop. Hence, tracking a particular user’s online activities is already something of a challenge, and it becomes that much harder if several dozen users’ online activities are scrambled together in Yahoo!’s logs.
Now, whether this is “enough” privacy depends a lot on what kind of privacy problem you’re worried about. It seems to me that there are three broad categories of privacy concerns:
- Privacy violations by Yahoo or its partners: Some people are worried that Yahoo itself is tracking their online activities, building an online profile about them, and selling this information to third parties. Obviously, Yahoo’s new policy will do little to allay such concerns. Indeed, as David Kravets points out, Yahoo will have already squeezed all the personal information it can out of those logs before it scours them. If you don’t trust Yahoo or its business partners, this move isn’t going to make you feel very much safer.
- Data breaches: A second concern involves cases where customer data falls into the wrong hands due to a security breach. In this case, it’s not clear that search engine logs are especially useful to data thieves in the first place. Data thieves are typically looking for information such as credit card and Social Security numbers that can make them a quick buck. People rarely type such information into search boxes. Some searches may be embarrassing to users, but they probably won’t be so embarrassing as to enable blackmail or extortion. So search logs are not likely to be that useful to criminals, whether or not they are “anonymized.”
- Court-ordered information release: This is the case where the new policy could have the biggest effect. Consider, for example, a case where the police seek a suspect’s search history. The new policy will help protect privacy in three ways: first, if Yahoo! can’t cleanly filter search logs by IP address, judges may be more reluctant to order the disclosure of several dozen users’ search histories just to give police information about a single suspect. Second, scrubbing the last byte of the IP address will make searching through the data much more difficult. Finally, the resulting data will be less useful in a court of law, because prosecutors will need to convince a jury that a given search was performed by the defendant rather than another user who happened to have a similar IP address. At the margin, then, Yahoo’s new policy seems likely to significantly enhance user privacy against government information requests. The same principle applies in the case of civil suits: the recording and movie industries, for example, will have a harder time using Yahoo!’s search logs as evidence that a user was engaged in illegal file-sharing.
So based on the small amount of information Yahoo has made available, it seems that the new policy is a real, if small, improvement in users’ privacy. However, it’s hard to draw any definite conclusions without more specific information about what Yahoo! is saving, because anonymizing data is a lot harder than people think. AOL learned this the hard way in 2006 when “anonymized” search logs were released to researchers. People quickly noticed that you could figure out who various users were by looking at the contents of their searches. The data wasn’t so anonymous after all.
One reason AOL’s data wasn’t so anonymous is that AOL had “anonymized” the data set by assigning each user a unique ID. That meant people could look at all searches made by a single user and find searches that gave clues to the user’s identity. Had AOL instead stripped off the user information without replacing it, it would have been much harder to de-anonymize the data because there would be no way to match up different searches by the same user. If Yahoo’s logs include information linking each user’s various searches together, then even deleting the IP address entirely probably won’t be enough to safeguard user privacy. On the other hand, if the only user-identifying information is the IP address, then stripping off the low byte of the IP address is a real, if modest, privacy enhancement.
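The difference between replacing user information with a unique ID and stripping it outright can be sketched in a few lines of Python. The log entries, user names, and function names below are invented for illustration; they are not taken from AOL's or Yahoo's actual data or formats.

```python
from collections import defaultdict
from itertools import count

# Toy log of (user, query) pairs; entirely made up for this example.
log = [
    ("alice", "weather in springfield"),
    ("bob", "cheap flights"),
    ("alice", "dentists near elm street"),
]


def pseudonymize(entries):
    """Replace each user with a stable numeric ID, so one user's
    searches remain linked to each other."""
    ids = {}
    next_id = count(1)
    out = []
    for user, query in entries:
        if user not in ids:
            ids[user] = next(next_id)
        out.append((ids[user], query))
    return out


def strip_users(entries):
    """Drop the user field entirely, leaving unlinked queries."""
    return [query for _, query in entries]


def group_by_id(entries):
    """What an analyst can rebuild from pseudonymized data:
    each ID's full list of queries."""
    groups = defaultdict(list)
    for pid, query in entries:
        groups[pid].append(query)
    return dict(groups)


profiles = group_by_id(pseudonymize(log))
unlinked = strip_users(log)
```

With stable IDs, the two queries from the same person land in one profile, and together they may hint at where that person lives. With the user field stripped, each query stands alone and there is nothing to group on.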
Moves like this are a small step in the right direction. I’m not as concerned about data breaches in the context of search, and data handed over after due process is legitimate. But the biggest breach of all is the extensive commercial use of personal data without the opt-in consent of users and the ability to review, correct, and/or delete our commercial dossiers. Search engine companies (and to a larger extent, ISPs) have the ability to correlate search logs with IP addresses, cookies, Flash local shared objects, ad tracking server data, and logged-in site activities. Further, they can potentially combine this information with other purchased or partner data to build extremely extensive dossiers on citizens. This ability, coupled with the government’s ability to either copy the raw data or purchase commercial data and integrate it with government databases, is a significant invasion of privacy and other liberties.