
My Testimony on Behavioral Advertising

I’m testifying this morning at 10:00 AM (Eastern) at a Congressional hearing on “Behavioral Advertising: Industry Practices and Consumers’ Expectations”. It’s a joint hearing of two subcommittees of the House Committee on Energy and Commerce: the Subcommittee on Commerce, Trade, and Consumer Protection; and the Subcommittee on Communications, Technology, and the Internet.

Witnesses at the hearing are:

  • Jeffrey Chester, Executive Director, Center for Digital Democracy
  • Scott Cleland, President, Precursor LLC
  • Charles D. Curran, Executive Director, Network Advertising Initiative
  • Edward W. Felten, Professor of Computer Science and Public Affairs, Princeton University
  • Christopher M. Kelly, Chief Privacy Officer, Facebook
  • Anne Toth, Vice President of Policy, Head of Privacy, Yahoo! Inc.
  • Nicole Wong, Deputy General Counsel, Google Inc.

I submitted written testimony to the committee.

Look for a live webcast on the committee’s site.

On China's new, mandatory censorship software

The New York Times reports that China will start requiring censorship software on PCs. One interesting quote stands out:

Zhang Chenming, general manager of Jinhui Computer System Engineering, a company that helped create Green Dam, said worries that the software could be used to censor a broad range of content or monitor Internet use were overblown. He insisted that the software, which neutralizes programs designed to override China’s so-called Great Firewall, could simply be deleted or temporarily turned off by the user. “A parent can still use this computer to go to porn,” he said.

In this post, I’d like to consider the different capabilities that software like this could give to the Chinese authorities, without getting too much into their motives.

First, and most obviously, this software allows the authorities to filter web sites and network services, whether they originate inside or outside of the Great Firewall. By operating directly on a client machine, this filter can be aware of the operations of Tor, VPNs, and other firewall-evading software, allowing connections to a given target machine to be blocked regardless of how the client tries to get there. (You can’t accomplish “surgical” Tor and VPN filtering if you’re only operating inside the network. You need to be on the end host to see where the connection is ultimately going.)

Software like this can do far more, since it can presumably be updated remotely to support any feature desired by the government authorities. This could be the ultimate “Big Brother Inside” feature. Not only can the authorities observe behavior or scan files within one given computer, but every such computer now becomes a launching point for investigating other machines reachable over a local area network. If one such machine were connected, for example, to a private home network behind a security firewall, the government software could still scan every other computer on the same private network, log every packet, and so forth. Would you be willing to give your friends the password to log into your private wireless network, knowing their machine might be running this software?

Perhaps less ominously, software like this could also be used to force users to install security patches, to uninstall zombie/botnet software, and to perform other sorts of remote systems administration. I can only imagine the difficulty of trying to run the Central Government Bureau of National Systems Administration (would they have a phone number you could call to complain when your computer isn’t working, and could they fix it remotely?), but the technological base is now there.

Of course, anybody who owns their own computer will be able to circumvent this software. If you control your machine, you can control what’s running on it. Maybe you can pretend to be running the software, maybe not. That would turn into a technological arms race which the authorities would ultimately fail to win, though they might succeed in creating enough fear, uncertainty, and doubt to deter would-be circumventors.

This software will also have a notable impact on Internet cafes, schools, and other sorts of “public” computing resources, which are exactly the sorts of places people might go when they want to hide their identity, and where the authorities could conduct physical audits to check for compliance.

Big Brother is watching.

Fingerprinting Blank Paper Using Commodity Scanners

Today Will Clarkson, Tim Weyrich, Adam Finkelstein, Nadia Heninger, Alex Halderman and I released a paper, Fingerprinting Blank Paper Using Commodity Scanners. The paper will appear in the Proceedings of the IEEE Symposium on Security and Privacy, in May 2009.

Here’s the paper’s abstract:

This paper presents a novel technique for authenticating physical documents based on random, naturally occurring imperfections in paper texture. We introduce a new method for measuring the three-dimensional surface of a page using only a commodity scanner and without modifying the document in any way. From this physical feature, we generate a concise fingerprint that uniquely identifies the document. Our technique is secure against counterfeiting and robust to harsh handling; it can be used even before any content is printed on a page. It has a wide range of applications, including detecting forged currency and tickets, authenticating passports, and halting counterfeit goods. Document identification could also be applied maliciously to de-anonymize printed surveys and to compromise the secrecy of paper ballots.

Viewed under a microscope, an ordinary piece of paper looks like this:

The microscope clearly shows individual wood fibers, laid down in a pattern that is unique to this piece of paper.

If you scan a piece of paper on an ordinary desktop scanner, it just looks white. But pick a small area of the paper, digitally enhance the contrast and expand the image, and you see something like this:

The light and dark areas you see are due to two factors: inherent color variation in the surface, and partial shadows cast by fibers in the paper surface. If you rotate the paper and scan again, the inherent color at each point will be the same, but the shadows will be different because the scanner’s light source will strike the paper from a different angle. These differences allow us to map out the tiny hills and valleys on the surface of the paper.
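This is essentially a photometric-stereo calculation: each scan lights the page from a different direction, and the intensity differences between registered scans let you solve for the local surface orientation. Here is a rough sketch of that idea (the four-scan setup, the light directions, and all array names are invented for illustration; this is not the actual pipeline from our paper):

```python
import numpy as np

def estimate_surface(scans, light_dirs):
    """Classic photometric-stereo least squares: intensity = albedo * (n . l).

    `scans` is a (k, H, W) stack of registered grayscale scans of the same
    patch; `light_dirs` is a (k, 3) array of unit vectors giving the assumed
    direction of the scanner's light in each scan. A sketch only.
    """
    k, h, w = scans.shape
    L = np.asarray(light_dirs, dtype=float)            # (k, 3)
    I = scans.reshape(k, -1)                           # (k, H*W)
    g, *_ = np.linalg.lstsq(L, I, rcond=None)          # solve L @ g = I; g is (3, H*W)
    albedo = np.linalg.norm(g, axis=0)                 # inherent color at each point
    normals = g / np.maximum(albedo, 1e-8)             # unit normals: the hills and valleys
    return albedo.reshape(h, w), normals.reshape(3, h, w)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scans = rng.random((4, 64, 64))                      # stand-in for four real scans
    dirs = np.array([[1, 0, 1], [0, 1, 1], [-1, 0, 1], [0, -1, 1]], dtype=float)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # light from four compass directions
    albedo, normals = estimate_surface(scans, dirs)
    print(albedo.shape, normals.shape)                   # (64, 64) (3, 64, 64)
```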

Here is a visualization of surface shape from one of our experiments:

This part of the paper had the word “sum” printed on it. You can clearly see the raised areas where toner was applied to the paper to make the letters. Around the letters you can see the background texture of the paper.

Computing the surface texture is only one part of the job. From the texture, you want to compute a concise, secure “fingerprint” which can survive ordinary wear and tear on the paper, such as crumpling, scribbling or printing, and moisture. You also want to understand how secure the technology will be in various applications. Our full paper addresses these issues too. The bottom-line result is a sort of unique fingerprint for each piece of paper, which can be determined using an ordinary desktop scanner.
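To give a feel for what “concise fingerprint plus tolerant matching” means, here is a toy sketch using random-projection hashing and a Hamming-distance threshold. The construction in our paper is different (and includes error correction), so treat everything below, parameters included, as illustrative only:

```python
import numpy as np

def texture_fingerprint(surface, bits=512, seed=1234):
    """Toy fingerprint: project the normalized texture map onto `bits`
    fixed pseudo-random directions and keep only the signs."""
    rng = np.random.default_rng(seed)                 # fixed seed => same projections every time
    v = surface.ravel().astype(float)
    v = (v - v.mean()) / (v.std() + 1e-8)             # normalize away overall brightness
    proj = rng.standard_normal((bits, v.size)) @ v
    return (proj > 0).astype(np.uint8)                # a `bits`-bit fingerprint

def matches(fp_a, fp_b, max_differing=0.25):
    """Accept if the fraction of differing bits is small; the tolerance is
    what lets the fingerprint survive moderate wear and tear."""
    return float(np.mean(fp_a != fp_b)) < max_differing

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    page = rng.random((100, 100))                            # stand-in for a measured texture
    worn = page + 0.05 * rng.standard_normal((100, 100))     # same page after rough handling
    other = rng.random((100, 100))                           # a different sheet of paper
    print(matches(texture_fingerprint(page), texture_fingerprint(worn)))    # expected: True
    print(matches(texture_fingerprint(page), texture_fingerprint(other)))   # expected: False
```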

For more information, see the project website or our research paper.

More Privacy, Bit by Bit

Before the Holidays, Yahoo got a flurry of good press for the announcement that it would (as the LA Times puts it) “purge user data after 90 days.” My eagle-eyed friend Julian Sanchez noticed that the “purge” was less complete than privacy advocates might have hoped. It turns out that Yahoo won’t be deleting the contents of its search logs. Rather, it will merely be zeroing out the last 8 bits of users’ IP addresses. Julian is not impressed:

dropping the last byte of an IP address just means you’ve narrowed your search space down to (at most) 256 possibilities rather than a unique machine. By that standard, this post is anonymous, because I guarantee there are more than 255 other guys out there with the name “Julian Sanchez.”

The first three bytes, in the majority of cases, are still going to be enough to give you a service provider and a rough location. Assuming every address in the range is in use, dropping the least-significant byte just obscures which of the 256 users at that particular provider is behind each query. In practice, though, the search space is going to be smaller than that, because people are creatures of habit: You’re really working with the pool of users in that range who perform searches on Yahoo. If your not-yet-anonymized logs show, say, 45 IP addresses that match those first three bytes making routine searches on Yahoo (17.6% of the search market x 256 = 45), you can probably safely assume that an “anonymized” IP with the same three leading bytes is one of those 45. If different users tend to exhibit different usage patterns in search time, clustering of queries, expertise with Boolean operators, or preferred natural language, you can narrow it down further.
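For concreteness, zeroing out the final eight bits of an IPv4 address looks like this (the example addresses are made up, and this is of course not Yahoo’s actual code):

```python
import ipaddress

def zero_last_octet(ip: str) -> str:
    """Drop the final 8 bits of an IPv4 address, as described above:
    256 distinct addresses all map to the same "anonymized" value."""
    addr = ipaddress.IPv4Address(ip)
    return str(ipaddress.IPv4Address(int(addr) & 0xFFFFFF00))

print(zero_last_octet("198.51.100.77"))    # -> 198.51.100.0
print(zero_last_octet("198.51.100.200"))   # -> 198.51.100.0  (same bucket of 256)
```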

I think this isn’t quite fair to Yahoo. Dropping the last eight bits of the IP address certainly doesn’t protect privacy as much as deleting log entries entirely, but it’s far from useless. To start with, there’s often not a one-to-one correspondence between IP addresses and Internet users. Often a single user has multiple IPs. For example, when I connect to the Princeton wireless network, I’m dynamically assigned an IP address that may not be the same as the IP address I used the last time I logged on. I also access the web from my iPhone and from hotels and coffee shops when I travel. Conversely, several users on a given network may be sharing a single IP address using a technology called network address translation. So even if you know the IP address of the user who performed a particular search, that may simply tell you that the user works for a particular company or connected from a particular coffee shop. Hence, tracking a particular user’s online activities is already something of a challenge, and it becomes that much harder if several dozen users’ online activities are scrambled together in Yahoo!’s logs.

Now, whether this is “enough” privacy depends a lot on what kind of privacy problem you’re worried about. It seems to me that there are three broad categories of privacy concerns:

  • Privacy violations by Yahoo or its partners: Some people are worried that Yahoo itself is tracking their online activities, building an online profile about them, and selling this information to third parties. Obviously, Yahoo’s new policy will do little to allay such concerns. Indeed, as David Kravets points out, Yahoo will have already squeezed all the personal information it can out of those logs before it scrubs them. If you don’t trust Yahoo or its business partners, this move isn’t going to make you feel very much safer.
  • Data breaches: A second concern involves cases where customer data falls into the wrong hands due to a security breach. In this case, it’s not clear that search engine logs are especially useful to data thieves in the first place. Data thieves are typically looking for information such as credit card and Social Security numbers that can make them a quick buck. People rarely type such information into search boxes. Some searches may be embarrassing to users, but they probably won’t be so embarrassing as to enable blackmail or extortion. So search logs are not likely to be that useful to criminals, whether or not they are “anonymized.”
  • Court-ordered information release: This is the case where the new policy could have the biggest effect. Consider, for example, a case where the police seek a suspect’s search results. The new policy will help protect privacy in three ways: First, if Yahoo! can’t cleanly filter search logs by IP address, judges may be more reluctant to order the disclosure of several dozen users’ search results just to give police information about a single suspect. Second, scrubbing the last byte of the IP address will make searching through the data much more difficult. Finally, the resulting data will be less useful in a court of law, because prosecutors will need to convince a jury that a given search was performed by the defendant rather than by another user who happened to have a similar IP address. At the margin, then, Yahoo’s new policy seems likely to significantly enhance user privacy against government information requests. The same principle applies in the case of civil suits: the recording and movie industries, for example, will have a harder time using Yahoo!’s search logs as evidence that a user was engaged in illegal file-sharing.

So based on the small amount of information Yahoo has made available, it seems that the new policy is a real, if small, improvement in users’ privacy. However, it’s hard to draw any definite conclusions without more specific details about what Yahoo! is actually saving, because anonymizing data is a lot harder than people think. AOL learned this the hard way in 2006 when “anonymized” search results were released to researchers. People quickly noticed that you could figure out who various users were by looking at the contents of their searches. The data wasn’t so anonymous after all.

One reason AOL’s data wasn’t so anonymous is that AOL had “anonymized” the data set by assigning each user a unique ID. That meant people could look at all searches made by a single user and find searches that gave clues to the user’s identity. Had AOL instead stripped off the user information without replacing it, it would have been much harder to de-anonymize the data because there would be no way to match up different searches by the same user. If Yahoo’s logs include information linking each user’s various searches together, then even deleting the IP address entirely probably won’t be enough to safeguard user privacy. On the other hand, if the only user-identifying information is the IP address, then stripping off the low byte of the IP address is a real, if modest, privacy enhancement.
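The difference between the two approaches is easy to see on a toy log. In the sketch below (all values invented), the pseudonymized version still links every query by the same user together, which is exactly what sank AOL, while the stripped version does not:

```python
import hashlib

# A toy search log of (ip, query) pairs; every value here is invented.
log = [
    ("198.51.100.23", "flu symptoms"),
    ("198.51.100.23", "restaurants near my office"),
    ("203.0.113.9",   "weather princeton nj"),
]

# AOL-style pseudonymization: replace each IP with a stable made-up ID.
# All of one user's queries still link together, which is what made
# re-identification from query content possible.
pseudonymized = [(hashlib.sha256(ip.encode()).hexdigest()[:8], q) for ip, q in log]

# Stripping the identifier entirely: no way to tie one query to another,
# at the cost of losing any per-user analysis.
stripped = [q for _, q in log]

print(pseudonymized)
print(stripped)
```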

Can Google Flu Trends Be Manipulated?

Last week researchers from Google and the Centers for Disease Control unveiled a cool new research result, showing that they could gauge the level of influenza infections in a region of the U.S. by seeing how often people in those regions did Google searches for certain terms related to the flu and flu symptoms. The search-based predictions correlate remarkably well with the medical data on flu rates — not everyone who searches for “cough medicine” has the flu, but enough do that an increase in flu cases correlates with an increase in searches for “cough medicine” and similar terms. The system is called Google Flu Trends.
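As a toy illustration of the kind of relationship the system exploits, here is what correlating a made-up weekly flu-query series against made-up CDC illness data might look like; the published model is considerably more careful than this one-variable fit:

```python
import numpy as np

# Invented weekly numbers for one region: the share of queries that look
# flu-related, and the CDC's reported percentage of doctor visits for
# influenza-like illness (ILI).
flu_query_share = np.array([0.8, 0.9, 1.4, 2.1, 3.0, 2.6, 1.7, 1.1])
cdc_ili_percent = np.array([1.1, 1.3, 1.9, 2.8, 4.0, 3.5, 2.3, 1.5])

# How closely do the two series track each other?
r = np.corrcoef(flu_query_share, cdc_ili_percent)[0, 1]
print(f"correlation: {r:.3f}")

# A one-variable linear fit, in the spirit of "estimate flu activity from
# search volume" (again, purely illustrative).
slope, intercept = np.polyfit(flu_query_share, cdc_ili_percent, 1)
print(np.round(slope * flu_query_share + intercept, 2))
```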

Privacy groups have complained, but this use of search data seems benign — indeed, this level of flu detection requires only that search data be recorded per region, not per individual user. The legitimate privacy worry here is not about the flu project as it stands today but about other uses that Google or the government might find for search data later.

My concern today is whether Flu Trends can be manipulated. The system makes inferences from how people search, but people can change their search behavior. What if a person or a small group set out to convince Flu Trends that there was a flu outbreak this week?

An obvious approach would be for the conspirators to do lots of searches for likely flu-related terms, in order to inflate the count of flu-related searches. If all the searches came from a few computers, Flu Trends could presumably detect the anomalous pattern, and the algorithm could reduce the influence of these few computers. Perhaps this is already being done, but I don’t think the research paper mentions it.
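One simple countermeasure along these lines would be to cap how much any single source can contribute to the flu-related count. The sketch below is purely speculative, with an invented term matcher and invented numbers; nothing in the paper says Google does this:

```python
from collections import Counter

def looks_flu_related(query: str) -> bool:
    """Stand-in matcher; the real list of terms is not published."""
    return any(term in query.lower() for term in ("flu", "fever", "cough medicine"))

def capped_flu_query_count(query_log, cap_per_source=5):
    """Count flu-related queries, but let no single source contribute more
    than `cap_per_source`, so a handful of machines can't inflate the total."""
    per_source = Counter(src for src, query in query_log if looks_flu_related(query))
    return sum(min(n, cap_per_source) for n in per_source.values())

# Two honest-looking queries plus one machine hammering the system:
log = [("10.0.0.1", "cough medicine dosage"), ("10.0.0.2", "where to get a flu shot")]
log += [("10.0.0.99", "flu symptoms")] * 500

print(capped_flu_query_count(log))   # 1 + 1 + 5 = 7, rather than 502
```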

A more effective approach to spoofing Flu Trends would be to use a botnet — a large collection of hijacked computers — to send flu-related searches to Google from a larger number of computers. If the added searches were diffuse and well-randomized, they would be very hard to distinguish from legitimate searches, and Flu Trends would probably be fooled.

This possibility is not discussed in the Flu Trends research paper. The paper conspicuously fails to identify any of the search terms that the system is looking for. Normally a paper would list the terms, or at least give examples, but none of the terms appear in the paper, and the Flu Trends web site gives only “flu” as an example search term. They might be withholding the terms to make manipulation harder, but more likely they’re doing so for business reasons, perhaps because the terms have value in placing or selling ads.

Why would anyone want to manipulate Flu Trends? If flu rates affect the financial markets by moving the stock prices of certain drug or healthcare companies, then a manipulator can profit by sending false signals about flu rates.

The most interesting question about Flu Trends, though, is what other trends might be identifiable via search terms. Government might use similar methods to look for outbreaks of more virulent diseases, and businesses might look for cultural trends. In all of these cases, manipulation will be a risk.

There’s an interesting analogy to web linking behavior. When the web was young, people put links in their sites to point readers to other interesting sites. But when Google started inferring sites’ importance from their incoming links, manipulators started creating links purely for their effect on Google’s rankings. The result was an ongoing cat-and-mouse game between search engines and manipulators. The more commercial value search behavior takes on, the more manipulators will want to distort it for commercial or cultural advantage.

Anything that is valuable to measure is probably, to someone, valuable to manipulate.