[Today we have another announcement of an exciting new research paper. Undergraduate Dillon Reisman, for his senior thesis, applied our web measurement platform to study some timely questions. -Arvind Narayanan]
Over the past three months we’ve learnt that NSA uses third-party tracking cookies for surveillance (1, 2). These cookies, provided by a third-party advertising or analytics network (e.g. doubleclick.com, scorecardresearch.com), are ubiquitous on the web, and tag users’ browsers with unique pseudonymous IDs. In a new paper, we study just how big a privacy problem this is. We quantify what an observer can learn about a user’s web traffic by purely passively eavesdropping on the network, and arrive at surprising answers.
At first sight it doesn’t seem possible that eavesdropping alone can reveal much. First the eavesdropper on the Internet backbone sees millions of HTTP requests and responses. How can he associate the third-party HTTP request containing a user’s cookie with request to the first-party web page that the browser visited, which doesn’t contain the cookie? Second, how can visits to different first parties be linked to each other? And finally, even if all the web traffic for a single user can be linked together, how can the adversary go from a set pseudonymous cookies to the user’s real-world identity?
The diagram illustrates how the eavesdropper can use multiple third-party cookies to link traffic. When a user visits ‘www.exampleA.com,’ the response contains the embedded tracker X, with an ID cookie ‘xxx’. The visits to exampleA and to X are tied together by IP address, which typically doesn’t change within a single page visit . Another page visited by the same user might embed tracker Y bearing the pseudonymous cookie ‘yyy’. If the two page visits were made from different IP addresses, an eavesdropper seeing these cookies can’t tell that the same browser made both visits. But if a third page, however, embeds both trackers X and Y, then the eavesdropper will know that IDs ‘xxx’ and ‘yyy’ belong to the same user. This method applied iteratively has the potential of tying together a lot of the traffic of a single user.
Once we had this idea, we wanted to test if it would actually work in practice. Everything depends on just how densely third-party trackers are actually embedded on sites. We conducted automated web crawls of 65 simulated users’ web browsing over three months, and found that unique cookies are so prevalent that the eavesdropper can reliably link 90% of a user’s web page visits to the same pseudonymous ID. (We omitted pages that embed no ID cookies at all, but those are a minority.)
We also found that the cookie linking method is extremely robust and succeeds under a variety of conditions (Section 4.1). We considered how variations in cookie expiration dates, the size of the user’s history (i.e., the number of pages visited), and the types of pages visited affect the eavesdropper’s changes, and found the impact to be minimal. Perhaps most significantly, however, we found that this surveillance method can still link about 50% of a user’s history to the same pseudonymous ID even with just 25% of the current density of trackers on the web. This means that even if 75% of sites or trackers adopt mitigation strategies (such as deploying HTTPS), the eavesdropper still learns a lot.
Finally, we studied how an eavesdropper might learn the real-world identity behind a cluster of web pages associated with a pseudonymous ID. It turns out that this is surprisingly easy — many sites display real-world attributes such as real name, username, or email on unencrypted pages to logged in users, which means that the eavesdropper gets to see these identifiers. We conducted a survey of such leakage on popular sites, and found that over half of popular sites with account creation leak some form of real-world identity (Section 4.2).
While it’s no surprise that web traffic contains sensitive information about individuals, what we’ve shown is just how complete a profile can be extracted even if the user’s traffic is mixed with millions of other users. Further, an eavesdropper can connect these profiles to real-world identities without needing the co-operation of any websites. While HTTPS deployment by trackers can help, the only practical solution at the current time seems to be for users to install anti-tracking and anonymity tools.
 An exception is if the user routes traffic through Tor. Different requests can take different paths and the exit node IPs will be different. Thus, use of Tor with application-layer anonymization (e.g., Tor browser bundle) defeats our attack.
[Edit: minor edit for clarity.]