As you browse the internet, online advertisers track nearly every site you visit, amassing a trove of information on your habits and preferences. When you visit a news site, they might see you’re a fan of basketball, opera and mystery novels, and accordingly select ads tailored to your tastes. Advertisers use this information to create highly personalized experiences, but they typically don’t know exactly who you are. They observe only your digital trail, not your identity itself, and so you might feel that you’ve retained a degree of anonymity.
In new work with Ansh Shukla, Sharad Goel and Arvind Narayanan, we show that these anonymous web browsing records can in fact often be tied back to real-world identities. (Check out our demo, and see if we can figure out who you are.)
At a high level, our approach is based on a simple observation. Each person has a highly distinctive social network, composed of family and friends from school, work, and various stages of life. As a consequence, the set of links in your Facebook and Twitter feeds is likewise highly distinctive, and clicking on these links leaves a tell-tale mark in your browsing history.
Given only the set of web pages an individual has visited, we determine which social media feeds are most similar to it, yielding a list of candidate users who likely generated that web browsing history. In this manner, we can tie a person’s real-world identity to the nearly complete set of links they have visited, including links that were never posted on any social media site. This method requires only that a person click on the links appearing in their social media feed, not that they post any content.
Carrying out this strategy involves two key challenges, one theoretical and one of engineering. The theoretical problem is quantifying how similar a specific social media feed is to a given web browsing history. One simple similarity measure is the fraction of links in the browsing history that also appear in the feed. This metric works reasonably well in practice, but it overstates similarity for large feeds, since those simply contain more links. We instead take an alternative approach. We posit a stylized, probabilistic model of web browsing behavior, and then compute the likelihood that a user with that social media feed generated the observed browsing history. It turns out that this method is approximately equivalent to scaling the fraction of history links that appear in the feed by the log of the feed size.
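To make the two similarity measures concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not our actual implementation: the function names are hypothetical, feeds and histories are treated as plain sets of URLs, and "scaling by the log of the feed size" is read here as dividing by the natural log, so that large feeds are penalized.

```python
import math


def overlap_fraction(history, feed):
    """Fraction of distinct history links that also appear in the feed."""
    history = set(history)
    if not history:
        return 0.0
    return len(history & set(feed)) / len(history)


def log_scaled_score(history, feed):
    """Overlap fraction divided by log(feed size).

    Assumption: "scaling" is interpreted as division, which discounts
    feeds that match only because they contain many links.
    """
    feed = set(feed)
    if len(feed) < 2:  # log(1) == 0, so tiny feeds cannot be scored
        return 0.0
    return overlap_fraction(history, feed) / math.log(len(feed))


def rank_candidates(history, feeds, top_k=3):
    """Return the top_k user names, ordered by descending log-scaled score.

    `feeds` is a hypothetical dict mapping user name -> iterable of links.
    """
    scored = [(log_scaled_score(history, links), user)
              for user, links in feeds.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [user for _, user in scored[:top_k]]
```

In this sketch, a small feed containing three of four history links can outrank a thousand-link feed containing all four, which is the correction the plain fraction lacks.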
The engineering challenge is identifying the most similar feeds in real time. Here we turn to Twitter, since Twitter feeds (in contrast to Facebook feeds) are largely public. However, even though the feeds are public, we cannot simply create a local copy of Twitter against which we can run our queries. Instead, we apply a series of heuristics to dramatically reduce the search space. We then combine caching techniques with on-demand network crawls to construct the feeds of the most promising candidates. On this reduced candidate set, we apply our similarity measure to produce the final results. Given a browsing history, we can typically carry out this entire process in under 60 seconds.
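The general shape of such a pipeline (prune first, then fetch feeds only for the survivors) might look like the sketch below. Everything in it is an assumption made for illustration: the post does not spell out the heuristics, so the sketch substitutes a simple vote count over a hypothetical `link_index` (a mapping from link to accounts whose feeds are known to contain it), and `fetch_feed` stands in for whatever crawl actually retrieves a feed.

```python
from collections import Counter


def shortlist(history, link_index, max_candidates=100):
    """Keep the accounts associated with the most history links.

    `link_index` is a hypothetical dict: link -> iterable of account names.
    This vote count is a stand-in heuristic, not the one used in the work.
    """
    votes = Counter()
    for link in set(history):
        for user in link_index.get(link, ()):
            votes[user] += 1
    return [user for user, _ in votes.most_common(max_candidates)]


class FeedCache:
    """Fetch each candidate's feed at most once, on demand."""

    def __init__(self, fetch_feed):
        self._fetch = fetch_feed  # hypothetical crawl function: user -> links
        self._store = {}
        self.fetches = 0          # number of actual crawls performed

    def get(self, user):
        if user not in self._store:
            self._store[user] = set(self._fetch(user))
            self.fetches += 1
        return self._store[user]
```

The shortlisted accounts would then be scored with the similarity measure, with `FeedCache.get` ensuring that repeated queries do not trigger repeated crawls.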
Our initial tests indicate that for people who regularly browse Twitter, we can deduce their identity from their web browsing history about 80% of the time. Try out our web application, and let us know if it works on you!