
Archives for September 2016

The Effect of DNS on Tor’s Anonymity

This blog post is joint work with Benjamin Greschbach, Tobias Pulls, Laura M. Roberts, and Nick Feamster.

Counting almost two million daily users and 7,000 relays, the Tor network is the largest anonymity network operating today. The Tor Project maintains a privacy-enhanced version of the popular Firefox web browser—Tor Browser—that bounces its network traffic over the Tor network. By using Tor Browser, users can protect their web browsing from government surveillance, stalkers, advertisement-injecting ISPs, and nosy neighbors. Tor works by decoupling a user’s identity (i.e., the IP address, which reveals where in the world you are) from her activity (e.g., visiting Facebook). When using Tor, your ISP can no longer learn what websites you visit, but it does know that you are using Tor. The website you visit (e.g., Facebook) will also know that you are using Tor, but it will not learn your IP address. The EFF maintains a great interactive diagram illustrating information leakage in several scenarios.

While the use of Tor constitutes a significant privacy gain over off-the-shelf web browsers, it is no panacea, and the Tor Project is upfront about its limitations. These limitations are not news to the research community. It is well understood that low-latency anonymity networks such as Tor cannot protect against so-called global passive adversaries: those with the ability to monitor both the traffic that enters the network and the traffic that exits it. Such an adversary can run a correlation attack, matching packets that go into the network to packets that leave it; in other words, it can link a client’s identity (her IP address) to her activity (e.g., visiting Facebook) and thus break anonymity. By design, Tor cannot protect against these global passive adversaries. However, there is still merit in understanding which organizations are in a position to launch such attacks and how practical these attacks are. Only then can we gain an understanding of how much of a real-world threat these attacks constitute.
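
To make the correlation idea concrete, here is a minimal sketch of timing correlation between the two vantage points. It is an illustration rather than any study's actual code: the window size, delay, and traffic below are made-up assumptions.

```python
# A minimal sketch of timing-based traffic correlation, assuming an
# adversary already holds packet timestamps from both vantage points.
import numpy as np

def volume_series(timestamps, window=0.5, duration=60.0):
    """Bin packet timestamps into fixed windows; count packets per window."""
    bins = np.arange(0.0, duration + window, window)
    counts, _ = np.histogram(timestamps, bins=bins)
    return counts

def correlation_score(ingress_ts, egress_ts):
    """Pearson correlation of the two volume series; values near 1.0
    suggest the streams belong to the same Tor circuit."""
    a = volume_series(np.asarray(ingress_ts))
    b = volume_series(np.asarray(egress_ts))
    return np.corrcoef(a, b)[0, 1]

# Toy demonstration: the egress stream mirrors the ingress stream,
# delayed by roughly the circuit's propagation time.
rng = np.random.default_rng(0)
ingress = np.sort(rng.uniform(0.0, 60.0, size=2000))
egress = ingress + 0.08  # ~80 ms through the circuit
unrelated = np.sort(rng.uniform(0.0, 60.0, size=2000))
print(f"matching streams:  {correlation_score(ingress, egress):.3f}")
print(f"unrelated streams: {correlation_score(ingress, unrelated):.3f}")
```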

The research community has put a lot of effort into modeling correlation attacks and their effect on Tor users. To date, these studies have modeled mostly web traffic, i.e., HTTP over port 80 and HTTPS over port 443, which makes sense as web traffic likely constitutes the bulk of a typical Tor user’s online activity. However, there is more to network traffic than just web traffic. Before a browser can fetch a web page such as Facebook, it has to send out a DNS request to learn that the domain facebook.com is reachable via its IP address 173.252.120.68 (among others). DNS traffic is ubiquitous, both over and outside of Tor, so how does this relate to correlation attacks? DNS is a very different beast than HTTP, which is what most research has studied. HTTP traffic travels from a user’s exit relay directly to one or more web servers. DNS traffic, however, can traverse very different paths, depending on how exit relays (the relays in the Tor network that provide the link between Tor and the Internet) are configured. In particular, we find that DNS traffic traverses many networks that are entirely different than the networks that subsequent web traffic traverses. This means that past research likely underestimated the threat of correlation attacks because there are entities on the Internet such as ISPs, autonomous systems, or Internet exchange points that can monitor some DNS traffic but not web traffic coming out of the Tor network and potentially use the DNS traffic to deanonymize Tor users.
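
As a quick illustration of the DNS step in isolation, the following snippet performs the same lookup a browser would trigger before fetching a page. It needs network access, and the addresses returned today will likely differ from the one cited above.

```python
import socket

# The lookup a browser performs before any HTTP(S) traffic flows:
# ask the configured DNS resolver which addresses serve the domain.
for family, _, _, _, sockaddr in socket.getaddrinfo(
        "facebook.com", 443, proto=socket.IPPROTO_TCP):
    print(family.name, sockaddr[0])
```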

Overview of concept

Past traffic correlation studies have focused on linking the TCP stream entering the Tor network to the one(s) exiting the network. We show that an adversary can also link the associated DNS traffic, which can be exposed to many more autonomous systems than the TCP stream.

After many months of work, we have published a research paper on how DNS affects Tor’s anonymity. In our work, we show how an adversary can combine monitored DNS requests with well-understood website fingerprinting attacks to create a new type of DNS-enhanced correlation attack, or DNS-enhanced website fingerprinting attack, depending on how you look at it. We know of several organizations that are in a position to use these attacks, notably Google. Because it operates a popular open DNS resolver, the company can already observe many DNS requests. Additionally, Google can monitor some network traffic that is entering the Tor network: for example, via Google Fiber, via guard relays that are occasionally run in Google’s cloud, and formerly via the meek App Engine transport, which is now defunct. Our results show that by monitoring DNS traffic, an attacker can make a website fingerprinting attack highly reliable: when our new, combined attack identifies a website, it is highly likely to be correct, for a large number of websites on the Internet. Furthermore, we show that our new type of attack works well on websites that are visited infrequently over the Tor network. Presumably, this includes censored websites in different regions of the world and websites for whistleblowers and activists, who are in many ways the Tor users in most need of protection. Beyond further motivating the deployment of website fingerprinting defenses in Tor, our attack has implications for the design of such defenses: they should probably be stronger than previously thought.
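
To sketch the intuition behind the combined attack, consider the following simplified stand-in: observed DNS requests corroborate or veto the ranked guesses of a conventional website fingerprinting classifier. The `classifier` and `site_domains` inputs below are hypothetical placeholders, not the code from the paper.

```python
def dns_enhanced_guess(trace, observed_domains, classifier, site_domains):
    """Return the classifier's best guess that is consistent with the
    DNS requests observed for this circuit, or None to abstain."""
    for site in classifier(trace):  # candidate sites, best first
        # site_domains maps a site to the domains it loads resources from
        if site_domains[site] & observed_domains:
            return site  # DNS evidence corroborates this candidate
    return None  # no candidate fits the observed DNS -> abstain

# Hypothetical inputs: a two-candidate classifier and a small domain map.
site_domains = {
    "example-news.com": {"example-news.com", "cdn.example-news.com"},
    "example-shop.com": {"example-shop.com"},
}
classifier = lambda trace: ["example-shop.com", "example-news.com"]
observed = {"cdn.example-news.com"}

print(dns_enhanced_guess(b"...", observed, classifier, site_domains))
# -> example-news.com: the DNS observation vetoes the top traffic guess
```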

Our study also provides new insight into the current state of exit relays in the Tor network: we find that a significant fraction of exit relays send DNS requests to Google’s public resolvers. Google sees about one-third of all DNS requests that exit the Tor network—an alarmingly high fraction for a single company, particularly because Tor’s very design avoids centralized points of control and observation. Although Tor itself is reasonably decentralized, our work shows that the same does not hold for the wider ecosystem it exists in. We also simulate the Internet-scale implications of our work using the Tor Path Simulator (TorPS). This allows us to explore what-if scenarios, such as the implications for Tor users if all exit relays were to run their own local resolvers. Our results show that while running local DNS resolvers on exit relays has its benefits, it also means that DNS requests are very exposed to network-level adversaries.
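
As a back-of-the-envelope illustration of the aggregation behind such a number, one can weight each exit relay by its selection probability and sum the weight of exits that forward DNS to Google's resolvers. The weights and resolver addresses below are invented; only Google's well-known resolver IPs are real.

```python
GOOGLE_DNS = {"8.8.8.8", "8.8.4.4"}

def google_dns_fraction(exits):
    """exits: (selection_weight, resolver_ip) pairs for each exit relay."""
    total = sum(weight for weight, _ in exits)
    google = sum(weight for weight, ip in exits if ip in GOOGLE_DNS)
    return google / total

measured = [
    (0.30, "8.8.8.8"),     # exit forwards its DNS to Google
    (0.25, "192.0.2.53"),  # ISP-provided resolver
    (0.25, "127.0.0.1"),   # local resolver running on the exit itself
    (0.20, "8.8.4.4"),
]
print(f"{google_dns_fraction(measured):.0%} of exit DNS goes to Google")
```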

So what does our work mean for Tor users? We believe that there is no immediate cause for concern. Adversaries that can already monitor large fractions of the Internet—for many people, the biggest threat—will not do any better with our attack. Instead, we investigate how “semi-global” adversaries can get the most out of the data they have. Finally, the Tor Project is already working on techniques to make website fingerprinting attacks harder.

Our project page has more information about our study. Alongside the research paper, which is now under review, we have also published our code, data, and detailed instructions for replicating our work.

My testimony before the House Subcommittee on IT

I was invited to testify yesterday before the U.S. House of Representatives Subcommittee on Information Technology, at a hearing entitled “Cybersecurity: Ensuring the Integrity of the Ballot Box.”  My written testimony is available here.  My 5-minute opening statement went as follows:

My name is Andrew Appel.  I am Professor of Computer Science at Princeton University.   In this testimony I do not represent my employer. I’m here to give my own professional opinions as a scientist, but also as an American citizen who cares deeply about protecting our democracy.

My research is in software verification, computer security, technology policy, and election machinery.  As I will explain, I strongly recommend that, at a minimum, the Congress seek to ensure the elimination of Direct-Recording Electronic voting machines (sometimes called “touchscreen” machines) immediately after this November’s election; and that it require sensible auditing after every election, to ensure that systems are functioning properly and to prove to the American people that their votes are counted as cast.

There are cybersecurity issues in all parts of our election system:  before the election, voter-registration databases; during the election, voting machines; after the election, vote-tabulation / canvassing / precinct-aggregation computers.  In my opening statement I’ll focus on voting machines.  The other topics are addressed in a recent report I have co-authored entitled “Ten Things Election Officials Can Do to Help Secure and Inspire Confidence in This Fall’s Elections.”

In the U.S. we use two kinds of voting machines: optical scanners that count paper ballots, and “touchscreen” voting machines, also called “Direct-Recording Electronic.”  Each voting machine is a computer, running a computer program.  Whether that computer counts the votes accurately, or makes mistakes, or cheats by shifting votes from one candidate to another, depends on what software is installed in the computer.  We all use computers, and we’ve all had occasion to install new software.  Sometimes it’s an app we purchase and install on purpose, sometimes it’s a software upgrade sent by the company that made our operating system.  Installing new software in a voting machine is not really much different from installing new software in any other kind of computer.

Installing new software is how you hack a voting machine to cheat. In 2009, in the courtroom of the Superior Court of New Jersey,  I demonstrated how to hack a voting machine.  I wrote a vote-stealing computer program that shifts votes from one candidate to another.   Installing that vote-stealing program in a voting machine takes 7 minutes, per machine, with a screwdriver.  I did this in a secure facility and I’m confident my program has not leaked out to affect real elections, but really the software I built was not rocket science — any computer programmer could write the same code.  Once it’s installed, it could steal elections without detection for years to come.

Voting machines are often delivered to polling places several days before the election—to elementary schools, churches, firehouses.  In these locations anyone could gain access to a voting machine for 10 minutes.  Between elections the machines are routinely opened up for maintenance by county employees or private contractors.  Let’s assume they have the utmost integrity, but still, in the U.S. we try to run our elections so that we can trust the election results without relying on any one individual.

Other computer scientists have demonstrated similar hacks on many models of voting machine.  This is not just one glitch in one manufacturer’s machine; it’s the very nature of computers.

So how can we trust our elections when it’s so easy to make the computers cheat?  Forty states already know the answer:  vote on optical-scan paper ballots.  (My written testimony clarifies this statement.)  The voter fills in the bubble next to the name of their preferred candidate, then takes this paper ballot to the scanner—right there in the precinct—and feeds it in.  That opscan voting machine has a computer in it, and we can’t 100% prevent the computer from being hacked, but that very paper ballot marked by the voter drops into a sealed ballot box under the opscan machine.  Those ballots can be recounted by hand, in a way we can trust.

Unfortunately, there are still about 10 states that primarily use paperless touchscreen voting computers.  There’s no paper ballot to recount.  After the voter touches the screen, we have to rely on the computer—that is, we have to rely on whatever program is installed in the computer that day—to print out the true totals when the polls close.

So what must we do?  In the near term, we must not connect the voting machines to the Internet.  The same goes for the computers used to prepare the electronic ballot-definition files that program the voting machines before each election—that is, we must not connect the voting machines even indirectly to the Internet.  Many able and competent election administrators already follow this “best practice.”  I hope that all 9,000 counties and states that run elections follow this practice, and other security best practices, but it’s hard to tell whether they all do.

These and other best practices can help protect against hacking of voting machines by people in other countries through the Internet.  But they can’t protect us from mistakes, software bugs, miscalibration, insider hacking, or against local criminals with access to the machines before or after elections.  So what we must do as soon as possible after November is to adopt nationwide what 40 states have already done: paper ballots, marked by the voter, countable by computer but recountable by hand.

In 2000 we all saw what a disastrously unreliable technology those punch-card ballots were.  So in 2002 the Congress outlawed punch-card ballots, and that was very appropriate.  I strongly recommend that the Congress seek to ensure the elimination of paperless “touchscreen” voting machines, immediately after this November’s election.

Are you really anonymous online?

As you browse the internet, online advertisers track nearly every site you visit, amassing a trove of information on your habits and preferences. When you visit a news site, they might see you’re a fan of basketball, opera and mystery novels, and accordingly select ads tailored to your tastes. Advertisers use this information to create highly personalized experiences, but they typically don’t know exactly who you are. They observe only your digital trail, not your identity itself, and so you might feel that you’ve retained a degree of anonymity.

In new work with Ansh Shukla, Sharad Goel and Arvind Narayanan, we show that these anonymous web browsing records can in fact often be tied back to real-world identities. (Check out our demo, and see if we can figure out who you are.)

At a high level, our approach is based on a simple observation. Each person has a highly distinctive social network, comprising family and friends from school, work, and various stages of life. As a consequence, the set of links in your Facebook and Twitter feeds is likewise highly distinctive, and clicking on these links leaves a tell-tale mark in your browsing history.

Given only the set of web pages an individual has visited, we determine which social media feeds are most similar to it, yielding a list of candidate users who likely generated that web browsing history. In this manner, we can tie a person’s real-world identity to the near-complete set of links they have visited, including links that were never posted on any social media site. The method requires only that a person click on the links appearing in their social media feed, not that they post any content.

Carrying out this strategy involves two key challenges, one theoretical and one engineering. The theoretical problem is quantifying how similar a specific social media feed is to a given web browsing history. One simple similarity measure is the fraction of links in the browsing history that also appear in the feed. This metric works reasonably well in practice, but it overstates similarity for large feeds, since those simply contain more links. We therefore take an alternative approach: we posit a stylized, probabilistic model of web browsing behavior, and then compute the likelihood that a user with that social media feed generated the observed browsing history. It turns out that this method is approximately equivalent to scaling the fraction of history links that appear in the feed by the log of the feed size.
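
Here is a rough sketch of the two measures. The exact normalization in the paper may differ; dividing the naive fraction by the log of the feed size is one way to illustrate the idea of penalizing large feeds.

```python
import math

def naive_similarity(history, feed):
    """Fraction of browsing-history links that also appear in the feed."""
    return len(history & feed) / len(history)

def scaled_similarity(history, feed):
    """Penalize large feeds: scale the naive fraction by log feed size."""
    return naive_similarity(history, feed) / math.log(len(feed) + 1)

history = {"a.com/1", "b.com/2", "c.com/3", "d.com/4"}
focused_feed = {"a.com/1", "b.com/2"}
padded_feed = {f"filler.com/{i}" for i in range(500)} | focused_feed

# Both feeds match half the history, but the padded feed scores far lower.
for name, feed in [("focused", focused_feed), ("padded", padded_feed)]:
    print(name, naive_similarity(history, feed),
          round(scaled_similarity(history, feed), 3))
```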

The engineering challenge is identifying the most similar feeds in real time. Here we turn to Twitter, since Twitter feeds (in contrast to Facebook) are largely public. However, even though the feeds are public, we cannot simply create a local copy of Twitter against which we can run our queries. Instead we apply a series of heuristics to dramatically reduce the search space. We then combine caching techniques with on-demand network crawls to construct the feeds of the most promising candidates. On this reduced candidate set, we apply our similarity measure to produce the final results. Given a browsing history, we can typically carry out this entire process in under 60 seconds.
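
The overall pipeline shape can be sketched as follows. The feeds, pruning heuristic, and scoring below are illustrative stand-ins, not our production system.

```python
import math
from functools import lru_cache

FAKE_FEEDS = {  # stand-in for Twitter's public feeds
    "@alice": frozenset({"a.com/1", "b.com/2"}),
    "@bob": frozenset({"x.com/9", "y.com/8"}),
}

def candidate_users(history):
    """Cheap heuristic to shrink the search space: keep only users
    whose feed shares at least one link with the history."""
    return [u for u, feed in FAKE_FEEDS.items() if feed & history]

@lru_cache(maxsize=100_000)
def fetch_feed(user):
    """Cached feed lookup; the real system crawls Twitter on demand."""
    return FAKE_FEEDS[user]

def identify(history):
    """Rank the surviving candidates by a log-scaled similarity score."""
    def score(feed):
        return (len(history & feed) / len(history)) / math.log(len(feed) + 1)
    ranked = sorted(candidate_users(history),
                    key=lambda u: score(fetch_feed(u)), reverse=True)
    return ranked[0] if ranked else None

print(identify(frozenset({"a.com/1", "b.com/2", "c.com/3"})))  # -> @alice
```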

Our initial tests indicate that for people who regularly browse Twitter, we can deduce their identity from their web browsing history about 80% of the time. Try out our web application, and let us know if it works on you!