March 6, 2016


Apple/FBI: Freedom of speech vs. compulsion to sign

This week I signed the Electronic Frontier Foundation’s amicus (friend-of-the-court) brief in the Apple/FBI iPhone-unlocking lawsuit.  Many prominent computer scientists and cryptographers signed: Josh Aas, Hal Abelson, Judy Anderson, Andrew Appel, Tom Ball (the Google one, not the Microsoft one), Boaz Barak, Brian Behlendorf, Rich Belgard, Dan Bernstein, Matt Bishop, Josh Bloch, Fred Brooks, Mark Davis, Jeff Dean, Peter Deutsch, David Dill, Les Earnest, Brendan Eich, David Farber, Joan Feigenbaum, Michael Fischer, Bryan Ford, Matt Franklin, Matt Green, Alex Halderman, Martin Hellman, Nadia Heninger, Miguel de Icaza, Tanja Lange, Ed Lazowska, George Ledin, Patrick McDaniel, David Patterson, Vern Paxson, Thomas Ristenpart, Ron Rivest, Phillip Rogaway, Greg Rose, Guido van Rossum, Tom Shrimpton, Barbara Simons, Gene Spafford, Dan Wallach, Nickolai Zeldovich, Yan Zhu, Phil Zimmermann. (See also the EFF’s blog post.)

The technical and legal argument is based on the First Amendment: (1) Computer programs are a form of speech; (2) the Government cannot compel you to “say” something any more than it can prohibit you from expressing something.  Also, (3) digital signatures are a form of signature; (4) the government cannot compel or coerce you to sign a statement that you don’t believe, a statement that is inconsistent with your values.  Each of these four statements has ample precedent in Federal law.  Combined together, (1) and (2) mean that Apple cannot be compelled to write a specific computer program.  (3) and (4) mean that even if the FBI wrote the program (instead of forcing Apple to write it), Apple could not be compelled to sign it with its secret signing key.  The brief argues,

By compelling Apple to write and then digitally sign new code, the Order forces Apple to first write a message to the government’s specifications, and then adopt, verify and endorse that message as its own, despite its strong disagreement with that message. The Court’s Order is thus akin to the government dictating a letter endorsing its preferred position and forcing Apple to transcribe it and sign its unique and forgery-proof name at the bottom.

There are millions of iPhones that rely on Apple’s considered opinion about whether it’s a good idea to install a software update.  Or, if you like, there are millions of iPhone owners who rely on Apple’s considered opinion about what software updates are safe and advisable to install on their phones, and who have deliberately purchased phones that implement this personal reliance by the mechanism of public-key authentication.  The FBI seeks to force Apple to sign a statement in contradiction with its values.  But “compelled speech doctrine prevents the government from forcing its citizens [and corporations] to be hypocrites.”

Regarding whether (1) computer programs are a form of speech, the EFF’s brief cites three previous Federal cases where appellate courts have upheld this principle:

It is long settled that computer code, including the code that makes up Apple’s iOS operating system and its security features including encryption, is a form of protected speech under the First Amendment. Universal City Studios, Inc. v. Corley (2d Cir. 2001); Junger v. Daley (6th Cir. 2000); Bernstein v. DOJ (9th Cir. 1999).  Code consistently receives First Amendment protection because code, like a written musical score, “is an expressive means for the exchange of information and ideas.” (Junger, 209 F.3d at 484.) In Corley, which similarly considered code that could be used to undermine security, the Second Circuit held that “[c]ommunication does not lose constitutional protection as ‘speech’ simply because it is expressed in the language of computer code. Mathematical formulae and musical scores are written in ‘code,’ i.e., symbolic notations not comprehensible to the uninitiated, and yet both are covered by the First Amendment.” (273 F.3d at 445–46.) [citations abridged.]

I am proud to have participated in each of these three cases.  In Bernstein v. DOJ I wrote this declaration in 1996 (a form of written testimony about facts), saying, “yes, we scholars publish computer programs just like we publish natural-language papers; I have been publishing open-source software since 1988.”  In Junger v. Daley I wrote this similar declaration in 1997.  These two cases were the foundation of the 1990s victories in “freedom of cryptography” in the United States.

In Universal City Studios, Inc. v. Corley I not only wrote a declaration, I testified as a witness in Federal Court in New York. The “DMCA police” mostly won that case, unfortunately, but even so, the judges at both levels (district court and appellate court) agreed that source code is speech.

Professor Richards has argued that the “Code=Speech” argument is problematic, that this is not the best argument in favor of Apple’s position.  He says, even if Code is Speech, there are other restrictions on speech that (the courts have held) are consistent with freedom of expression.  But he does agree with the digital-signature argument:

“making Apple write and disseminate this particular code—making it lie to one of its customers—could be seen as a kind of compelled speech, and that could violate the First Amendment. But this wouldn’t be the case merely because (as Apple assumes) Code = Speech and the FBI was compelling the creation of code. Instead, it would be offensive to the First Amendment because the particular act being compelled—authenticating a security update as true when it was false—would be compelled false communication in a relationship of trust. The law on this narrow question is underdeveloped, but it could allow Apple to win on free speech grounds.”

Other amicus briefs in support of Apple focus on other important and excellent arguments.  The brief from Stanford, signed by Dino Dai Zovi, Dan Boneh, Charlie Miller, Hovav Shacham, Bruce Schneier, Dan Wallach and Jonathan Zdziarski, argues that “[f]orcing device manufacturers to create forensic capabilities for U.S. investigators creates security risks” and “security breaches are all but certain when law mandates government access.”  I agree completely with this brief as well (but one can’t sign every brief!).

The ACLU’s brief argues, “The All Writs Act does not authorize the government” to compel Apple to do this, in part because “Congress has deliberately withheld the authority sought here;” and that “the order the government seeks violates the Fifth Amendment.”  I quite agree with the ACLU, of which I have been a member since 1988; but I am a computer scientist, not a lawyer, so on that amicus brief my computer-security and computer-science expertise does not specifically apply.



What Your ISP (Probably) Knows About You

Earlier this week, I came across a working paper from Professor Peter Swire—a highly respected attorney, professor, and policy expert.  Swire’s paper, entitled “Online Privacy and ISPs”, argues that ISPs have limited visibility into users’ online activity, for three reasons: (1) users are increasingly using many devices and connections, so any single ISP is the conduit of only a fraction of a typical user’s activity; (2) end-to-end encryption is becoming more pervasive, which limits ISPs’ ability to glean information about user activity; and (3) users are increasingly shifting to VPNs to send traffic.

An informed reader might surmise that this writeup relates to the reclassification of Internet service providers under Title II of the Telecommunications Act, which gives the FCC a mandate to protect private information that ISPs learn about their customers. This private information includes personal information as well as information about a customer’s use of the service that the ISP learns by virtue of providing that service—sometimes called Customer Proprietary Network Information, or CPNI. One possible conclusion a reader might draw from this white paper is that ISPs have limited capability to learn information about customers’ use of their service and hence should not be subject to additional privacy regulations.

I am not taking a position in this policy debate, nor do I intend to make any normative statements about whether an ISP’s ability to see this type of user information is inherently “good” or “bad” (in fact, one might even argue that an ISP’s ability to see this information might improve network security, network management, or other services). Nevertheless, these debates should be based on a technical picture that is as accurate as possible.  In this vein, it is worth examining Professor Swire’s “factual description of today’s online ecosystem”, which claims to offer the reader an “up-to-date and accurate understanding of the facts”. The report certainly contains many facts, but it also omits important details about the “online ecosystem”. Below, I fill in what I see as some important missing pieces. Much of what I discuss below I have also sent verbatim in a letter to the FCC Chairman. I hope that the original report will ultimately incorporate some of these points.

Claim 1: User Mobility Prevents a Single ISP from Observing a Significant Fraction of User Traffic

The report’s claim: Due to increased mobility, users are increasingly using many devices and connections, so any single ISP is the conduit of only a fraction of a typical user’s activity.

A missing piece: A single ISP can still track significant user activities from home network traffic and (as the user moves) through WiFi sharing services.

The report cites statistics from Cisco on the increase in mobile devices; these statistics do not offer precise information about how user traffic distributes across ISPs, but it’s reasonable to assume that users who are more mobile are not sending all of their traffic over a single access ISP.

Yet, a user’s increased mobility by no means implies that a single ISP cannot track users’ activities in their homes. Our previous research has shown that the traffic that users send in their home networks—typically through a single ISP—reveals significant information about user activity and behavior. In the homes we studied, the median home had about five connected devices at any given time. Simply by observing traffic patterns (i.e., without looking at any packet contents), we could determine the types of devices that users had in their homes, as well as how often (and how heavily) they used each device. In some cases, we could even determine when the user was likely to be home, based on diurnal traffic usage patterns. We could determine the most popular domains that each home visited.  The figure below shows examples of such popular domains.

Lots to learn from home network traffic. This example from our previous work shows popular domains across 25 home networks: for each domain, the number of homes in which it appeared among that home’s top 5 or top 10 most popular domains.


Based on what we can observe from this traffic, it should come as no surprise that the data that we gathered—which is the same data an ISP can see—warrants special handling, due to its private nature. University Institutional Review Boards (IRBs) consider this type of work human subjects research because it “obtains (1) data through intervention or interaction with the individual; or (2) private, identifiable information”; indeed, we had to get special approval to even perform this study in the first place.

The report claims that user mobility may make it more difficult for a single ISP to track a user’s activity, because a mobile user is more likely to connect through different ISPs.  But another twist in this story makes me think that this deserves more careful examination: the rise of shared WiFi hotspots—such as Xfinity WiFi, which had deployed 10 million WiFi hotspots as of mid-2015, and which users had accessed 3.6 billion times—in some cases allows a single ISP to track mobile users more than it otherwise could without such a service.

Incidentally, the report also says that “limited processing power and storage placed technical and cost limits [on deep-packet inspection] capability”, but in the last mile, data rates are substantially lower and can thus permit DPI.  For example, we had no trouble gathering all of the traffic data for our research on a simple, low-cost Netgear router running OpenWrt: most home networks we have studied send traffic at only tens of megabits per second, even at peak rate. We have been able to perform packet capture on very resource-limited devices at these rates.
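
To make the feasibility claim concrete, here is a minimal sketch of the kind of "metadata only" monitoring described above, assuming a vantage point on the home router and the scapy library. The interface name and the hourly aggregation are illustrative choices, not the actual pipeline from our study.

```python
# Minimal sketch: passive, metadata-only monitoring of a home network.
# Assumptions for illustration: capture runs on the router's LAN interface
# ("br-lan" here) and uses the scapy library.
from collections import defaultdict
from datetime import datetime

from scapy.all import sniff

bytes_per_device_hour = defaultdict(int)

def record(pkt):
    # Only layer-2 metadata is used: source MAC, packet size, and timestamp.
    # That is already enough to see which devices are active, and when.
    hour = datetime.fromtimestamp(float(pkt.time)).hour
    bytes_per_device_hour[(pkt.src, hour)] += len(pkt)

# Capture for one hour; tens of megabits per second is well within reach of
# a low-cost router or similar resource-limited device.
sniff(iface="br-lan", prn=record, store=False, timeout=3600)

for (mac, hour), nbytes in sorted(bytes_per_device_hour.items()):
    print(f"{mac}  hour={hour:02d}  bytes={nbytes}")
```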

Claim 2: End-to-End Encryption Limits ISP Visibility into User Behavior

The report’s claim: End-to-end encryption on websites is increasingly pervasive; thus, ISPs have limited visibility into user behavior.

A missing piece: ISPs can observe user activity based on general traffic patterns (e.g., volumes), unencrypted portions of communication, and the large number of in-home devices that do not encrypt traffic.

Nearly all Internet-connected devices use the Domain Name System (DNS) to look up domain names for specific Internet destinations. These DNS lookups are generally “in the clear” (i.e., unencrypted) and can be particularly revealing. For example, we conducted a recent study of traffic patterns from a collection of IoT devices; in that study, we observed, for example, that a Nest thermostat routinely performs a DNS lookup to frontdoor.nest.com, a popular digital photo frame routinely issued DNS queries to api.pix-star.com, and a popular IP camera routinely issued DNS queries to (somewhat ironically!) sharxsecurity.com. No sophisticated traffic analysis was required to identify the usage of these devices from plaintext DNS query traffic.
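
As a rough sketch of how little effort this takes, the snippet below logs the names in plaintext DNS queries as they cross a router interface. It uses scapy, the interface name is an illustrative assumption, and the device name in the comment is one of the examples from the discussion above.

```python
# Minimal sketch: plaintext DNS queries reveal which devices live in a home.
# The interface name is an illustrative assumption.
from scapy.all import sniff, DNS, DNSQR

def log_query(pkt):
    if pkt.haslayer(DNSQR) and pkt[DNS].qr == 0:      # qr == 0 means "query"
        name = pkt[DNSQR].qname.decode(errors="replace").rstrip(".")
        print(f"{pkt.src} looked up {name}")
        # A lookup of frontdoor.nest.com, for example, is a strong hint that
        # the device behind that MAC address is a Nest thermostat.

sniff(iface="br-lan", filter="udp port 53", prn=log_query, store=False)
```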

Even when a site uses HTTPS to communicate with an Internet destination, the initial TLS handshake typically indicates the hostname being contacted using the Server Name Indication (SNI), which allows the server to present the client with the appropriate certificate for that domain. The SNI is transmitted in cleartext and naturally reveals information about the domains that a user’s devices are communicating with.
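
For readers who want to see why the SNI is visible, here is a simplified sketch that pulls the server name out of a raw TLS ClientHello record. It is bare-bones by design (no error handling, and it assumes the entire ClientHello fits in a single record), so treat it as an illustration of the wire format rather than a robust parser.

```python
# Simplified sketch: extract the SNI hostname from a raw TLS ClientHello record.
# Assumes the whole ClientHello fits in one record; no error handling.
import struct

def extract_sni(record: bytes):
    if len(record) < 6 or record[0] != 0x16 or record[5] != 0x01:
        return None                          # not a handshake / not a ClientHello
    i = 9                                    # start of the ClientHello body
    i += 2 + 32                              # client_version + random
    i += 1 + record[i]                       # session_id
    (n,) = struct.unpack_from("!H", record, i); i += 2 + n    # cipher_suites
    i += 1 + record[i]                       # compression_methods
    (ext_total,) = struct.unpack_from("!H", record, i); i += 2
    end = i + ext_total
    while i + 4 <= end:
        ext_type, ext_len = struct.unpack_from("!HH", record, i); i += 4
        if ext_type == 0:                    # server_name extension
            # Skip the list length (2 bytes) and name_type (1 byte),
            # then read the 2-byte hostname length and the name itself.
            (name_len,) = struct.unpack_from("!H", record, i + 3)
            return record[i + 5:i + 5 + name_len].decode("ascii", "replace")
        i += ext_len
    return None
```

Because the name appears in the handshake before any keys are negotiated, any on-path observer, including the ISP, can read it.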

The report cites the deployment of HTTPS on many major websites as evidence that traffic from consumers is increasingly encrypted end-to-end. Yet, consumer networks are increasingly being equipped with Internet of Things (IoT) devices, many of which we have observed send traffic entirely in cleartext. In fact, of the devices we have studied, cleartext communication was the norm, not the exception. Of course, we all hope that many of these devices will ultimately shift to encrypted communications, but the current state of affairs is much different. Even in the longer term, it is possible that certain IoT devices may be so resource-limited as to make cryptography impractical, particularly in the case of low-cost IoT devices. The deployment of HTTPS on major websites is certainly encouraging for the state of privacy on the Internet in general, but it is a poor indicator for how much traffic from a home network is encrypted.

Claim 3: Users are Increasingly Using VPNs, Which Conceal User Activity from ISPs

The report’s claim: Users are increasingly using VPNs, which encrypt all traffic, including DNS, as it traverses the ISP; therefore, ISPs cannot see any user traffic.

A missing piece: DNS traffic sometimes goes to the ISP’s DNS server after it exits the VPN tunnel. Configuring certain devices to use VPNs may not be straightforward for many users.

Whether VPNs will prevent ISPs from seeing DNS traffic depends on the configuration of the VPN tunnel. A VPN is simply an encrypted tunnel that takes the original IP packet and encapsulates the packet in a new packet whose destination IP address is the tunnel endpoint. But, the IP address for DNS resolution is typically set by the Dynamic Host Configuration Protocol (DHCP). If the consumer uses the ISP’s DHCP server to configure the host in question (which most of us do), the client’s DNS server will still be the ISP’s DNS server, unless the client’s VPN software explicitly reconfigures the DNS server (many VPN clients do not).

In these cases, the ISP will continue to observe all of the user’s DNS traffic, even if the user configures a VPN tunnel: the DNS queries will exit the VPN tunnel and head right back to the ISP’s DNS server. It is often possible for a user to configure a device not to use the ISP’s DNS server, but this is by no means automatic, and in certain cases (e.g., on IoT devices) it may be quite difficult. Even in cases where a VPN uses its own DNS resolver, the traffic for those queries by no means stays local: DNS cache misses can cause these queries to traverse many ISPs.
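
As a rough way to check whether a host's DNS queries are even pointed at the VPN, the sketch below compares the resolvers configured on a Linux machine against the tunnel's address range. The tunnel subnet and the reliance on /etc/resolv.conf are assumptions for illustration, and a resolver outside the tunnel range is only a hint of a leak; routing details ultimately decide where the queries go.

```python
# Rough check (Linux): are the configured DNS resolvers inside the VPN tunnel's
# address range? The tunnel subnet below is a hypothetical example value.
import ipaddress
import re

VPN_TUNNEL_SUBNET = ipaddress.ip_network("10.8.0.0/24")   # assumed tunnel range

def configured_resolvers(path="/etc/resolv.conf"):
    with open(path) as f:
        return re.findall(r"^nameserver\s+(\S+)", f.read(), re.MULTILINE)

for server in configured_resolvers():
    inside = ipaddress.ip_address(server) in VPN_TUNNEL_SUBNET
    note = "inside the tunnel range" if inside else \
           "outside the tunnel range (queries may go straight to the ISP's resolver)"
    print(f"DNS server {server}: {note}")
```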

Traffic from VPNs doesn’t simply disappear: it merely resurfaces in another ISP that can subsequently monitor user activity. The opportunities for observing user traffic are substantial. For example, in a recent simple experiment that postdoc Philipp Winter performed, web requests from Tor exit relays to the Alexa top 1,000 websites traversed more than 350 Internet service providers; considering the DNS lookups from these exit relays as well, the traffic from these exit nodes traverses an additional 173 Internet service providers.

Furthermore, VPN clients are typically for desktop machines and, in some cases, mobile devices such as phones and tablets. As previously discussed, IoT devices in homes will continue to generate more traffic. Most such devices do not support VPN software. While it is conceivable that a user could set up an encrypted VPN tunnel from the home router and route all home traffic through a VPN, typical home gateways don’t easily support this functionality at this point, and configuring such a setup would be cumbersome for the typical user.

Conclusion

Policymakers, industry, and consumers should debate whether, when, and how the FCC should impose privacy protections for consumers. Robust debate, however, needs an understanding of the technical underpinnings that is as complete as possible. In this post, I have attempted to fill in what struck me as some missing pieces in Professor Swire’s discussion of ISPs’ ability to observe user activity in network traffic. The report implies that ISPs’ access to information about users’ online activity is neither “comprehensive” nor “unique”. Yet, an ISP is in a position to see much more user traffic, from many more devices, than other parties in the Internet ecosystem—and certainly much more than the paper would have the reader conclude. I hope that the original working paper is revised to reflect a more complete and balanced view of ISPs’ capabilities.


An analogy to understand the FBI’s request of Apple

After my previous blog post about the FBI, Apple, and the San Bernardino iPhone, I’ve been reading many other bloggers and news articles on the topic. What seems to be missing is a decent analogy to explain the unusual nature of the FBI’s demand and the importance of Apple’s stance in opposition to it. Before I dive in, it’s worth understanding what the FBI’s larger goals are. Cyrus Vance Jr., the Manhattan DA, states it clearly: “no smartphone lies beyond the reach of a judicial search warrant.” That’s the FBI’s real goal. The San Bernardino case is just a vehicle toward achieving that goal. With this in mind, it’s less important to focus on the specific details of the San Bernardino case, the subtle improvements Apple has made to the iPhone since the 5c, or the apparent mishandling of the iCloud account behind the San Bernardino iPhone.

Our Analogy: TSA Luggage Locks

When you check your bags in the airport, you may well want to lock them, to keep baggage handlers and other interlopers from stealing your stuff. But, of course, baggage inspectors have a legitimate need to look through bags. Your bags don’t have any right of privacy in an airport. To satisfy these needs, we now have “TSA locks”. You get a combination you can enter, and the TSA gets their own secret key that allows airport staff to open any TSA lock. That’s a “backdoor”, engineered into the lock’s design.

What’s the alternative? If you want the TSA to have the technical capacity to search a large percentage of bags, then there really isn’t an alternative. After all, if we used “real” locks, then the TSA would be “forced” to cut them open. But consider the hypothetical case where these sorts of searches were exceptionally rare. At that point, the local TSA could keep hundreds of spare locks, of all makes and models. They could cut off your super-duper strong lock, inspect your bag, and then replace the cut lock with a brand new one of the same variety. They could extract the PIN or key cylinder from the broken lock and install it in the new one. They could even rough up the new one so it looks just like the original. Needless to say, this would be a specialized skill and it would be expensive to use. That’s pretty much where we are in terms of hacking the newest smartphones.

Another area where this analogy holds up is all the people who will “need” access to the backdoor keys. Who gets the backdoor keys? Sure, it might begin with the TSA, but every baggage inspector in every airport, worldwide, will demand access to those keys. And they’ll even justify it, because their inspectors work together with ours to defeat smuggling and other crimes. We’re all in this together! Next thing you know, the backdoor keys are everywhere. Is that a bad thing? Well, the TSA backdoor lock scheme is only as secure as their ability to keep the keys a secret. And what happened? The TSA mistakenly allowed the Washington Post to publish a photo of all the keys, which makes it trivial for anyone to fabricate those keys. (CAD files for them are now online!) Consequently, anybody can take advantage of the TSA locks’ designed-in backdoor, not just all the world’s baggage inspectors.

For San Bernardino, the FBI wants Apple to retrofit a backdoor mechanism where there wasn’t one previously. The legal precedent that the FBI wants creates a capability to convert any luggage lock into a TSA backdoor lock. This would only be necessary if they wanted access to lots of phones, at a scale where their specialized phone-cracking team becomes too expensive to operate. This no doubt becomes all the more pressing for the FBI as modern smartphones get better and better at resisting physical attacks.

Where the analogy breaks down: If you travel with expensive stuff in your luggage, you know well that those locks have very limited resistance to an attacker with bolt cutters. If somebody steals your luggage, they’ll get your stuff, whereas that’s not necessarily the case with a modern iPhone. These phones are akin to luggage having some kind of self-destruct charge inside. You force the luggage open and the contents will be destroyed. Another important difference is that much of the data that the FBI presumably wants from the San Bernardino phone can be gotten elsewhere, e.g., phone call metadata and cellular tower usage metadata. We have very little reason to believe that the FBI needs anything on that phone whatsoever, relative to the mountain of evidence that it already has.

Why this analogy is important: The capability to access the San Bernardino iPhone, as the court order describes it, is a one-off thing—a magic wand that converts precisely one traditional luggage lock into a TSA backdoor lock, having no effect on any other lock in the world. But as Vance makes clear in his New York Times opinion, the stakes are much higher than that. The FBI wants this magic wand, in the form of judicial orders and a bespoke Apple engineering process, to gain backdoor access to any phone in their possession. If the FBI can go to Apple to demand this, then so can any other government. Apple will quickly want to get itself out of the business of adjudicating these demands, so it will engineer in the backdoor feature once and for all, albeit under duress, and will share the necessary secrets with the FBI and with every other nation-state’s police and intelligence agencies. In other words, Apple will be forced to install a TSA backdoor key in every phone they make, and so will everybody else.

While this would be lovely for helping the FBI gather the evidence it wants, it would be especially lovely for foreign intelligence officers, operating on our shores, or going after our citizens when they travel abroad. If they pickpocket a phone from a high-value target, our FBI’s policies will enable any intel or police organization, anywhere, to trivially exercise any phone’s TSA backdoor lock and access all the intel within. Needless to say, we already have a hard time defending ourselves from nation-state adversaries’ cyber-exfiltration attacks. Hopefully, sanity will prevail, because it would be a monumental error for the government to require that all our phones be engineered with backdoors.


Apple, the FBI, and the San Bernardino iPhone

Apple just posted a remarkable “customer letter” on its web site. To understand it, let’s take a few steps back.

In a nutshell, one of the San Bernardino shooters had an iPhone. The FBI wants to root through it as part of their investigation, but they can’t do this effectively because of Apple’s security features. How, exactly, does this work?

  • Modern iPhones (and also modern Android devices) encrypt their internal storage. If you were to just cut the Flash chips out of the phone and read them directly, you’d learn nothing.
  • But iPhones need to decrypt that internal storage in order to actually run software. The necessary cryptographic key material is protected by the user’s password or PIN.
  • The FBI wants to be able to exhaustively try all the possible PINs (a “brute force search”), but the iPhone was deliberately engineered with a “rate limit” to make this sort of attack difficult (see the back-of-the-envelope sketch after this list).
  • The only other option, the FBI claims, is to replace the standard copy of iOS with something custom-engineered to defeat these rate limits, but an iPhone will only accept an update to iOS if it’s digitally signed by Apple. Consequently, the FBI convinced a judge to compel Apple to create a custom version of iOS, just for them, solely for this investigation.
  • I’m going to ignore the legal arguments on both sides, and focus on the technical and policy aspects. It’s certainly technically possible for Apple to do this. They could even engineer their customized iOS build to measure the serial number of the iPhone on which it’s installed, such that the backdoor would only work on the San Bernardino suspect’s phone, without being a general-purpose skeleton key for all iPhones.
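
To see why the rate limit (and not the encryption itself) is the crux of the request, here is a back-of-the-envelope sketch. The roughly 80 ms per guess is the key-derivation cost Apple documents in its iOS security guide; the one-hour-per-guess figure stands in for the escalating lockout delays and is an illustrative simplification.

```python
# Back-of-the-envelope: how long does an exhaustive PIN search take?
# 80 ms/guess reflects the documented key-derivation cost; the 1 hour/guess
# case is a simplified stand-in for iOS's escalating lockout delays.
GUESS_COST_S = 0.080

def search_hours(pin_digits, per_guess_delay_s=0.0):
    guesses = 10 ** pin_digits
    return guesses * (GUESS_COST_S + per_guess_delay_s) / 3600

print(f"4-digit PIN, no rate limit: {search_hours(4):.1f} hours")          # ~0.2 hours
print(f"6-digit PIN, no rate limit: {search_hours(6):.1f} hours")          # ~22 hours
print(f"4-digit PIN, 1 h/guess:     {search_hours(4, 3600) / 24 / 365:.1f} years")  # ~1.1 years
```

With the rate limit removed (and the auto-erase option disabled), even a six-digit PIN falls to about a day of automated guessing; with the limit in place, the same search becomes hopeless.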

With all that as background, it’s worth considering a variety of questions.

Does the FBI’s investigation actually need access to the internals of the iPhone in question?

Apple’s letter states:

When the FBI has requested data that’s in our possession, we have provided it. Apple complies with valid subpoenas and search warrants, as we have in the San Bernardino case. We have also made Apple engineers available to advise the FBI, and we’ve offered our best ideas on a number of investigative options at their disposal.

In Apple’s FAQ on iCloud encryption, they describe how most iCloud features are encrypted both in transit and at rest, with the notable exception of email. So, if the San Bernardino suspect’s phone used Apple’s mail services, then the FBI can read that email. It’s possible that Apple genuinely cannot provide unencrypted access to other data in iCloud without the user’s passwords, but it’s also possible that the FBI could extract the necessary passwords (or related authentication tokens) from other places, like the suspect’s laptop computer.

Let’s assume, for the sake of discussion, that the FBI has not been able to get access to anything else on the suspect’s iPhone or its corresponding iCloud account, and they’ve exhausted all of their technical avenues of investigation. If the suspect used Gmail or some other service, let’s assume the FBI was able to get access to that as well. So what might they be missing? SMS / iMessage. Notes. Photos. Even knowing what other apps the user has installed could be valuable, since many of them have corresponding back-end cloud services, chock full of tasty evidence. Of course, the suspect’s emails and other collected data might already make for a compelling case against them. We don’t know.

Could the FBI still find a way into their suspect’s iPhone?

Almost certainly yes. Just yesterday, the big news was a security-critical bug in glibc that’s been around since 2008. And for every bug like this that the public knows about, our friends in the government have many more that they keep to themselves. If the San Bernardino suspect’s phone is sufficiently valuable, then it’s time to reach into the treasure chest (both figuratively and literally) and engineer a custom exploit. There’s plenty of attack surface available to them. That attack surface stretches to the suspect’s personal computers and other devices.

The problem with this sort of attack plan is that it’s expensive, it’s tricky, and it’s not guaranteed to work. Since long before the San Bernardino incident, the FBI has wanted a simpler solution. Get a legal order. Get access. Get evidence. The San Bernardino case clearly spells this out.

What’s so bad about Apple doing what the FBI wants?

Apple’s concern is the precedent set by the FBI’s demand and the judge’s order. If the FBI can compel Apple to create a backdoor like this, then so can anybody else. You’ve now opened the floodgates to every small-town police chief, never mind discovery orders in civil lawsuits. How is Apple supposed to validate and prioritize these requests? What happens when they come from foreign governments? If China demands a custom software build to attack a U.S. resident, how is Apple supposed to judge whether that user and their phone happen to be under the jurisdiction of Chinese law? What if the U.S. then passes a law prohibiting Apple from honoring Chinese requests like this? That way lies madness, and that’s where we’re going.

Even if we could somehow make this work, purely as an engineering matter, it’s not feasible to imagine a backdoor mechanism that will support the full gamut of seemingly legal requests to exercise it.

Is backdoor engineering really feasible? What are the tradeoffs?

If there’s anything that the computer security community has learned over the years, it’s that complexity is the enemy of security. One highly relevant example is SSL/TLS support for “export-grade cryptography” — a bad design left over from the 1990’s when the U.S. government tried to regulate the strength of cryptographic products. Last year’s FREAK attack boils down to an exploit that forces SSL/TLS connections to operate with degraded key quality. The solution? Remove all export-grade cipher suites from SSL/TLS implementations, since they’re not used and not needed any more.

The only way that we know how to build secure software is to make it simple, to use state-of-the-art techniques, and to get rid of older features that we know are weak. Backdoor engineering is the antithesis of this process.

What are appropriate behaviors for an engineering organization like Apple? I’ll quote Google’s Eric Grosse:

Eric Grosse, Google’s security chief, suggested in an interview that the N.S.A.’s own behavior invited the new arms race.

“I am willing to help on the purely defensive side of things,” he said, referring to Washington’s efforts to enlist Silicon Valley in cybersecurity efforts. “But signals intercept is totally off the table,” he said, referring to national intelligence gathering.

“No hard feelings, but my job is to make their job hard,” he added.

As a national policy matter, we need to decide what’s more important: backdoor access to user data, or robustness against nation-state adversaries. If you want backdoor access, then the cascade of engineering decisions that will be necessary to support those backdoors will fundamentally weaken our national security posture. On the flip side, strong defenses are strong against all adversaries, including the domestic legal system.

Indeed, the FBI and other law enforcement agencies will need to come to terms with the limits of their cyber-investigatory powers. Yes, the data you want is out there. No, you can’t get what you want, because cyber-defense must be a higher priority.

What are the alternatives? Can the FBI make do without what it’s asking?

How might the FBI cope in a world where Apple, Google, and other engineering organizations build walls that law enforcement cannot breach? I suspect they’ll do just fine. We know the FBI has remarkably good cyber-investigators. For example, the FBI hacked “approximately 1300” computers as part of a child pornography investigation. Likewise, even if phone data is encrypted, the metadata generated just walking around with a phone is amazing. For example, researchers discovered that

data from just four, randomly chosen “spatio-temporal points” (for example, mobile device pings to carrier antennas) was enough to uniquely identify 95% of the individuals, based on their pattern of movement.

In other words, even if you use “burner” phones, investigators can connect them together based on your patterns of movement. With techniques like this, the FBI has access to a mountain of data on their San Bernardino suspects, far more than they ever might have collected in the era before smartphones.

In short, the FBI’s worries that targets of its investigations are “going dark” are simply not credible, and their attempts to co-opt technology companies into giving them back doors are working against our national interests.


How Does Zero-Rating Affect Mobile Data Usage?

On Monday, the Telecom Regulatory Authority of India (TRAI) released a decision that effectively bans “zero-rated” Internet services in the country. While the notion of zero-rating might be somewhat new to many readers in the United States, the practice is common in many developing economies. Essentially, zero-rating is an arrangement whereby a carrier does not charge its customers normal data rates for accessing certain content.

High-profile instances of zero-rating include Facebook’s “Free Basics” (formerly “Internet.org”) and Wikipedia Zero. But, many readers might be surprised to learn that the practice is impressively widespread. Although comprehensive documentation is hard to come by, experience and conventional wisdom affirm that mobile carriers in regions across the world regularly partner with content providers to offer services that are effectively free to the consumer, and these offerings tend to change frequently.

I experienced zero-rating first-hand on a trip to South Africa last summer. While on a research trip there, I learned that Cell C, a mobile telecom provider, had partnered with Internet.org to offer its subscribers free access to a limited set of sites through the Internet.org mobile application. I immediately wondered whether a citizen’s socioeconomic class could affect Internet usage—and, as a consequence, their access to information.

Zero-rating evokes a wide range of (strong) opinions (emphasis on “opinion”). Mark Zuckerberg would have us believe that Free Basics is a way to bring the Internet to the next billion people, where the alternative might be that this demographic might not have access to the Internet at all. This, of course, presumes that we equate “access to Facebook” with “access to the Internet”—something which at least one study has shown can occur (and is perhaps even more cause for concern). Others have argued that zero-rated services violate network neutrality principles and could also result in the creation of walled gardens where citizens’ Internet access might be brokered by a few large and powerful organizations.

And yet, while the arguments about zero-rating are loud, emotional, and increasingly high-stakes, the opinions on either side have yet to be supported by any actual data.

We Must Bring Data to this Debate

Unfortunately, there is essentially no data concerning the central question of how users adjust their behavior in response to mobile data pricing practices. Erik Stallman’s eloquent post today on the TRAI ruling and the Center for Democracy and Technology’s recent white paper on zero-rating both lament the lack of data on either side of the debate.

I want to change that. To this end, as Internet measurement researchers and policy-interested computer scientists, we are starting to bring some data to this debate—although we still have a long way to go.

As luck would have it, we had already been gathering some data that shed some light on this question. In 2013, we developed a mobile performance measurement application, My Speed Test, which performs speed test measurements of a user’s mobile network, but also gathers information about a user’s application usage, and whether that usage occurs on the cellular data network or on a Wi-Fi network. My Speed Test has been installed on thousands of phones in countries around the world over this three-year period. In addition to a significant base of installations in the United States, we had several hundred users running the application in South Africa, due to a study of mobile network performance that we performed in the country a couple of years ago.

This deployment gave us a unique opportunity to study the application usage patterns of a group of users, across a wide range of carriers, across countries, over three years. It allowed us to compare usage patterns in the United States (where many users are on post-paid plans) to those in South Africa (where most users are on pre-paid, pay-as-you-go plans).  It also allowed us to look at how users responded to zero-rated services in South Africa. A superstar undergraduate student, Ava Chen, led this research in collaboration with Enrico Calandro at Research ICT Africa and Sarthak Grover, a Ph.D. student here at Princeton. I briefly summarize some of Ava’s results below.

The results of this study are preliminary. More widespread deployment of My Speed Test would ultimately allow us to gather more data and draw more conclusive results. We could use your help in spreading the word about our work and My Speed Test.

Effects of Zero Rating on Usage

We explored the extent to which the zero-rating offerings of various South African carriers affected usage patterns for different applications. During our data-collection period, these carriers offered their customers several zero-rating packages:

  • From November 19, 2014 to August 31, 2015, Cell C zero-rated WhatsApp. From September 1, 2015 until now, Cell C adopted a bundle offer where, for a fee of ZAR 5 (about $0.30), users could use up to 1 GB on WhatsApp for 30 days, including voice calls.
  • On July 1, 2015, Cell C began zero-rating Facebook’s Free Basics service.
  • On two separate occasions—May 1–July 31, 2014 and August 1, 2014–February 13, 2015—MTN zero-rated Twitter.

We aimed to determine whether users adjusted their mobile behavior in response to these various pricing promotions. We found the following trends:

Cell C users increased WhatsApp usage by more than a factor of three on both cellular and Wi-Fi. The average monthly user on Cell C increased WhatsApp usage on the cellular network by a factor of three, from about 7 MB per month to about 22 MB per month on average. Interestingly, not only did the usage of WhatsApp on the cellular network increase, usage also increased on Wi-Fi networks, by more than a factor of seven—to about 17 MB per month. Even so, users still used WhatsApp more on the cellular data network than on Wi-Fi.

Cell C users’ WhatsApp usage increased on both Wi-Fi and cellular in response to a zero-rating offering from the carrier.

Twitter usage on MTN increased in response to zero-rating.  We limited our analysis of MTN’s zero-rating practices to 2014, because we did not have enough data to draw conclusive results from the second period. Our analysis of the 2014 period, however, found that  aside from the holiday season (when Twitter traffic is known to spike due to shopping promotions), the second most significant spike in usage on MTN occurred during the period from May through July 2014 when the zero-rating promotion was in effect. During this period the average Twitter user on MTN exchanged as much as 40 MB per day on Twitter, whereas usage outside of the promotional period was typically closer to about 10 MB per day.

Other Responses to Mobile Data Pricing

Mobile users in the United States use more mobile data, on both cellular and Wi-Fi. Mobile users in both the United States and South Africa used YouTube and Facebook extensively; other applications were more country-specific. We noticed some interesting trends. First, when looking at the total data usage for these applications in each country, the median user in the United States tended to use more data per month, not only on the cellular data network but also on Wi-Fi networks. It is understandable that South African users would be far more conservative with their use of cellular data; previous studies have noted this effect. It is remarkable, however, that these users were also more conservative with their data usage on Wi-Fi networks; this effect could be explained by the fact that even Wi-Fi and wired Internet connections in South Africa are still considerably more expensive (and more of a luxury good) than they are in the United States. In contrast, users in the United States not only used more data in general, but they often used more data on average on cellular data networks than on Wi-Fi—perhaps because users in the US were much less sensitive to the cost of mobile data than those in South Africa.

Mobile users in South Africa exchanged significantly more Facebook traffic than streaming video traffic—even when on cellular data plans. Given the high cost of cellular data in South Africa, we expected that users would be conservative with mobile data usage in general. Although our findings mostly confirmed this, Facebook was a notable exception: Not only did the typical user consume significantly more traffic using Facebook than with streaming video, users also exchanged more Facebook traffic over the cellular network than they did on Wi-Fi networks. This behavior suggests that Facebook usage is dominant to the extent that users appear to be more willing to pay for relatively expensive mobile data to use it than they are for other applications.

Summary and Request for Help

Our preliminary evidence suggests that zero-rated pricing structures may have an effect on usage of an application—not only on the cellular network where pricing instruments are implemented, but also in general. However, we need more data to draw stronger conclusions. We are actively seeking collaborations to help us deploy My Speed Test on a larger scale, to facilitate a larger-scale analysis.

To this end, we are excited to announce a collaboration with the Alliance for an Affordable Internet (A4AI) to use My Speed Test to study these effects in other countries on a larger scale. We are interested in gathering more widespread longitudinal data on this topic, through both organic installations of the application and studies with targeted recruitment.

Please let me know if you would like to help us in this important effort!


The Princeton Bitcoin textbook is now freely available

The first complete draft of the Princeton Bitcoin textbook is now freely available. We’re very happy with how the book turned out: it’s comprehensive, at over 300 pages, but has a conversational style that keeps it readable.

If you’re looking to truly understand how Bitcoin works at a technical level and have a basic familiarity with computer science and programming, this book is for you. Researchers and advanced students will find the book useful as well — starting around Chapter 5, most chapters have novel intellectual contributions.

Princeton University Press is publishing the official, peer-reviewed, polished, and professionally done version of this book. It will be out this summer. If you’d like to be notified when it comes out, you should sign up here.

Several courses have already used an earlier draft of the book in their classes, including Stanford’s CS 251. If you’re an instructor looking to use the book in your class, we welcome you to get in touch, and we’d be happy to share additional teaching materials with you.

Online course and supplementary materials. The Coursera course accompanying this book had 30,000 students in its first version, and it was a success based on engagement and end-of-course feedback. 

We plan to offer a version with some improvements shortly. Specifically, we’ll be integrating the programming assignments developed for the Stanford course with our own, with Dan Boneh’s gracious permission. We also have tentative plans to record a lecture on Ethereum (we’ve added a discussion of Ethereum to the book in Chapter 10).

Finally, graduate students at Princeton have been leading the charge on several exciting research projects in this space. Watch this blog or my Twitter for updates.


Updating the Defend Trade Secrets Act?

Despite statements to the contrary by sponsors and supporters in April 2014, August 2015, and October 2015, backers of the Defend Trade Secrets Act (DTSA) now aver that “cyber espionage is not the primary focus” of the legislation. At last month’s Senate Judiciary Committee hearing, the DTSA was instead supported by two different primary reasons: the rise of trade secret theft by rogue employees and the need for uniformity in trade secret law.

While a change in a policy argument is not inherently bad, the alteration of the core justification for a bill should be considered when assessing it. Perhaps the new position of DTSA proponents acknowledges the arguments by over 40 academics, including me, that the DTSA will not reduce cyberespionage. However, we also disputed these new rationales in that letter: the rogue employee is more than adequately addressed by existing trade secret law, and there will be less uniformity in trade secrecy under the DTSA because of the lack of federal jurisprudence.

The downsides — including weakened industry cybersecurity, abusive litigation against small entities, and resurrection of the anti-employee inevitable disclosure doctrine — remain. As such, I continue to oppose the DTSA as a giant trade secrecy policy experiment with little data to back up its benefits and much evidence of its costs.


Who Will Secure the Internet of Things?

Over the past several months, CITP-affiliated Ph.D. student Sarthak Grover and fellow Roya Ensafi have been investigating various security and privacy vulnerabilities of Internet of Things (IoT) devices in the home network, to get a better sense of the current state of smart devices that many consumers have begun to install in their homes.

To explore this question, we purchased a collection of popular IoT devices, connected them to a laboratory network at CITP, and monitored the traffic that these devices exchanged with the public Internet. We initially expected that end-to-end encryption might foil our attempts to monitor the traffic to and from these devices. The devices we explored included a Belkin WeMo Switch, the Nest Thermostat, an Ubi Smart Speaker, a Sharx Security Camera, a PixStar Digital Photoframe, and a Smartthings hub.

What We Found: Be Afraid!

Many devices fail to encrypt at least some of the traffic that they send and receive. Investigating the traffic to and from these devices turned out to be much easier than expected, as many of the devices exchanged personal or private information with servers on the Internet in the clear, completely unencrypted.

We presented a summary of our findings to the Federal Trade Commission last week at PrivacyCon.  The video of Sarthak’s talk is available from the FTC website, as well as on YouTube.  Some of the more striking findings include:

  • The Nest thermostat was revealing location information of the home and weather station, including the user’s zip code, in the clear.  (Note: Nest promptly fixed this bug after we notified them.)
  • The Ubi uses unencrypted HTTP to communicate information to its portal, including voice chats, sensor readings (sound, temperature, light, humidity). It also communicates to the user using unencrypted email. Needless to say, much of this information, including the sensor readings, could reveal critical information, such as whether the user was home, or even movements within a house.
  • The Sharx security camera transmits video over unencrypted FTP; if the server for the video archive is outside of the home, this traffic could also be intercepted by an eavesdropper.
  • All traffic to and from the PixStar photoframe was sent unencrypted, revealing many user interactions with the device.

Traffic capture from Nest Thermostat in Fall 2015, showing user zip code and other information in cleartext.

Traffic capture from Ubi, which sends sensor values and states in clear text.

Some devices encrypt data traffic, but encryption may not be enough. A natural reaction to some of these findings might be that these devices should encrypt all traffic that they send and receive. Indeed, some devices we investigated (e.g., the Smartthings hub) already do so. Encryption may be a good starting point, but by itself, it appears to be insufficient for preserving user privacy.  For example, user interactions with these devices generate traffic signatures that reveal information, such as when power to an outlet has been switched on or off. It appears that simple traffic features such as traffic volume over time may be sufficient to reveal certain user activities.
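
A minimal sketch of what such an inference might look like: given per-interval byte counts for a single, fully encrypted smart outlet, flag the intervals where traffic volume jumps well above the baseline. The byte counts and the threshold are made up for illustration; the point is that no decryption is involved.

```python
# Illustrative only: infer "something happened" events for an encrypted device
# purely from traffic volume. The byte counts below are made-up sample data.
from statistics import mean, pstdev

# Bytes sent by a hypothetical smart outlet in consecutive 10-second windows.
byte_counts = [310, 290, 305, 300, 295, 2150, 320, 310, 1980, 300, 315, 290]

baseline, spread = mean(byte_counts), pstdev(byte_counts)
threshold = baseline + 2 * spread

for window, nbytes in enumerate(byte_counts):
    if nbytes > threshold:
        # Spikes like these often line up with user interactions, e.g.,
        # toggling the outlet from its companion app.
        print(f"window {window}: {nbytes} bytes -> likely user interaction")
```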

In all cases, DNS queries from the devices clearly indicate the presence of these devices in a user’s home. Indeed, even when the data traffic itself is encrypted, other traffic sent in the clear, such as DNS lookups, may reveal not only the presence of certain devices in your home, but likely also information about both usage and activity patterns.

Of course, there is also the concern about how these companies may use and share the data that they collect, even if they manage to collect it securely. And, beyond the obvious and more conventional privacy and security risks, there are also potential physical risks to infrastructure that may result from these privacy and security problems.

Old problems, new constraints. Many of the security and privacy problems that we see with IoT devices sound familiar, but they arise in a new context, which presents unique challenges:

  • Fundamentally insecure. Manufacturers of consumer products have little interest in releasing software patches and may even design the device without any interfaces for patching the software in the first place.  There are various examples of insecure devices that ordinary users may connect to the network without any attempts to secure them (or any means of doing so).  Occasionally, these insecure devices can result in “stepping stones” into the home for attackers to mount more extensive attacks. A recent study identified more than 500,000 insecure, publicly accessible embedded networked devices.
  • Diverse. Consumer IoT settings bring a diversity of devices, manufacturers, firmware versions, and so forth. This diversity can make it difficult for a consumer (or the consumer’s ISP) to answer even simple questions such as exhaustively identifying the set of devices that are connected to the network in the first place, let alone detecting behavior or network traffic that might reveal an anomaly, compromise, or attack.
  • Constrained. Many of the devices in an IoT network are severely resource-constrained: the devices may have limited processing or storage capabilities, or even limited battery life, and they often lack a screen or intuitive user interface. In some cases, a user may not even be able to log into the device.  

Complicating matters, a user has limited control over the IoT device, particularly as compared to a general-purpose computing platform. When we connect a general purpose device to a network, we typically have at least a modicum of choice and control about what software we run (e.g., browser, operating system), and perhaps some more visibility or control into how that device interacts with other devices on the network and on the public Internet. When we connect a camera, thermostat, or sensor to our network, the hardware and software are much more tightly integrated, and our visibility into and control over that device is much more limited. At this point, we have trouble, for example, even knowing that a device might be sending private data to the Internet, let alone being able to stop it.

Compounding all of these problems, of course, is the access a consumer gives an untrusted IoT device to other data or devices on the home network, simply by connecting it to the network—effectively placing it behind the firewall and giving it full network access, including in many cases the shared key for the Wi-Fi network.

A Way Forward

Ultimately, multiple stakeholders may be involved with ensuring the security of a networked IoT device, including consumers, manufacturers, and Internet service providers. Many questions remain unanswered concerning who is able to secure these devices (and who is responsible for doing so), but we should start the discussion about how to improve the security of networks with IoT devices.

This discussion will include both policy aspects (including who bears the ultimate responsibility for device insecurity, whether devices need to adopt standard practices or behavior, and for how long their manufacturers should continue to support them), as well as technical aspects (including how we design the network to better monitor and control the behavior of these often-insecure devices).

Devices should be more transparent. The first step towards improving security and privacy for IoT should be to work with manufacturers to improve the transparency of these IoT devices, so that consumers (and possibly ISPs) have more visibility into what software the devices are running, and what traffic they are sending and receiving. This, of course, is a Herculean effort, given the vast quantity and diversity of device manufacturers; an alternative would be trying to infer what devices are connected to the network based on their traffic behavior, but doing so in a way that is comprehensive, accurate, and reasonably informative seems extremely difficult.

Instead, some IoT device manufacturers might standardize on a manifest protocol that announces basic information, such as the device type, available sensors, firmware version, the set of destinations the device expects to communicate with (and whether the traffic is encrypted), and so forth. (Of course, such a manifest poses its own security risks.)
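
To make the idea concrete, a manifest for something like a connected camera might look like the sketch below. This is a hypothetical format invented for illustration (no such standard exists), and every field name and value here is made up.

```python
# Hypothetical device manifest (illustrative only; no such standard exists).
# A device could announce something like this to the home router when it joins.
camera_manifest = {
    "device_type": "ip-camera",
    "manufacturer": "ExampleCam",            # made-up vendor
    "firmware_version": "2.4.1",
    "sensors": ["video", "audio", "motion"],
    "expected_destinations": [
        {"host": "video.examplecam.com", "port": 443, "encrypted": True},
        {"host": "ntp.examplecam.com",   "port": 123, "encrypted": False},
    ],
    "local_only": False,                     # the device claims it needs Internet access
}
```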

Network infrastructure can play a role. Given such basic information, anomalous behavior that is suggestive of a compromise or data leak would be more evident to network intrusion detection systems and firewalls—in other words, we could bring more of our standard network security tools to bear on these devices, once we have a way to identify what the devices are and what their expected behavior should be. Such a manifest might also serve as a concise (and machine readable!) privacy statement; a concise manifest might be one way for consumers to evaluate their comfort with a certain device, even though it may be far from a complete privacy statement.

Armed with such basic information about the devices on the network, smart network switches would have a much easier way to implement network security policies. For example, a user could specify that the smart camera should never be able to communicate with the public Internet, or that the thermostat should only be able to interact with the window locks if someone is present.
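
As a sketch of how a home gateway might act on that kind of information, the snippet below combines a user-specified policy with the hypothetical manifest from the previous example to make a simple allow/deny decision per outbound flow. It is illustrative pseudocode made runnable, not an existing switch or controller API.

```python
# Illustrative per-flow policy check a smart home gateway could apply.
# The manifest format follows the hypothetical sketch above.
def allow_flow(manifest, policy, dst_host, dst_port):
    # Policy 1: the user has pinned this device to the local network only
    # ("the smart camera should never communicate with the public Internet").
    if policy.get("block_internet", False):
        return False
    # Policy 2: only allow destinations the device declared in its manifest.
    if policy.get("manifest_destinations_only", False):
        declared = {(d["host"], d["port"]) for d in manifest["expected_destinations"]}
        return (dst_host, dst_port) in declared
    return True

camera_policy = {"block_internet": True}
print(allow_flow(camera_manifest, camera_policy, "video.examplecam.com", 443))  # False
```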

Current network switches don’t provide easy mechanisms for consumers to either express or implement these types of policies. Advances in Software-Defined Networking (SDN) in software switches such as Open vSwitch may make it possible to implement policies that resolve contention for shared resources and conflicts, or to isolate devices on the network from one another. But even if that is a reasonable engineering direction, this technology will only take us part of the way: users will ultimately need far better interfaces both to monitor network activity and to express policies about how these devices should behave and exchange traffic.

Update [20 Jan 2016]: After significant press coverage, Nest has contacted the media to clarify that the information being leaked in cleartext was not the zip code of the thermostat, but merely the zip code that the user enters when configuring the device. (Clarifying statement here.) Of course, when would a user ever enter a zip code other than that of the home where the thermostat is located?


The Web Privacy Problem is a Transparency Problem: Introducing the OpenWPM measurement tool

In a previous blog post I explored the success of our study, The Web Never Forgets, in having a positive impact on web privacy. To ensure a lasting impact, we've been running monthly, automated 1-million-site measurements of tracking and privacy. Soon we'll be releasing these datasets and our findings. But in this post I'd like to introduce OpenWPM, the web measurement platform we've built for this purpose. OpenWPM has been quickly gaining adoption: it has already been used by at least 6 other research groups, as well as by journalists, regulators, and students for class projects. In this post, I'll explain why we built OpenWPM, describe a previously unmeasured type of tracking we found using the tool, and show how you can participate and contribute to this community effort.

This post is based on a talk I gave at the FTC’s PrivacyCon. You can watch the video online here.

Why monthly, large-scale measurements are necessary

In my previous post, I showed how measurements from academic studies can help improve online privacy, but I also pointed out how they can fall short. Measurement results often have an immediate impact on online privacy. Unless that impact leads to a technical, policy, or legal solution, the impact will taper off over time as the measurements age.

Technical solutions do not always exist for privacy violations. I discussed how canvas fingerprinting can’t be prevented without sacrificing usability in my previous blog post, but there are others as well. For example, it has proven difficult to find a satisfactory solution to the privacy concerns surrounding WebRTC’s access to local IPs. This is also highlighted in the unofficial W3C draft on Fingerprinting Guidance for Web Specification Authors, which states: “elimination of the capability of browser fingerprinting by a determined adversary through solely technical means that are widely deployed is implausible.”

It seems inevitable that measurement results will go out of date, for two reasons. Most obviously, there is a high engineering cost to running privacy studies. Equally important is the fact that academic papers in this area are published as much for their methodological novelty as for their measurement results. Updating the results of an earlier study is unlikely to lead to a publication, which takes away the incentive to do it at all. [1]

OpenWPM: our platform for automated, large-scale web privacy measurements

We built OpenWPM (Github, technical report), a generic platform for online tracking measurement. It provides the stability and instrumentation necessary to run many online privacy studies. Our goal in developing OpenWPM is to decrease the initial engineering cost of studies and make running a measurement as effortless as possible. It has already been used in several published studies from multiple institutions to detect and reverse engineer online tracking.

OpenWPM also makes it possible to run large-scale measurements with Firefox, a real consumer browser [2]. Large scale measurement lets us compare the privacy practices of the most popular sites to those in the long tail. This is especially important when observing the use of a tracking technique highlighted in a measurement study. For example, we can check if it’s removed from popular sites but added to less popular sites.

Transparency through measurement, on 1 million sites

We are using OpenWPM to run the Princeton Transparency Census, a monthly web-scale measurement of tracking techniques and privacy issues, comprising 1 million sites. With it, we will be able to detect and measure many of the known privacy violations reported by researchers so far: the use of stateful tracking mechanisms, browser fingerprinting, cookie synchronization, and more.

During the measurements, we'll collect data in three categories: (1) network traffic — all HTTP requests and response headers; (2) client-side state — cookies, Flash cookies, etc.; (3) execution traces — we trap and record targeted JavaScript API calls that have been known to be used for tracking. In addition to releasing all of the raw data collected during the census, we'll release the results of our own automated analysis.

Alongside the 1 million site measurement, we are also running smaller, targeted measurements with different browser configurations. Examples include crawling deeper into the site or browsing with a privacy extension, such as Ghostery or AdBlock Plus. These smaller crawls will provide additional insight into the privacy threats faced by real users.

Detecting WebRTC local IP discovery

As a case study of the ease of introducing a new measurement into the infrastructure, I'll walk through the steps I took to measure scripts that use WebRTC to discover a machine's local IP address [3]. For machines behind a home router, this technique may reveal an IP address of the form 192.168.1.*. For users of corporate or university networks, it may reveal a unique local IP address within that organization's address range.

A user's local IP address adds additional information to a browser fingerprint. For example, it can be used to differentiate multiple users behind a NAT without requiring browser state. How much identifying information it provides for the average user hasn't been studied. However, both Chrome and Firefox [4] have implemented opt-in solutions to prevent the technique. The first reported use of this technique in the wild that I could find was a third party on nytimes.com in July 2015.
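For readers unfamiliar with the technique, the sketch below shows roughly how a script could harvest local IP addresses via WebRTC at the time of these measurements; it is a simplified illustration rather than any particular tracker's code, and modern browsers now obfuscate local ICE candidates so it no longer works as-is.

```typescript
// Simplified sketch of WebRTC-based local IP discovery (circa 2015–2016).
// Modern browsers replace local ICE candidates with mDNS names, so this is historical.
function discoverLocalIPs(report: (ip: string) => void): void {
  const pc = new RTCPeerConnection({ iceServers: [] });
  pc.createDataChannel("probe");                    // triggers ICE candidate gathering
  pc.onicecandidate = (event) => {
    if (!event.candidate) return;                   // a null candidate means gathering finished
    const match = event.candidate.candidate.match(/(\d{1,3}\.){3}\d{1,3}/);
    if (match) report(match[0]);                    // e.g. "192.168.1.42"
  };
  pc.createOffer()
    .then((offer) => pc.setLocalDescription(offer))
    .catch(() => { /* ignore errors in this sketch */ });
}

discoverLocalIPs((ip) => console.log("local IP candidate:", ip));
```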

After examining a demo script, I decided to record all property accesses and all method calls on the RTCPeerConnection interface, the primary interface for WebRTC. The additional instrumentation necessary for this interface is just a single line of JavaScript in OpenWPM's Firefox extension.
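OpenWPM can do this in one line because its wrapping machinery already lives in the extension; as a rough, standalone illustration of the general "wrap and record" idea (not OpenWPM's actual code), one could monkey-patch a method on the interface's prototype:

```typescript
// Illustrative sketch of logging calls to one RTCPeerConnection method.
// OpenWPM's real instrumentation is more general; this only shows the idea.
const proto = RTCPeerConnection.prototype as any;
const originalCreateOffer = proto.createOffer;

proto.createOffer = function (this: any, ...args: any[]) {
  // Record the call, its arguments, and a stack trace identifying the calling script.
  console.log("RTCPeerConnection.createOffer called", {
    args,
    stack: new Error().stack,
  });
  return originalCreateOffer.apply(this, args);
};
```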

A preliminary analysis [5] of a 50,000 site pilot measurement from October 2015 suggests that WebRTC local IP discovery is used on the homepages of over 100 sites, from over 20 distinct scripts. Only 1 of these scripts would be blocked by EasyList or EasyPrivacy.

How can this be useful for you?

We envision several ways researchers and other members of the community can make use of OpenWPM and our measurements. I've listed them here from least involved to most involved.

(1) Use our measurement data for their own tools. In my analysis of canvas fingerprinting I mentioned that Disconnect incorporated our research results into their blocklist. We want to make it easy for privacy tools to make use of the analysis we run, by releasing analysis data in a structured, machine readable way.

(2) Use the data collected during our measurements, and build their own analysis on top of it. We know we'll never be able to take the human element out of these studies. Detection methodologies will change, new browser features will be released, and others will change. The depth of the Transparency Census measurements should make it easy to test new ideas, with the option of contributing them back to the regular crawls.

(3) Use OpenWPM to collect and release their own data. This is the model we see most web privacy researchers opting for, and a model we plan to use for most of our own studies. The platform can be used and tweaked as necessary for the individual study, and the measurement results and data can be shared publicly after the study is complete.

(4) Contribute to OpenWPM through pull requests. This is the deepest level of involvement we see. Other developers can write new features into the infrastructure for their own studies or to be run as part of our transparency measurements. Contributions here will benefit all users of OpenWPM.

Over the coming months we will release new blog posts and research results on the measurements I’ve discussed in this post. You can follow our progress here on Freedom to Tinker, on Twitter @s_englehardt, and on our Github repository.

 

[1] Notable exceptions include the studies of cookie respawning (2009, 2011, 2011, 2014) and the statistics on stateful tracking use and third-party inclusion (2009, 2012, 2012, 2012, 2015).

[2] Crawling with a real browser is important for two reasons: (1) it’s less likely to be detected as a bot, meaning we’re less likely to receive different treatment from a normal user, and (2) a real browser supports all the modern web features (e.g. WebRTC, HTML5 audio and video), plugins (e.g. Flash), and extensions (e.g. Ghostery, HTTPS Everywhere). Many of these additional features play a large role in the average user’s privacy online.

[3] There is an additional concern that WebRTC can be used to determine a VPN user's actual IP address; however, that attack is distinct from the one described in this post.

[4] uBlock Origin also provides an option to prevent WebRTC local IP discovery on Firefox.

[5] We are in the process of running and verifying this analysis on our 1 million site measurements, and will release an updated analysis with more details in the future.


Do privacy studies help? A Retrospective Look at Canvas Fingerprinting

It seems like every month we hear of some new online privacy violation in the news, on topics such as fingerprinting or web tracking. Many of these news stories highlight academic research. What we don’t see is whether these studies and the subsequent news stories have any impact on privacy.

Our 2014 canvas fingerprinting measurement offers an opportunity for me to provide that insight, as we ended up receiving a surprising amount of press coverage after releasing the paper. In this post I’ll examine the reaction to the paper and explore which aspects contributed to its positive impact on privacy. I’ll also explore how we can use this knowledge when designing future studies to maximize their impact.

What we found in 2014

The 2014 measurement paper, The Web Never Forgets, is a collaboration with researchers at KU Leuven. In it, we measured the prevalence of three persistent tracking techniques online: canvas fingerprinting, cookie respawning, and cookie syncing [1]. They are persistent in that they are hard to control, hard to detect, and resilient to blocking or removal.

We found that 5% of the top 100,000 sites were utilizing the HTML5 Canvas API as a fingerprinting vector, and that the overwhelming majority of that use, 97%, was attributable to the top two providers. The ability to use the HTML5 Canvas as a fingerprinting vector was first introduced in a 2012 paper by Mowery and Shacham. In the time between that 2012 paper and our 2014 measurement, approximately 20 sites and trackers started using canvas to fingerprint their visitors.

[Figure: Several examples of the text written to the canvas for fingerprinting purposes. Each of these images would be converted to strings and then hashed to create an identifier.]
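For context, the sketch below illustrates roughly how such a fingerprinting script works: it renders text and shapes to a canvas, reads the rendered pixels back, and hashes the result; subtle differences in fonts, anti-aliasing, and graphics stacks make the hash differ across machines. This is a simplified illustration with a toy hash, not any specific tracker's script.

```typescript
// Simplified canvas fingerprinting sketch; real scripts use a proper hash (e.g. MD5 or MurmurHash).
function canvasFingerprint(): string {
  const canvas = document.createElement("canvas");
  const ctx = canvas.getContext("2d");
  if (!ctx) return "";

  ctx.textBaseline = "top";
  ctx.font = "14px Arial";
  ctx.fillStyle = "#f60";
  ctx.fillRect(125, 1, 62, 20);
  ctx.fillStyle = "#069";
  ctx.fillText("Hello, world!", 2, 15);     // rendering differs subtly across machines

  const data = canvas.toDataURL();          // serialize the rendered pixels as a string
  let hash = 0;                             // toy 32-bit rolling hash, for illustration only
  for (let i = 0; i < data.length; i++) {
    hash = (Math.imul(hash, 31) + data.charCodeAt(i)) | 0;
  }
  return (hash >>> 0).toString(16);
}

console.log("canvas fingerprint:", canvasFingerprint());
```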

The reaction to our study

Shortly after we released our paper publicly, we saw a significant amount of press coverage, including articles on ProPublica, BBC, Der Spiegel, and more. The amount of coverage our paper received was a surprise for us; we weren’t the creators of the method, and we certainly weren’t the first to report on the fingerprintability of browsers [2]. Just two days later, AddThis stopped using canvas fingerprinting. The second largest provider at the time, Ligatus, also stopped using the technique.

As can be expected, many users took their frustrations to Twitter: some wondered why publishers would fingerprint them, some complained about AddThis, and others expressed their dislike for canvas fingerprinting in general. We even saw a user ask why Mozilla does not protect against canvas fingerprinting in Firefox.

However, a general technical solution that preserves the API's usefulness and usability doesn't exist [3]. Instead, the best options are either blocking the feature or blocking the trackers that use it.

The developer community responded by releasing canvas-blocking extensions for Firefox and Chrome, tools which are used by over 18,000 users in total. AdBlock Plus and Disconnect both commented that the large trackers were already on their block lists, and Disconnect mentioned that the additional, lesser-known parties from our study would be added to its lists.

Why was our study so impactful?

Much of the online privacy problem is actually a transparency problem. By default, users have very little information on the privacy practices of the websites they visit, and of the trackers included on those sites. Without this information users are unable to differentiate between sites which take steps to protect their privacy and sites which don’t. This leads to less of an incentive for site owners to protect the privacy of their users, as online privacy often comes at the expense of additional ad revenue or third-party features.

With our study, we were not only able to remove this information asymmetry [4], but were able to do so in a way that was relatable to users. The visual representation of canvas fingerprinting proved particularly helpful in removing that asymmetry of information; it was very intuitive to see how the shapes drawn to a canvas could produce a unique image. The ProPublica article even included a demo where users could see their fingerprint built in real time.

While writing the paper we made it a point to include not only the trackers responsible for fingerprinting, but also the sites on which the fingerprinting was taking place. Instead of reading that tracker A was responsible for fingerprinting, users could understand that it occurs when they visit publishers X, Y, and Z. If a user is frustrated by a technique and is only familiar with the tracker responsible, there isn't much they can do. By knowing the publishers on which the technique is used, they can voice their frustrations or choose to visit alternative sites. Publishers, which have an interest in keeping users, then have an incentive to change their practices.

The technique wasn't only news to users; even some site owners were unaware that it was being used on their sites. ProPublica updated their original story with a message from YouPorn stating, "[the website was] completely unaware that AddThis contained a tracking software…", and that it had since been removed. This shows that measurement work can even help remove the information asymmetry between trackers and the sites upon which they track.

How are things now?

In a re-run of the measurements in January 2016 [5], I’ve observed that the number of distinct trackers utilizing canvas fingerprinting has more than doubled since our 2014 measurement. While the overall number of publisher sites on which the tracking occurs is still below that of our previous measurement, the use of the technique has at least partially revived since AddThis and Ligatus stopped the practice.

This made me curious whether we would see similar trends for other tracking techniques. In our 2014 paper we also studied cookie respawning [6]. This technique was well studied in the past, both in 2009 and 2011, making it a good candidate for analyzing the longitudinal effects of measurement. As with our measurement, those studies also received a bit of press coverage when released.

The 2009 study, which found HTTP cookie respawning on 6 of the top 100 sites, resulted in a $2.4 million settlement. The 2011 follow-up study found that the use of respawning had decreased to just 2 sites in the top 100, and likewise resulted in a $500,000 settlement. In 2014 we observed respawning on 7 of the top 100 sites; however, none of these sites or trackers were US-based entities. This suggests that lawsuits can have an impact, but that the impact may be limited by the global nature of the web.

What we’ve learned

Providing transparency into privacy violations online has the potential for huge impact. We saw users unhappy with the trackers that use canvas fingerprinting, with the sites that include those trackers, and even with the browsers they use to visit those sites. It is important that studies visit a large number of sites, and list those on which the privacy violation occurs.

The pressure of transparency affects the larger players more than the long tail. A tracker that is present on a large number of sites, or on sites that receive more traffic, is more likely to be the focus of news articles or the subject of lawsuits. Indeed, our 2016 measurements support this: we've seen a large increase in the number of parties involved, but the increase is limited to parties with a much smaller presence.

In the absence of a lawsuit, policy change, or technical solution, we see that canvas fingerprinting use is beginning to grow again. Without constant monitoring and transparency, the level of privacy violation can easily creep back to where it was. A single, well-connected tracker can re-introduce a tracking technique to a large number of first parties.

The developer community will help; we just need to provide them with the data they need. Our detection methodology served as the foundation for blocking tools, which intercept the same calls we used for detection. The script lists we included in our paper and on our website were incorporated into current blocklists.

In a follow-up post, I'll discuss the work we're doing to make regular, large-scale measurements of web tracking a reality. I'll show how the tools we've built make it possible to run automated, million-site crawls every month, and I'll introduce some new results we're planning to release.

 

[1] The paper’s website provides a short description of each of these techniques.

[2] See: the EFF’s Panopticlick, and academic papers Cookieless Monster and FPDetective.

[3] For example, adding noise to canvas readouts has the potential to cause problems for non-tracking use cases and can still be defeated by a determined tracker. The Tor Browser’s solution of prompting the user on certain canvas calls does work, however it requires a user’s understanding that the technique can be used for tracking and provides for a less than optimal user experience.

[4] For a nice discussion of information asymmetry and the web: Privacy and the Market for Lemons, or How Websites Are Like Used Cars

[5] These measurements were run using the canvas instrumentation portion of OpenWPM.

[6] For a detailed description of cookie respawning, I suggest reading through Ashkan Soltani’s blog post on the subject.

Thanks to Arvind Narayanan for his helpful comments.