September 25, 2017

Web Census Notebook: A new tool for studying web privacy

As part of the Web Transparency and Accountability Project, we’ve been visiting the web’s top 1 million sites every month using our open-source privacy measurement tool OpenWPM. This has led to numerous worrying findings such as the systematic abuse of newly introduced web features for fingerprinting, leading to better privacy tools and occasionally strong responses from browser vendors.

Enabling research is great — OpenWPM has led to 14 papers so far — but research is slow and requires expertise. To make our work more directly useful, today we’re announcing a new tool to study web privacy: a Jupyter notebook interface and a set of libraries to quickly answer most questions about web tracking by querying the the 500 GB of data we collect every month.

Jupyter notebook is an intuitive tool for data analysis using Python, and it’s what we use here internally for much of our own research. Notebooks are accessible with a simple web interface; yet the code, data, and memory persists on the server if you close the browser and return to it later (even from a different device). Notebooks combine code with visualizations, making them ideal for data exploration and analysis.

Who could benefit from this tool? We envision uses such as these:

  • Publishers could use our data to understand third-party tracking on their own websites.
  • Journalists could use our data to investigate and expose privacy-infringing practices.
  • Regulators and enforcement agencies could use our tool in investigations.
  • Creators of browser privacy tools could use our data to test their effectiveness.

Let’s look at an example that shows the feel of the interface. The code below computes the average number of embedded trackers on the top 100 websites in various categories such as “news” and “shopping”. It is intuitive and succinct. Without our interface, not only would the SQL version of this query be much more cumbersome, but it would require a ton of legwork and setup to even get to a point where you can write the query. Now you just need to point your browser at our notebook.

    for category, domains in census.first_parties.alexa_categories.items():
        avg = sum(1 for first_party in domains[:100]
                    for third_party in first_party.third_party_resources
                    if third_party.is_tracker) / 100
        print("Average number of trackers on %s sites: %.1f" % (category, avg))

The results confirm our finding that news sites have the most trackers, and adult sites the least. [1]

Here’s what happens behind the scenes:

  • census is a Python object that exposes all the relationships between websites and third parties as object attributes, hiding the messy details of the underlying database schema. Each first party is represented by a FirstParty object that gives access to each third-party resource (URI object) on the first party, and the ThirdParty that the URI belongs to. When the objects are accessed, they are instantiated automatically by querying the database.
  • census.first_parties is a container of FirstParty objects ordered by Alexa traffic rank, so you can easily analyze the top sites, or sites in the long tail, or specific sites. You can also easily slice the sites by category: in the example above, we iterate through each category of census.first_parties.alexa_categories.
  • There’s a fair bit of logic that goes into analyzing the crawl data which third parties are embedded on which websites, and cross-referencing that with tracking-protection lists to figure out which of those are trackers. This work is already done for you, and exposed via attributes such as ThirdParty.is_tracker.

Since the notebooks run on our server, we expect to be able to support only a limited number (a few dozen) at this point, so you need to apply for access. The tool is currently in beta as we smooth out rough edges and add features, but it is usable and useful. Of course, you’re welcome to run the notebook on your own server — the underlying crawl datasets are public, and we’ll release the code behind the notebooks soon. We hope you find this of use to you, and we welcome your feedback.

 

[1] The linked graph from our paper measures the number of distinct domains whereas the query above counts every instance of every tracker. The trends are the same in both cases, but the numbers are different. Here’s the output of the query:

 

Average number of third party trackers on computers sites: 41.0
Average number of third party trackers on regional sites: 68.8
Average number of third party trackers on recreation sites: 58.2
Average number of third party trackers on health sites: 38.4
Average number of third party trackers on news sites: 151.2
Average number of third party trackers on business sites: 55.0
Average number of third party trackers on kids_and_teens sites: 74.8
Average number of third party trackers on home sites: 94.5
Average number of third party trackers on arts sites: 108.6
Average number of third party trackers on sports sites: 86.6
Average number of third party trackers on reference sites: 43.8
Average number of third party trackers on science sites: 43.1
Average number of third party trackers on society sites: 73.5
Average number of third party trackers on shopping sites: 53.1
Average number of third party trackers on adult sites: 16.8
Average number of third party trackers on games sites: 70.5

How to buy physical goods using Bitcoin with improved security and privacy

Bitcoin has found success as a decentralized digital currency, but it is only one step toward decentralized digital commerce. Indeed, creating decentralized marketplaces and mechanisms is a nascent and active area of research. In a new paper, we present escrow protocols for cryptocurrencies that bring us closer to decentralized commerce.

In any online sale of physical goods, there is a circular dependency: the buyer only wants to pay once he receives his goods, but the seller only wants to ship them once she’s received payment. This is a problem regardless of whether one pays with bitcoins or with dollars, and the usual solution is to utilize a trusted third party. Credit card companies play this role, as do platforms such as Amazon and eBay. Crucially, the third party must be able to mediate in case of a dispute and determine whether the seller gets paid or the buyer receives a refund.

A key requirement for successful decentralized marketplaces is to weaken the role of such intermediaries, both because they are natural points of centralization and because unregulated intermediaries have tended to prove untrustworthy. In the infamous Silk Road marketplace, buyers would send payment to Silk Road, which would hold it in escrow. Note that escrow is necessary because it is not possible to reverse cryptocurrency transactions, unlike credit card payments. If all went well, Silk Road would forward the money to the seller; otherwise, it would mediate the dispute. Time and time again, the operators of these marketplaces have absconded with the funds in escrow, underscoring that this isn’t a secure model.

Lately, there have been various services that offer a more secure version of escrow payment. Using 2-of-3 multisignature transactions, the buyer, seller, and a trusted third party each hold one key. The buyer pays into a multisignature address that requires that any two of these three keys sign in order for the money to be spent. If the buyer and seller are in agreement, they can jointly issue payment. If there’s a dispute, the third party mediates. The third party and the winner of the dispute will then use their respective keys to issue a payout transaction to the winner.

This escrow protocol has two nice features. First, if there’s no dispute, the buyer and seller can settle without involving the third party. Second, the third party cannot run away with the money as it only holds one key, while two are necessary spend the escrowed funds.

Until now, the escrow conversation has generally stopped here. But in our paper we ask several further important questions. To start, there are privacy concerns. Unless the escrow protocol is carefully designed, anyone observing the blockchain might be able to spot escrow transactions. They might even be able to tell which transactions were disputed, and connect those to specific buyers and sellers.

In a previous paper, we showed that using multisignatures to split control over a wallet leads to major privacy leaks, and we advocated using threshold signatures instead of multisignatures. It turns out that using multisignatures for escrow has similar negative privacy implications. While using 2-of-3 threshold signatures instead of multisignatures would solve the privacy problem, it would introduce other undesirable features in the context of escrow as we explain in the paper.

Moreover, the naive escrow protocol above has a gaping security flaw: even though the third party cannot steal the money, it can refuse to mediate any disputes and thus keep the money locked up.

In addition to these privacy and security requirements, we study group escrow. In such a system, the transacting parties may choose multiple third parties from among a set of escrow service providers and have them mediate disputes by majority vote. Again, we analyze both the privacy and the security of the resulting schemes, as well as the details of group formation and communication.

Our goal in this paper is not to provide a definitive set of requirements for escrow services. We spoke with many Bitcoin escrow companies in the course of our research — it’s a surprisingly active space — and realized that there is no single set of properties that works for every use-case. For example, we’ve looked at privacy as a desirable property so far, but buyers may instead want to be able to examine the blockchain and identify how often a given seller was involved in disputes. In our paper, we present a toolbox of escrow protocols as well as a framework for evaluating them, so that anyone can choose the protocol that best fits their needs and be fully aware of the security and privacy implications of that choice.

We’ll present the paper at the Financial Cryptography conference in two weeks.

Pragmatic advice for buying “Internet of Things” devices

We’re hearing an increasing amount about security flaws in “Internet of Things” devices, such as a “messaging” teddy bear with poor security or perhaps Samsung televisions being hackable to become snooping devices. How are you supposed to make purchasing decisions for all of these devices when you have no idea how they work or if they’re flawed?

Threat modeling and understanding the vendor’s stance. If a device comes from a large company with a reputation for caring about security (e.g., Apple, Google, and yes, even Microsoft), then that’s a positive factor. If the device comes from a fresh startup, or from a no-name Chinese manufacturer, that’s a negative factor. One particular thing that you might look for is evidence that the device in question automatically updates itself without requiring any intervention on your behalf. You might also consider the vendor’s commitment to security features as part of the device’s design. And, you should consider how badly things could go wrong if that device was compromised. Let’s go through some examples.

Your home’s border router. When we’re talking about your home’s firewall / NAT / router box, a compromise is quite significant, as it would allow an attacker to be a full-blown man-in-the-middle on all your traffic. Of course, with the increasing use of SSL-secured web sites, this is less devastating than you’d think. And when our ISPs might be given carte blanche to measure and sell any information about you, you already need to actively mistrust your Internet connection. Still, if you’ve got insecure devices on your home network, your border router matters a lot.

A few years ago, I bought a pricey Apple Airport Extreme, which Apple kept updated automatically. It has been easy to configure and manage and it faithfully served my home network. But then Apple supposedly decided to abandon the product. This was enough for me to start looking around for alternatives, wherein I settled on the new Google WiFi system, not only because it does a clever mesh network thing for whole-home coverage, but because Google explicitly claims security features (automatic updates, trusted boot, etc.) as part of its design. If you decide you don’t trust Google, then you should evaluate other vendors’ security claims carefully rather than just buying the cheapest device at the local electronics store.

Your front door / the outside of your house. Several vendors offer high-tech door locks that speak Bluetooth or otherwise can open themselves without requiring a mechanical key. Other vendors offer “video doorbells”. And a number of IoT vendors have replacements for traditional home security systems, using your Internet connection for connecting to a “monitoring” service (and, in some cases, using a cellular radio connection as a backup). For my own house, I decided that a Ring video doorbell was a valuable idea, based on its various advertised features, but also if it’s compromised, nobody can see into my house. In the worst case, an attacker can learn the times that I arrive and leave from my house, which aren’t exactly a nation-state secret. Conversely, I stuck with our traditional mechanical door locks. Sure, they’re surprisingly easy to pick, but I might at least end up with a nice video of the thief. I’m assuming that I have more to risk from “smash and grab” amateur thieves than from careful professionals. Ultimately, we do have insurance for these sorts of things. Speaking of which, Ring provides a free replacement if somebody steals your video doorbell. That’s as much a threat as anything.

Your home interior. Unlike the outside, I decided against any sort of always-on audio or video devices inside my house. No NestCam. No “smart” televisions. No Alexa or Google Home. After all the hubbub about baby monitors being actively cataloged and compromised, I wouldn’t be willing to have any such devices on my network because the risks outweigh the benefits. On the flip side, I’ve got no problem with my Nest thermostats. They’re incredibly convenient, and the vendor seems to have kept up with software patches and feature upgrades, continuing to support my first-generation devices. If compromised, an attacker might be able to overheat my house or perhaps damage my air conditioner by power-cycling it too rapidly. Possible? Yes. Likely? I doubt it. As with the video doorbell, there’s also a risk that an attacker could profile when I leave in the morning and get home in the evening.

Your mobile phones. The only in-home always-on audio surveillance is the “Ok Google” functionality in our various Android devices, which leads to a broader consideration of mobile phone security. All of our phones are Google Nexus or Pixel devices, so are running the latest release builds from Google. I’d feel similarly comfortable with the latest Apple devices. Suffice to say that mobile phone security is really an entirely separate topic from IoT security, but many of the same considerations apply: is the vendor supplying regular security patches, etc.

Your home theater. As mentioned above, I’m not interested in “smart” devices that can turn into surveillance devices. Our “smart TV” solution is a TiVo device: actively maintained by TiVo, and with no microphone or camera. If compromised, an attacker could learn what we’re watching, but again, there are no deep secrets that need to be protected. (Gratuitous plug: SyFy’s “The Expanse” is fantastic.) Our TV itself is not connected to anything other than the TiVo and a Chromecast device (which, again, has a remarkably limited attack surface; it’s basically just a web browser without the buttons around the edges).

I’m pondering a number of 4K televisions to replace our older TV, and they all come with “smarts” built-in. For most TV vendors, I’d just treat them as “dumb” displays, but I might make an exception for an Android TV device. I’ll note that Google abandoned support for its earlier Google TV systems, including a Sony-branded Google TV Bluray player that I bought back in the day, so I currently use it as a dumb Bluray player rather than as a smart device for running apps and such. My TiVo and Chromecast provide the “smart” features we need and both are actively supported. Suffice to say that when you buy a big television, you should expect it to last a decade or more, so it’s good to have the “smart” parts in smaller/cheaper devices that you can replace or upgrade on a more frequent basis.

Other gadgets. In our home, we’ve got a variety of other “smart” devices on the network, including a Rachio sprinkler controller, a Samsung printer, an Obihai VoIP gateway, and a controller for our solar panel array (powerline networking to gather energy production data from each panel!). The Obihai and the Samsung don’t do automatic updates and are probably security disaster areas. The Obihai apparently only encrypts control traffic with internet VoIP providers, while the data traffic is unencrypted. So do I need to worry about them? Certainly, if an attacker could somehow break in from one device and move laterally to another, the printer and the VoIP devices are the tastiest targets, as an attacker could see what I print (hint: nothing very exciting, unless you really love grocery shopping lists) or listen in on my phone calls (hint: if it’s important, I’d use Signal or I’d have an in-person conversation without electronics in earshot).

Some usability thoughts. After installing all these IoT devices, one of the common themes that I’ve observed is that they all have radically different setup procedures. A Nest thermostat, for example, has you spin the dial to enter your WiFi password, but other devices don’t have dials. What should they do? Nest Protect smoke detectors, for example, have a QR code printed on them which drives your phone to connect to a local WiFi access point inside the smoke detector. This is used to communicate the password for your real WiFi network, after which the local WiFi is never again used. For contrast, the Rachio sprinkler system uses a light sensor on the device that reads color patterns from your smartphone screen, which again send it the configuration information to connect to your real WiFi network. These setup processes, and others like them, are making a tradeoff across security, usability, and cost. I don’t have any magic thoughts on how best to support the “IoT pairing problem”, but it’s clearly one of the places where IoT security matters.

Security for “Internet of Things” devices is a topic of growing importance, at home and in the office. These devices offer all kinds of great features, whether it’s a sprinkler controller paying attention to the weather forecast or a smoke alarm that alerts you on your phone. Because they deliver useful features, they’re going to grow in popularity. Unfortunately, it’s not to possible for most consumers to make meaningful security judgments about these devices, and even web sites that specialize in gadget reviews don’t have security analysts on staff. Consequently, consumers are forced to make tradeoffs (e.g., no video cameras inside the house) or to use device brands as a proxy for measuring the quality and security of these devices.