May 22, 2017

The Princeton Web Census: a 1-million-site measurement and analysis of web privacy

Web privacy measurement — observing websites and services to detect, characterize, and quantify privacy impacting behaviors — has repeatedly forced companies to improve their privacy practices due to public pressure, press coverage, and regulatory action. In previous blog posts I’ve analyzed why our 2014 collaboration with KU Leuven researchers studying canvas fingerprinting was successful, and discussed why repeated, large-scale measurement is necessary.

Today I’m pleased to release initial analysis results from our monthly, 1-million-site measurement. This is the largest and most detailed measurement of online tracking to date, including measurements for stateful (cookie-based) and stateless (fingerprinting-based) tracking, the effect of browser privacy tools, and “cookie syncing”.  These results represent a snapshot of web tracking, but the analysis is part of an effort to collect data on a monthly basis and analyze the evolution of web tracking and privacy over time.

Our measurement platform used for this study, OpenWPM, is already open source. Today, we’re making the datasets for this analysis available for download by the public. You can find download instructions on our study’s website.

New findings

We provide background information and summary of each of our main findings on our study’s website. The paper goes into even greater detail and provides the methodological details on the measurement and analysis of each finding. One of our more surprising findings was the discovery of two apparent attempts to use the HTML5 Audio API for fingerprinting.

The figure is a visualization of the audio processing executed on users’ browsers by third-party fingerprinting scripts. We found two different AudioNode configurations in use. In both configurations an audio signal is generated by an oscillator and the resulting signal is hashed to create an identifier. Initial testing shows that the techniques may have some limitations when used for fingerprinting, but further analysis is necessary. You can help us with that (and test your own device) by using our demonstration page here.

See the paper for our analysis of a consolidated third-party ecosystem, the effects of third parties on HTTPS adoption, and examine the performance of tracking protection tools. In addition to audio fingerprinting, we show that canvas fingerprint is being used by more third parties, but on less sites; that a WebRTC feature can and is being used for tracking; and how the HTML Canvas is being used to discover user’s fonts.

What’s next? We are exploring ways to share our data and analysis tools in a form that’s useful to a wider and less technical audience. As we continue to collect data, we will also perform longitudinal analyses of web tracking. In other ongoing research, we’re using the data we’ve collected to train machine-learning models to automatically detect tracking and fingerprinting.

The Web Privacy Problem is a Transparency Problem: Introducing the OpenWPM measurement tool

In a previous blog post I explored the success of our study, The Web Never Forgets, in having a positive impact on web privacy. To ensure a lasting impact, we’ve been doing monthly, automated 1-million-site measurement of tracking and privacy. Soon we’ll be releasing these datasets and our findings. But in this post I’d like to introduce our web measurement platform OpenWPM that we’ve built for this purpose. OpenWPM has been quickly gaining adoption — it has already been used by at least 6 other research groups, as well as journalists, regulators, and students for class projects. In this post, I’ll explain why we built OpenWPM, describe a previously unmeasured type of tracking we found using the tool, and show you how you can participate and contribute to this community effort.

This post is based on a talk I gave at the FTC’s PrivacyCon. You can watch the video online here.

Why monthly, large-scale measurements are necessary

In my previous post, I showed how measurements from academic studies can help improve online privacy, but I also pointed out how they can fall short. Measurement results often have an immediate impact on online privacy. Unless that impact leads to a technical, policy, or legal solution, the impact will taper off over time as the measurements age.

Technical solutions do not always exist for privacy violations. I discussed how canvas fingerprinting can’t be prevented without sacrificing usability in my previous blog post, but there are others as well. For example, it has proven difficult to find a satisfactory solution to the privacy concerns surrounding WebRTC’s access to local IPs. This is also highlighted in the unofficial W3C draft on Fingerprinting Guidance for Web Specification Authors, which states: “elimination of the capability of browser fingerprinting by a determined adversary through solely technical means that are widely deployed is implausible.”

It seems inevitable that measurement results will go out of date, for two reasons. Most obviously, there is a high engineering cost to running privacy studies. Equally important is the fact that academic papers in this area are published as much for their methodological novelty as for their measurement results. Updating the results of an earlier study is unlikely to lead to a publication, which takes away the incentive to do it at all. [1]

OpenWPM: our platform for automated, large-scale web privacy measurements

We built OpenWPM (Github, technical report), a generic platform for online tracking measurement. It provides the stability and instrumentation necessary to run many online privacy studies. Our goal in developing OpenWPM is to decrease the initial engineering cost of studies and make running a measurement as effortless as possible. It has already been used in several published studies from multiple institutions to detect and reverse engineer online tracking.

OpenWPM also makes it possible to run large-scale measurements with Firefox, a real consumer browser [2]. Large scale measurement lets us compare the privacy practices of the most popular sites to those in the long tail. This is especially important when observing the use of a tracking technique highlighted in a measurement study. For example, we can check if it’s removed from popular sites but added to less popular sites.

Transparency through measurement, on 1 million sites

We are using OpenWPM to run the Princeton Transparency Census, a monthly web-scale measurement of tracking techniques and privacy issues, comprising 1 million sites. With it, we will be able to detect and measure many of the known privacy violations reported by researchers so far: the use of stateful tracking mechanisms, browser fingerprinting, cookie synchronization, and more.

During the measurements, we’ll collect data in three categories: (1)  network traffic — all HTTP requests and response headers (2) client-side state — cookies, Flash cookies, etc. (3) execution traces — we trap and record targeted JavaScript API calls that have been known to be used for tracking. In addition to releasing all of the raw data collected during the census, we’ll release the results of our own automated analysis.

Alongside the 1 million site measurement, we are also running smaller, targeted measurements with different browser configurations. Examples include crawling deeper into the site or browsing with a privacy extension, such as Ghostery or AdBlock Plus. These smaller crawls will provide additional insight into the privacy threats faced by real users.

Detecting WebRTC local IP discovery

As a case study of the ease of introducing a new measurement into the infrastructure, I’ll walk through the steps I took to measure scripts using WebRTC to discover a machine’s local IP address [3]. For machines behind a home router, this technique may reveal an IP of the form 192.168.1.*. Users of corporate or university networks may return a unique local IP address from within that organization’s IP range.

A user’s local IP address adds additional information to a browser fingerprint. For example, it can be used as a way to differentiate multiple users behind a NAT without requiring browser state. The amount of identifying information it provides for the average user hasn’t been studied. However, both Chrome and Firefox [4] have implemented opt-in solutions to prevent the technique. The first reported use that I could find for this technique in the wild was a third-party on in July 2015.

After examining a demo script, I decided to record all property access and all method calls of the RTCPeerConnection interface, the primary interface for WebRTC. The additional instrumentation necessary for this interface is just a single line of Javascript in OpenWPM’s Firefox extension.

A preliminary analysis [5] of a 50,000 site pilot measurement from October 2015 suggests that WebRTC local IP discovery is used on the homepages of over 100 sites, from over 20 distinct scripts. Only 1 of these scripts would be blocked by EasyList or EasyPrivacy.

How can this be useful for you

We envision several ways researchers and other members of the community can make use of  OpenWPM and our measurements. I’ve listed them here from least involved to most involved.

(1) Use our measurement data for their own tools. In my analysis of canvas fingerprinting I mentioned that Disconnect incorporated our research results into their blocklist. We want to make it easy for privacy tools to make use of the analysis we run, by releasing analysis data in a structured, machine readable way.

(2) Use the data collected during our measurements, and build their own analysis on top of it. We know we’ll never be able to take the human element out of these studies. Detection methodologies will change, new features of the browser will be released and others will change. The depth of the Transparency measurements should make it easy test new ideas, with the option of contributing them back to the regular crawls.

(3) Use OpenWPM to collect and release their own data. This is the model we see most web privacy researchers opting for, and a model we plan to use for most of our own studies. The platform can be used and tweaked as necessary for the individual study, and the measurement results and data can be shared publicly after the study is complete.

(4) Contribute to OpenWPM through pull requests. This is the deepest level of involvement we see. Other developers can write new features into the infrastructure for their own studies or to be run as part of our transparency measurements. Contributions here will benefit all users of OpenWPM.

Over the coming months we will release new blog posts and research results on the measurements I’ve discussed in this post. You can follow our progress here on Freedom to Tinker, on Twitter @s_englehardt, and on our Github repository.


[1] Notable exceptions include the study of cookie respawning: 2009, 2011, 2011, 2014. and the statistics on stateful tracking use and third-party inclusion: 2009, 2012, 2012, 2012, 2015.

[2] Crawling with a real browser is important for two reasons: (1) it’s less likely to be detected as a bot, meaning we’re less likely to receive different treatment from a normal user, and (2) a real browser supports all the modern web features (e.g. WebRTC, HTML5 audio and video), plugins (e.g. Flash), and extensions (e.g. Ghostery, HTTPS Everywhere). Many of these additional features play a large role in the average user’s privacy online.

[3] There is an additional concern that WebRTC can be used to determine a VPN user’s actual IP address, however this attack is distinct from the one described in this post.

[4] uBlock Origin also provides an option to prevent WebRTC local IP discovery on Firefox.

[5] We are in the process of running and verifying this analysis on a our 1 million site measurements, and will release an updated analysis with more details in the future.