June 27, 2017

The Princeton Web Census: a 1-million-site measurement and analysis of web privacy

Web privacy measurement — observing websites and services to detect, characterize, and quantify privacy impacting behaviors — has repeatedly forced companies to improve their privacy practices due to public pressure, press coverage, and regulatory action. In previous blog posts I’ve analyzed why our 2014 collaboration with KU Leuven researchers studying canvas fingerprinting was successful, and discussed why repeated, large-scale measurement is necessary.

Today I’m pleased to release initial analysis results from our monthly, 1-million-site measurement. This is the largest and most detailed measurement of online tracking to date, including measurements for stateful (cookie-based) and stateless (fingerprinting-based) tracking, the effect of browser privacy tools, and “cookie syncing”.  These results represent a snapshot of web tracking, but the analysis is part of an effort to collect data on a monthly basis and analyze the evolution of web tracking and privacy over time.

Our measurement platform used for this study, OpenWPM, is already open source. Today, we’re making the datasets for this analysis available for download by the public. You can find download instructions on our study’s website.

New findings

We provide background information and summary of each of our main findings on our study’s website. The paper goes into even greater detail and provides the methodological details on the measurement and analysis of each finding. One of our more surprising findings was the discovery of two apparent attempts to use the HTML5 Audio API for fingerprinting.

The figure is a visualization of the audio processing executed on users’ browsers by third-party fingerprinting scripts. We found two different AudioNode configurations in use. In both configurations an audio signal is generated by an oscillator and the resulting signal is hashed to create an identifier. Initial testing shows that the techniques may have some limitations when used for fingerprinting, but further analysis is necessary. You can help us with that (and test your own device) by using our demonstration page here.

See the paper for our analysis of a consolidated third-party ecosystem, the effects of third parties on HTTPS adoption, and examine the performance of tracking protection tools. In addition to audio fingerprinting, we show that canvas fingerprint is being used by more third parties, but on less sites; that a WebRTC feature can and is being used for tracking; and how the HTML Canvas is being used to discover user’s fonts.

What’s next? We are exploring ways to share our data and analysis tools in a form that’s useful to a wider and less technical audience. As we continue to collect data, we will also perform longitudinal analyses of web tracking. In other ongoing research, we’re using the data we’ve collected to train machine-learning models to automatically detect tracking and fingerprinting.