September 29, 2022

CITP’s OpenWPM privacy measurement tool moves to Mozilla

As part of my PhD at Princeton’s Center for Information Technology Policy (CITP), I led the development of OpenWPM, a tool for web privacy measurement, with the help of many contributors. My co-authors and I first released OpenWPM in 2014 with the goal of lowering the technical costs of large-scale web privacy measurement. The tool’s success exceeded our expectations; it has been used by over 30 academic studies since its release, in research areas ranging from computer science to law.

OpenWPM has a new home at Mozilla. After graduating in 2018, I joined Mozilla’s security engineering team to work on strengthening Firefox’s tracking protection. We’re committed to ensuring users are protected from tracking by default. To that end, we’ve migrated OpenWPM to Mozilla, where it will remain open source to ensure researchers have the tools required to discover privacy-infringing practices on the web. We are also using it ourselves to understand the implications of our new anti-tracking features, to discover fingerprinting scripts and add them to our tracking protection lists, as well as to collect data for a number of ongoing privacy research projects.

Over the past six months we’ve started a number of efforts to significantly improve OpenWPM:

1. Cloud-friendly data storage. OpenWPM has long used SQLite to store crawl data. This makes it easy for anyone to install the tool, run a small measurement, and inspect the dataset locally. However, this is very limiting for large-scale measurements. OpenWPM can now save data directly to Amazon S3 in Parquet format, making it possible to launch crawls on a cluster of machines.

2. Support for modern versions of Firefox. We are in the process of migrating all of OpenWPM’s instrumentation to WebExtensions, which is necessary to run measurements with Firefox 57+.

2. Modular instrumentation. OpenWPM’s instrumentation was previously deeply embedded in the crawler, making it difficult to use outside of a crawling context. We’ve now refactored the instrumentation into a separate npm package that can easily be imported by any Firefox WebExtension. In fact, we’ve already used the module to collect data in one of our user studies.

4. A standard set of analysis utilities. To further ease analyses on OpenWPM datasets, we’ve bundled the many small utility functions we’ve developed over the years into a single utilities package available on PyPI.

5. Data collection and release. Since 2015, CITP has collected monthly 1-million-site web measurements using OpenWPM. All of this data is available for download, but once Gunes Acar moves on from CITP in a few months, the CITP measurements will end. At Mozilla, we are exploring options to regularly collect and release new measurements.

All of these efforts are still underway, and we welcome community involvement as we continue to build upon them. You can find us hanging out in #openwpm on irc.mozilla.org.

No boundaries for Facebook data: third-party trackers abuse Facebook Login

by Steven Englehardt [0], Gunes Acar, and Arvind Narayanan

So far in the No boundaries series, we’ve uncovered how web trackers exfiltrate identifying information from web pages, browser password managers, and form inputs.

Today we report yet another type of surreptitious data collection by third-party scripts that we discovered: the exfiltration of personal identifiers from websites through “login with Facebook” and other such social login APIs. Specifically, we found two types of vulnerabilities [1]:

  • seven third parties abuse websites’ access to Facebook user data
  • one third party uses its own Facebook “application” to track users around the web.

 

Vulnerability 1: Third parties piggyback on Facebook access granted to websites

Diagram of third-party script accessing Facebook API

When a user clicks “Login with Facebook”, they will be prompted to allow the website they’re visiting to access some of their Facebook profile information [2]. Even after Facebook’s recent moves to lock down the feature, websites can request the user’s email address and  “public profile” (name, age range, gender, locale, and profile photo) without triggering a manual review by Facebook. Once the user allows access, any third-party Javascript embedded in the page, such as tracker.com in the figure above, can also retrieve the user’s Facebook information as if they were the first party [3].

[Read more…]

No boundaries for credentials: New password leaks to Mixpanel and Session Replay Companies

In this installment of the “No Boundaries” series we show how wholesale collection of user interactions by third-party analytics and session replay scripts cause inadvertent collection of passwords.
By Steve Englehardt, Gunes Acar and Arvind Narayanan

Following the recent report that Mixpanel, a popular analytics provider, had been inadvertently collecting passwords that users typed into websites, we took a deeper look [1]. While Mixpanel characterized it as a “bug, plain and simple” — one that it had fixed — we found that:

  • Mixpanel continues to grab passwords on some sites, even with the patched version of its code.
  • The problem is not limited to Mixpanel; also affected are session replay scripts, which we revealed earlier to be scooping up various other types of sensitive information.
  • There is no foolproof way for these third party scripts to prevent password collection, given their intended functionality. In some cases, password collection happens due to extremely subtle interactions between code from different entities.

Overall, we think that the approach of third-party scripts collecting the entirety of web pages or form inputs, and attempting to filter out sensitive information is incompatible with user security and privacy.

Password leaks are not limited to Mixpanel

In our research we found password leaks to four different third-party analytics providers across a number of websites. The sources are numerous: several variants of a “Show Password” feature added by site owners, an unexpected interaction with an unrelated third-party script, unanticipated changes to page structure by browser extensions, and even a bug in privacy tools of one of the analytics libraries. However, the underlying cause is the same: wholesale collection of user input data, with protection provided by set of blacklist-based heuristics to filter password fields. We argue that this heuristic approach is bound to fail, and provide a list of examples in which it does.

This summary is provided not as an exhaustive list of all possible vulnerabilities, but rather as examples of how things can go wrong. A detailed description of each vulnerability and the vendor response is available in the Appendix.

[Read more…]

No boundaries: Exfiltration of personal data by session-replay scripts

This is the first post in our “No Boundaries” series, in which we reveal how third-party scripts on websites have been extracting personal information in increasingly intrusive ways. [0]
by Steven Englehardt, Gunes Acar, and Arvind Narayanan

Update: we’ve released our data — the list of sites with session-replay scripts, and the sites where we’ve confirmed recording by third parties.

You may know that most websites have third-party analytics scripts that record which pages you visit and the searches you make.  But lately, more and more sites use “session replay” scripts. These scripts record your keystrokes, mouse movements, and scrolling behavior, along with the entire contents of the pages you visit, and send them to third-party servers. Unlike typical analytics services that provide aggregate statistics, these scripts are intended for the recording and playback of individual browsing sessions, as if someone is looking over your shoulder.

[Read more…]

I never signed up for this! Privacy implications of email tracking

In this post I discuss a new paper that will appear at PETS 2018, authored by myself, Jeffrey Han, and Arvind Narayanan.

What happens when you open an email and allow it to display embedded images and pixels? You may expect the sender to learn that you’ve read the email, and which device you used to read it. But in a new paper we find that privacy risks of email tracking extend far beyond senders knowing when emails are viewed. Opening an email can trigger requests to tens of third parties, and many of these requests contain your email address. This allows those third parties to track you across the web and connect your online activities to your email address, rather than just to a pseudonymous cookie.

Illustrative example. Consider an email from the deals website LivingSocial (see details of the example email). When the email is opened, client will make requests to 24 third parties across 29 third-party domains.[1] A total of 10 third parties receive an MD5 hash of the user’s email address, including major data brokers Datalogix and Acxiom. Nearly all of the third parties (22 of the 24) set or receive cookies with their requests. In a webmail client the cookies are the same browser cookies used to track users on the web, and indeed many major web trackers (including domains belonging to Google, comScore, Adobe, and AOL) are loaded when the email is opened. While this example email has a large number of trackers relative to the average email in our corpus, the majority of emails (70%) embed at least one tracker.

How it works. Email tracking is possible because modern graphical email clients allow rendering a subset of HTML. JavaScript is invariably stripped, but embedded images and stylesheets are allowed. These are downloaded and rendered by the email client when the user views the email.[2] Crucially, many email clients, and almost all web browsers, in the case of webmail, send third-party cookies with these requests. The email address is leaked by being encoded as a parameter into these third-party URLs.

Diagram showing the process of tracking with email address

When the user opens the email, a tracking pixel from “tracker.com” is loaded. The user’s email address is included as a parameter within the pixel’s URL. The email client here is a web browser, so it automatically sends the tracking cookies for “tracker.com” along with the request. This allows the tracker to create a link between the user’s cookie and her email address. Later, when the user browses a news website, the browser sends the same cookie, and thus the new activity can be connected back to the email address. Email addresses are generally unique and persistent identifiers. So email-based tracking can be used for targeting online ads based on offline activity (say, to shoppers who used a loyalty card linked to an email address) and for linking different devices belonging to the same user.

[Read more…]