September 25, 2017

A Peek at A/B Testing in the Wild

[Dillon Reisman was previously an undergraduate at Princeton when he worked on a neat study of the surveillance implications of cookies. Now he’s working with the WebTAP project again in a research + engineering role. — Arvind Narayanan]

In 2014, Facebook revealed that it had manipulated users’ news feeds for the sake of a psychology study of users’ emotions. We happen to know about this particular experiment because it was the subject of a publicly released academic paper, but websites do “A/B testing” every day that is completely opaque to the end user. Of course, A/B testing is often innocuous (say, to find a pleasing color scheme), but the point remains that the user rarely has any way of knowing how their browsing experience is being modified, or why.

By testing websites over time and under a variety of conditions, we could hope to discover the not-so-obvious ways in which users’ browsing experiences are manipulated. But one third-party service actually makes A/B testing and user tracking human-readable — no reverse-engineering or experimentation necessary! This is the widely-used A/B testing provider Optimizely; Jonathan Mayer had told us it would be an interesting target of study.* Its service exposes, in easily parsable form and directly in the JavaScript it embeds on websites, how its clients segment users and run experiments on them. In other words, if example.com uses Optimizely, the entire logic used by example.com for A/B testing is revealed to every visitor of example.com.

That means that the data collected by our large-scale web crawler OpenWPM contains the details of all the experiments that are being run across the web using Optimizely. In this post I’ll show you some interesting things we found by analyzing this data. We’ve also built a Chrome extension, Pessimizely, that you can download so you too can see a website’s Optimizely experiments. When a website uses Optimizely, the extension will alert you and attempt to highlight any elements on the page that may be subject to an experiment. If you visit nytimes.com, it will also show you alternative news headlines when you hover over a title. I suggest you give it a try!

 

The New York Times website, with headlines that may be subject to an experiment highlighted by Pessimizely.

 

The Optimizely Scripts

Our OpenWPM web crawler collects and stores the JavaScript embedded on every page it visits. This makes it straightforward to query for every page that uses Optimizely and to grab the code it receives from Optimizely for analysis. Once we had collected the scripts, we investigated them through regular-expression matching and manual analysis.
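To make the query-and-scan step concrete, here is a minimal Python sketch. The table name, columns, and fixture row are simplified stand-ins for illustration, not OpenWPM’s actual crawl-database schema:

```python
import re
import sqlite3

# Regex to grab the human-readable "name" fields from an Optimizely data
# object without parsing the full JavaScript.
NAME_RE = re.compile(r'"name"\s*:\s*"([^"]+)"')

def find_experiment_names(conn):
    """Map each page URL to the experiment names found in its Optimizely scripts."""
    rows = conn.execute(
        "SELECT page_url, content FROM scripts "
        "WHERE script_url LIKE '%optimizely%'"
    )
    results = {}
    for page_url, content in rows:
        results.setdefault(page_url, []).extend(NAME_RE.findall(content))
    return results

# Tiny in-memory fixture standing in for a real crawl database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scripts (page_url TEXT, script_url TEXT, content TEXT)")
conn.execute(
    "INSERT INTO scripts VALUES (?, ?, ?)",
    ("http://example.com",
     "https://cdn.optimizely.com/js/12345.js",
     '{"name": "Homepage headline test"}'),
)
print(find_experiment_names(conn))  # → {'http://example.com': ['Homepage headline test']}
```

In practice a full JavaScript or JSON parser is more robust than a regular expression, but this is roughly the shape of the analysis.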


  "4495903114": {
      "code": …
      "name": "100000004129417_1452199599
               [A.] New York to Appoint Civilian to Monitor Police Surveillance -- 
               [B.] Sued Over Spying on Muslims, New York Police Get Oversight",
      "variation_ids": ["4479602534","4479602535"],
      "urls": [{
        "match": "simple",
        "value": "http://www.nytimes.com"
      }],
      "enabled_variation_ids": ["4479602534","4479602535"]
    },

An example of an experiment from nytimes.com that is A/B testing two variations of a headline in a link to an article.

From a crawl of the top 100k sites in January 2016, we found and studied 3,306 different websites that use Optimizely. The Optimizely script for each site contains a data object that defines:

  1. How the website owner wants to divide users into “audiences,” based on any number of parameters like location, cookies, or user-agent.
  2. Experiments that the users might experience, and what audiences should be targeted with what experiments.

The Optimizely script reads from the data object, and depending on whether the user is in an experimental condition, executes a JavaScript payload and sets cookies. The site owner populates the data object through Optimizely’s web interface; who on a website’s development team can access that interface, and what they can do there, is a question for the site owner. The developer also helpfully provides names for their user audiences and experiments.
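As a rough sketch, walking such a data object might look like the following. The field names mirror the nytimes.com excerpt above, but the top-level `experiments` key and the concrete values are illustrative assumptions, not Optimizely’s exact format:

```python
import json

# A simplified, hypothetical Optimizely-style data object.
data = json.loads("""
{
  "experiments": {
    "4495903114": {
      "name": "Headline test: A vs. B",
      "variation_ids": ["4479602534", "4479602535"],
      "urls": [{"match": "simple", "value": "http://www.nytimes.com"}]
    }
  }
}
""")

# List every experiment a visitor could be subjected to on the page.
for exp_id, exp in data["experiments"].items():
    targets = ", ".join(u["value"] for u in exp["urls"])
    print(f"{exp_id}: {exp['name']} "
          f"({len(exp['variation_ids'])} variations, targets {targets})")
```

Because the whole object ships to every visitor, this kind of walk is all it takes to enumerate a site’s experiments.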

In total, we found 51,471 experiments on the 3,306 websites in our dataset that use Optimizely. On average each website has approximately 15.2 experiments, and each experiment has about 2.4 possible variations. We have only scratched the surface of the interesting things sites use A/B testing for; here I’ll share a couple of the more notable examples:

 

News publishers test the headlines users see, with differences that impact the tone of the article

A widespread use of Optimizely among news publishers is “headline testing.” To use an actual recent example from nytimes.com, a link to an article headlined:

“Turkey’s Prime Minister Quits in Rift With President”

…to a different user might appear as…

“Premier to Quit Amid Turkey’s Authoritarian Turn.”

The second headline suggests a much less neutral take on the news than the first. That sort of difference can shape a user’s perception of the article before they’ve read a single word. We found other examples of similarly politically sensitive headlines changing, like the following from pjmedia.com:

“Judge Rules Sandy Hook Families Can Proceed with Lawsuit Against Remington”

…could appear to some users as…

“Second Amendment Under Assault by Sandy Hook Judge.”

While editorial concerns might inform how news publishers change headlines, it’s clear that a major motivation behind headline testing is the need to drive clicks. A third variation we found for the Sandy Hook headline above is the much vaguer “Huge Development in Sandy Hook Gun Case.” The Wrap, an entertainment news outlet, experimented with replacing “Disney, Paramount Had Zero LGBT Characters in Movies Last Year” with the more obviously “click-baity” headline “See Which 2 Major Studios Had Zero LGBT Characters in 2015 Movies.”

We were able to identify 17 different news websites in our crawl that have done some form of headline testing in the past. This is most likely an undercount — most of these 17 websites use Optimizely’s integrations with other third-party platforms like Parse.ly and WordPress for their headline testing, which makes them easier to identify. Sites that roll their own headline testing are harder to spot; the New York Times website, for instance, implements its own headline-testing code.

Another limitation of what we’ve found so far is that the crawls that we analyzed only visit the homepage of each site. The OpenWPM crawler could be configured, however, to browse links from within a site’s homepage and collect data from those pages. A broader study of the practices of news publishers could use the tool to drill down deeper into news sites and study their headlines over time.

 

Websites identify and categorize users based on money and affluence

Many websites target users based on IP address and geolocation. But when IP and geolocation are combined with notions of money, the results are surprising. The website of a popular fitness tracker targets users originating from a list of six hard-coded IP addresses labelled “IP addresses Spending more than $1000.” Two of the IP addresses appear to belong to larger enterprise customers: a medical research institute and a prominent news magazine. Three belong to unidentified Comcast customers. These big-spending IP addresses were targeted in the past with an experiment that presented the user with a button suggesting they either “learn more” about a particular product or “buy now.”

Connectify, a large vendor of networking software, uses geolocation on a coarser level — they label visitors from the US, Australia, UK, Canada, Netherlands, Switzerland, Denmark, and New Zealand as coming from “Countries that are Likely to Pay.”

Non-profit websites also experiment with money. charity: water (charitywater.org) and the Human Rights Campaign (hrc.org) both have experiments defined to change the default donation amount a user might see in a pre-filled text box.

 

Web developers use third-party tools for more than their intended purpose

A developer following the path of least resistance might use Optimizely to do other parts of their job simply because it is the easiest tool available. Some of the more exceptional “experiments” deployed by websites are simple bug-fixes, described with titles like, “[HOTFIX][Core Commerce] Fix broken sign in link on empty cart,” or “Fix- Footer links errors 404.” Other experiments betray the haphazard nature of web development, with titles like “delete me,” “Please Delete this Experiment too,” or “#Bugfix.”

We might see these unusual uses because Optimizely allows developers to edit and roll out new code with little engineering overhead. With the inclusion of one third-party script, a developer can leverage the Optimizely web interface to do a task that might otherwise take more time or careful testing. This is one example of how third parties have evolved to become integral to the functionality and development of the web, raising security and privacy concerns.

 

The need for transparency

Much of the web is curated by inscrutable algorithms running on servers, and a concerted research effort is needed to shed light on the less-visible practices of websites. Thanks to the Optimizely platform we can at least peek into that secret world.

We believe, however, that transparency should be the default on the web — not the accidental product of one third party’s engineering decisions. Privacy policies are a start, but they generally only cover a website’s data collection and third-party usage at a coarse level. The New York Times Privacy Policy, for instance, does not even suggest that headline testing is something they might do, despite how it could drastically alter your consumption of the news. If websites had to publish more information about which third parties they use and how they use them, regulators could use that information to better protect consumers on the web. Considering the potentially harmful effects of how websites might use third parties, more transparency and oversight is essential.

 

@dillonthehuman


* This was a conversation a year ago, when Jonathan was a grad student at Stanford.

The Princeton Web Census: a 1-million-site measurement and analysis of web privacy

Web privacy measurement — observing websites and services to detect, characterize, and quantify privacy-impacting behaviors — has repeatedly forced companies to improve their privacy practices due to public pressure, press coverage, and regulatory action. In previous blog posts I’ve analyzed why our 2014 collaboration with KU Leuven researchers studying canvas fingerprinting was successful, and discussed why repeated, large-scale measurement is necessary.

Today I’m pleased to release initial analysis results from our monthly, 1-million-site measurement. This is the largest and most detailed measurement of online tracking to date, including measurements of stateful (cookie-based) and stateless (fingerprinting-based) tracking, the effect of browser privacy tools, and “cookie syncing”. These results represent a snapshot of web tracking, but the analysis is part of an effort to collect data on a monthly basis and analyze the evolution of web tracking and privacy over time.

Our measurement platform used for this study, OpenWPM, is already open source. Today, we’re making the datasets for this analysis available for download by the public. You can find download instructions on our study’s website.

New findings

We provide background information and summary of each of our main findings on our study’s website. The paper goes into even greater detail and provides the methodological details on the measurement and analysis of each finding. One of our more surprising findings was the discovery of two apparent attempts to use the HTML5 Audio API for fingerprinting.

The figure is a visualization of the audio processing executed on users’ browsers by third-party fingerprinting scripts. We found two different AudioNode configurations in use. In both configurations an audio signal is generated by an oscillator and the resulting signal is hashed to create an identifier. Initial testing shows that the techniques may have some limitations when used for fingerprinting, but further analysis is necessary. You can help us with that (and test your own device) by using our demonstration page here.
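The fingerprinting itself runs in the browser through the Web Audio API, but the final step, hashing the rendered samples into a compact identifier, can be sketched with a small simulation. Here a synthetic sine wave stands in for an OscillatorNode’s rendered buffer; the parameters are illustrative, not taken from any real script:

```python
import hashlib
import math

def audio_fingerprint(sample_rate=44100, freq=1000.0, n_samples=512):
    """Hash a deterministic waveform into an identifier, as fingerprinting
    scripts do with the rendered output of an oscillator."""
    samples = [math.sin(2 * math.pi * freq * i / sample_rate)
               for i in range(n_samples)]
    # In a browser, tiny floating-point differences between audio stacks
    # change these bytes, which is what makes the hash device-distinguishing.
    payload = ",".join(f"{s:.6f}" for s in samples).encode()
    return hashlib.sha256(payload).hexdigest()

print(audio_fingerprint()[:16])
```

Since this simulation is deterministic, every run produces the same hash; in the wild, the identifying variation comes from differences in how each browser and audio stack renders the signal.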

See the paper for our analysis of the consolidated third-party ecosystem, the effects of third parties on HTTPS adoption, and an examination of the performance of tracking-protection tools. In addition to audio fingerprinting, we show that canvas fingerprinting is being used by more third parties but on fewer sites, that a WebRTC feature can be and is being used for tracking, and how the HTML Canvas is being used to discover users’ fonts.

What’s next? We are exploring ways to share our data and analysis tools in a form that’s useful to a wider and less technical audience. As we continue to collect data, we will also perform longitudinal analyses of web tracking. In other ongoing research, we’re using the data we’ve collected to train machine-learning models to automatically detect tracking and fingerprinting.

The Web Privacy Problem is a Transparency Problem: Introducing the OpenWPM measurement tool

In a previous blog post I explored the success of our study, The Web Never Forgets, in having a positive impact on web privacy. To ensure a lasting impact, we’ve been doing monthly, automated 1-million-site measurement of tracking and privacy. Soon we’ll be releasing these datasets and our findings. But in this post I’d like to introduce our web measurement platform OpenWPM that we’ve built for this purpose. OpenWPM has been quickly gaining adoption — it has already been used by at least 6 other research groups, as well as journalists, regulators, and students for class projects. In this post, I’ll explain why we built OpenWPM, describe a previously unmeasured type of tracking we found using the tool, and show you how you can participate and contribute to this community effort.

This post is based on a talk I gave at the FTC’s PrivacyCon. You can watch the video online here.

Why monthly, large-scale measurements are necessary

In my previous post, I showed how measurements from academic studies can help improve online privacy, but I also pointed out how they can fall short. Measurement results often have an immediate impact on online privacy. But unless that impact leads to a technical, policy, or legal solution, it will taper off over time as the measurements age.

Technical solutions do not always exist for privacy violations. I discussed how canvas fingerprinting can’t be prevented without sacrificing usability in my previous blog post, but there are others as well. For example, it has proven difficult to find a satisfactory solution to the privacy concerns surrounding WebRTC’s access to local IPs. This is also highlighted in the unofficial W3C draft on Fingerprinting Guidance for Web Specification Authors, which states: “elimination of the capability of browser fingerprinting by a determined adversary through solely technical means that are widely deployed is implausible.”

It seems inevitable that measurement results will go out of date, for two reasons. Most obviously, there is a high engineering cost to running privacy studies. Equally important is the fact that academic papers in this area are published as much for their methodological novelty as for their measurement results. Updating the results of an earlier study is unlikely to lead to a publication, which takes away the incentive to do it at all. [1]

OpenWPM: our platform for automated, large-scale web privacy measurements

We built OpenWPM (GitHub, technical report), a generic platform for online tracking measurement. It provides the stability and instrumentation necessary to run many online privacy studies. Our goal in developing OpenWPM is to decrease the initial engineering cost of studies and make running a measurement as effortless as possible. It has already been used in several published studies from multiple institutions to detect and reverse-engineer online tracking.

OpenWPM also makes it possible to run large-scale measurements with Firefox, a real consumer browser [2]. Large scale measurement lets us compare the privacy practices of the most popular sites to those in the long tail. This is especially important when observing the use of a tracking technique highlighted in a measurement study. For example, we can check if it’s removed from popular sites but added to less popular sites.

Transparency through measurement, on 1 million sites

We are using OpenWPM to run the Princeton Transparency Census, a monthly web-scale measurement of tracking techniques and privacy issues, comprising 1 million sites. With it, we will be able to detect and measure many of the known privacy violations reported by researchers so far: the use of stateful tracking mechanisms, browser fingerprinting, cookie synchronization, and more.

During the measurements, we’ll collect data in three categories: (1) network traffic — all HTTP requests and response headers; (2) client-side state — cookies, Flash cookies, etc.; (3) execution traces — we trap and record targeted JavaScript API calls that have been known to be used for tracking. In addition to releasing all of the raw data collected during the census, we’ll release the results of our own automated analysis.

Alongside the 1 million site measurement, we are also running smaller, targeted measurements with different browser configurations. Examples include crawling deeper into the site or browsing with a privacy extension, such as Ghostery or AdBlock Plus. These smaller crawls will provide additional insight into the privacy threats faced by real users.

Detecting WebRTC local IP discovery

As a case study of the ease of introducing a new measurement into the infrastructure, I’ll walk through the steps I took to measure scripts that use WebRTC to discover a machine’s local IP address [3]. For machines behind a home router, this technique may reveal an IP of the form 192.168.1.*. For users of corporate or university networks, it may reveal a unique local IP address from within that organization’s IP range.

A user’s local IP address adds additional information to a browser fingerprint. For example, it can be used to differentiate multiple users behind a NAT without requiring browser state. The amount of identifying information it provides for the average user hasn’t been studied. However, both Chrome and Firefox [4] have implemented opt-in solutions to prevent the technique. The first reported use of this technique in the wild that I could find was by a third party on nytimes.com in July 2015.

After examining a demo script, I decided to record all property accesses and method calls of the RTCPeerConnection interface, the primary interface for WebRTC. The additional instrumentation necessary for this interface is just a single line of JavaScript in OpenWPM’s Firefox extension.

A preliminary analysis [5] of a 50,000 site pilot measurement from October 2015 suggests that WebRTC local IP discovery is used on the homepages of over 100 sites, from over 20 distinct scripts. Only 1 of these scripts would be blocked by EasyList or EasyPrivacy.
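A blocklist check of this kind can be approximated with a deliberately naive sketch. Real EasyList/EasyPrivacy matching uses a much richer filter syntax (anchors, options, exception rules); the rules and URLs below are hypothetical:

```python
# Hypothetical substring-style rules standing in for real blocklist filters.
filters = ["/webrtc-track.", "analytics.example/collect"]

def is_blocked(script_url, rules):
    """Return True if any (simplified) filter rule matches the script URL."""
    return any(rule in script_url for rule in rules)

# Hypothetical script URLs seen in a crawl.
for url in ["https://cdn.example.com/js/webrtc-track.min.js",
            "https://static.example.org/player.js"]:
    print(url, "->", "blocked" if is_blocked(url, filters) else "allowed")
```

A faithful check would use a real filter-list parser, but even this rough matching conveys how few of the discovered scripts appear on the lists.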

How this can be useful for you

We envision several ways researchers and other members of the community can make use of OpenWPM and our measurements. I’ve listed them here from least involved to most involved.

(1) Use our measurement data in their own tools. In my analysis of canvas fingerprinting I mentioned that Disconnect incorporated our research results into their blocklist. We want to make it easy for privacy tools to make use of the analyses we run, by releasing analysis data in a structured, machine-readable way.

(2) Use the data collected during our measurements, and build their own analysis on top of it. We know we’ll never be able to take the human element out of these studies. Detection methodologies will change; new browser features will be released and others will change. The depth of the Transparency measurements should make it easy to test new ideas, with the option of contributing them back to the regular crawls.

(3) Use OpenWPM to collect and release their own data. This is the model we see most web privacy researchers opting for, and a model we plan to use for most of our own studies. The platform can be used and tweaked as necessary for the individual study, and the measurement results and data can be shared publicly after the study is complete.

(4) Contribute to OpenWPM through pull requests. This is the deepest level of involvement we see. Other developers can write new features into the infrastructure for their own studies or to be run as part of our transparency measurements. Contributions here will benefit all users of OpenWPM.

Over the coming months we will release new blog posts and research results on the measurements I’ve discussed in this post. You can follow our progress here on Freedom to Tinker, on Twitter @s_englehardt, and on our Github repository.

 

[1] Notable exceptions include the studies of cookie respawning (2009, 2011, 2011, 2014) and the statistics on stateful tracking use and third-party inclusion (2009, 2012, 2012, 2012, 2015).

[2] Crawling with a real browser is important for two reasons: (1) it’s less likely to be detected as a bot, meaning we’re less likely to receive different treatment from a normal user, and (2) a real browser supports all the modern web features (e.g. WebRTC, HTML5 audio and video), plugins (e.g. Flash), and extensions (e.g. Ghostery, HTTPS Everywhere). Many of these additional features play a large role in the average user’s privacy online.

[3] There is an additional concern that WebRTC can be used to determine a VPN user’s actual IP address; however, that attack is distinct from the one described in this post.

[4] uBlock Origin also provides an option to prevent WebRTC local IP discovery on Firefox.

[5] We are in the process of running and verifying this analysis on our 1-million-site measurements, and will release an updated analysis with more details in the future.