June 29, 2017

Web Census Notebook: A new tool for studying web privacy

As part of the Web Transparency and Accountability Project, we’ve been visiting the web’s top 1 million sites every month using our open-source privacy measurement tool OpenWPM. This has led to numerous worrying findings such as the systematic abuse of newly introduced web features for fingerprinting, leading to better privacy tools and occasionally strong responses from browser vendors.

Enabling research is great — OpenWPM has led to 14 papers so far — but research is slow and requires expertise. To make our work more directly useful, today we’re announcing a new tool to study web privacy: a Jupyter notebook interface and a set of libraries to quickly answer most questions about web tracking by querying the 500 GB of data we collect every month.

Jupyter notebook is an intuitive tool for data analysis using Python, and it’s what we use internally for much of our own research. Notebooks are accessible through a simple web interface, yet the code, data, and memory persist on the server even if you close the browser and return later (even from a different device). Notebooks combine code with visualizations, making them ideal for data exploration and analysis.

Who could benefit from this tool? We envision uses such as these:

  • Publishers could use our data to understand third-party tracking on their own websites.
  • Journalists could use our data to investigate and expose privacy-infringing practices.
  • Regulators and enforcement agencies could use our tool in investigations.
  • Creators of browser privacy tools could use our data to test their effectiveness.

Let’s look at an example that shows the feel of the interface. The code below computes the average number of embedded trackers on the top 100 websites in various categories such as “news” and “shopping”. It is intuitive and succinct. Without our interface, not only would the SQL version of this query be much more cumbersome, but it would require a ton of legwork and setup to even get to a point where you can write the query. Now you just need to point your browser at our notebook.

    # Average number of embedded trackers among the top 100 sites in each category
    for category, domains in census.first_parties.alexa_categories.items():
        avg = sum(1 for first_party in domains[:100]
                    for third_party in first_party.third_party_resources
                    if third_party.is_tracker) / 100
        print("Average number of trackers on %s sites: %.1f" % (category, avg))

The results confirm our finding that news sites have the most trackers, and adult sites the least. [1]

Here’s what happens behind the scenes:

  • census is a Python object that exposes all the relationships between websites and third parties as object attributes, hiding the messy details of the underlying database schema. Each first party is represented by a FirstParty object that gives access to each third-party resource (URI object) on the first party, and the ThirdParty that the URI belongs to. When the objects are accessed, they are instantiated automatically by querying the database.
  • census.first_parties is a container of FirstParty objects ordered by Alexa traffic rank, so you can easily analyze the top sites, or sites in the long tail, or specific sites. You can also easily slice the sites by category: in the example above, we iterate through each category of census.first_parties.alexa_categories.
  • There’s a fair bit of logic that goes into analyzing the crawl data to determine which third parties are embedded on which websites, and cross-referencing that with tracking-protection lists to figure out which of those are trackers. This work is already done for you, and exposed via attributes such as ThirdParty.is_tracker.
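To give a feel for how lazy, database-backed objects like these can work, here is a minimal sketch. The class and method names below (fetch_third_parties, the tracking-list representation) are hypothetical stand-ins, not the actual census implementation:

```python
# Sketch of lazy, database-backed objects in the spirit of the census API.
# Names here are illustrative; the real schema and code differ.

class ThirdParty:
    def __init__(self, domain, tracking_lists):
        self.domain = domain
        self._tracking_lists = tracking_lists

    @property
    def is_tracker(self):
        # Cross-reference the domain against tracking-protection lists.
        return any(self.domain in lst for lst in self._tracking_lists)


class FirstParty:
    def __init__(self, site, db):
        self.site = site
        self._db = db
        self._third_parties = None  # not loaded yet

    @property
    def third_party_resources(self):
        # Hit the database only on first access, then cache the result.
        if self._third_parties is None:
            self._third_parties = self._db.fetch_third_parties(self.site)
        return self._third_parties
```

The point of the property-based design is that `first_party.third_party_resources` reads like plain attribute access in analysis code, while the query and caching happen behind the scenes.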

Since the notebooks run on our server, we expect to be able to support only a limited number (a few dozen) at this point, so you need to apply for access. The tool is currently in beta as we smooth out rough edges and add features, but it is usable and useful. Of course, you’re welcome to run the notebook on your own server — the underlying crawl datasets are public, and we’ll release the code behind the notebooks soon. We hope you find this tool useful, and we welcome your feedback.


[1] The linked graph from our paper measures the number of distinct domains whereas the query above counts every instance of every tracker. The trends are the same in both cases, but the numbers are different. Here’s the output of the query:


Average number of third party trackers on computers sites: 41.0
Average number of third party trackers on regional sites: 68.8
Average number of third party trackers on recreation sites: 58.2
Average number of third party trackers on health sites: 38.4
Average number of third party trackers on news sites: 151.2
Average number of third party trackers on business sites: 55.0
Average number of third party trackers on kids_and_teens sites: 74.8
Average number of third party trackers on home sites: 94.5
Average number of third party trackers on arts sites: 108.6
Average number of third party trackers on sports sites: 86.6
Average number of third party trackers on reference sites: 43.8
Average number of third party trackers on science sites: 43.1
Average number of third party trackers on society sites: 73.5
Average number of third party trackers on shopping sites: 53.1
Average number of third party trackers on adult sites: 16.8
Average number of third party trackers on games sites: 70.5

The future of ad blocking

There’s an ongoing arms race between ad blockers and websites — more and more sites either try to sneak their ads through or force users to disable ad blockers. Most previous discussions have assumed that this is a cat-and-mouse game that will escalate indefinitely. But in a new paper, accompanied by proof-of-concept code, we challenge this claim. We believe that due to the architecture of web browsers, there’s an inherent asymmetry that favors users and ad blockers. We have devised and prototyped several ad blocking techniques that work radically differently from current ones. We don’t claim to have created an undefeatable ad blocker, but we identify an evolving combination of technical and legal factors that will determine the “end game” of the arms race.

Our project began last summer when Facebook announced that it had made ads look just like regular posts, and hence impossible to block. Indeed, Adblock Plus and other mainstream ad blockers have been ineffective on Facebook ever since. But Facebook’s human users have to be able to tell ads apart because of laws against misleading advertising. So we built a tool that detects Facebook ads the same way a human would, deliberately ignoring hidden HTML markup that can be obfuscated. (Adblock Plus, on the other hand, is designed to be able to examine only the markup of web pages and not the content.) Our Chrome extension has several thousand users and continues to be effective.
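The core idea — classify an element as an ad based only on what a human would see, ignoring hidden or obfuscated markup — can be illustrated with a toy sketch. The element structure and cue strings below are hypothetical simplifications; the actual extension operates on the live Facebook DOM:

```python
# Toy sketch of perceptual ad detection: decide based only on the text a
# human would actually see. Element is a hypothetical stand-in for a DOM node.

DISCLOSURE_CUES = {"sponsored", "suggested post"}

class Element:
    def __init__(self, text="", hidden=False, children=()):
        self.text = text
        self.hidden = hidden          # e.g. display:none in the real DOM
        self.children = list(children)

def visible_text(el):
    """Collect only the text that is actually rendered to the user."""
    if el.hidden:
        return ""
    parts = [el.text]
    parts += [visible_text(c) for c in el.children]
    return " ".join(p for p in parts if p)

def looks_like_ad(post):
    # Ignore hidden markup entirely; match only visible disclosure cues,
    # which the site cannot obfuscate without misleading its human users.
    text = visible_text(post).lower()
    return any(cue in text for cue in DISCLOSURE_CUES)
```

Because the disclosure must stay legible to humans, obfuscating the markup behind it doesn't help the site: the classifier keys on the same signal the user does.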

We’ve built on this early success. Laws against misleading advertising apply not just on Facebook, but everywhere on the web. Due to these laws and in response to public-relations pressure, the online ad industry has developed robust self-regulation that standardizes the disclosure of ads across the web. Once again, ad blockers can exploit this, and that’s what our perceptual ad blocker does. [1]

The second prong of an ad blocking strategy is to deal with websites that try to detect (and in turn block) ad blockers. To do this, we introduce the idea of stealth. The only way that a script on a web page can “see” what’s drawn on the screen is to ask the user’s browser to describe it. But ad blocking extensions can control the browser! Not perfectly, but well enough to get the browser to convincingly lie to the web page script about the very existence of the ad blocker. Our proof-of-concept stealthy ad blocker successfully blocked ads and hid its existence on all 50 websites we looked at that are known to deploy anti-adblocking scripts. Finally, we have also investigated ways to detect and block the ad blocking detection scripts themselves. We found that this is feasible but cumbersome; at any rate, it is unnecessary as long as stealthy ad blocking is successful.
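The stealth idea can be modeled as the blocker maintaining two views of the page: a cleaned view that is actually rendered, and an untouched view used to answer page scripts. This is a simplified sketch with hypothetical names, not the browser-extension machinery described in the paper:

```python
# Toy model of "stealth": the user sees the cleaned page, while scripts
# that ask the browser about the page are answered from an untouched copy.

import copy

class StealthyBlocker:
    def __init__(self, page_elements, is_ad):
        # Untouched copy: the view reported to page scripts.
        self.reported_view = copy.deepcopy(page_elements)
        # Cleaned copy: the view the user actually sees.
        self.rendered_view = [e for e in page_elements if not is_ad(e)]

    def query_from_page_script(self, element_id):
        # An anti-adblock script probing whether its ad still "exists"
        # is answered from the original view, so it observes no blocking.
        return any(e["id"] == element_id for e in self.reported_view)
```

In a real browser the "query" path covers many APIs (layout geometry, computed styles, and so on), which is where the messy details come in, but the asymmetry is the same: the extension mediates every answer the page script receives.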

The details of all these techniques get extremely messy, and we encourage the interested reader to check out the paper. While some of the details may change, we’re confident of our long-term assessment. That’s because our techniques are all based on sound computer security principles and because we’ve devised a state diagram that describes the possible actions of websites and ad blockers, bringing much-needed clarity to the analysis and helping ensure that there won’t be completely new techniques coming out of left field in the future.

There’s a final wrinkle: the publishing and advertising industries have put forth a number of creative reasons to argue that ad blockers violate the law, and indeed Adblock Plus has been sued several times (without success so far). We carefully analyzed four bodies of law that may support such legal claims, and conclude that the law does not stand in the way of deploying sophisticated ad blocking techniques. [2] That said, we acknowledge that the ethics of ad blocking are far from clear cut. Our research is about what can be done and not what should be done; we look forward to participating in the ethical debate.

This post was edited to update the link to the paper to the arXiv version (original paper link).

[1] To avoid taking sides on the ethics of ad blocking, we have deliberately stopped short of making our proof-of-concept tool fully functional — it is configured to detect ads but not actually block them.

[2] One of the authors is cyberlaw expert Jonathan Mayer.

Sign up now for the first workshop on Data and Algorithmic Transparency

I’m excited to announce that registration for the first workshop on Data and Algorithmic Transparency is now open. The workshop will take place at NYU on Nov 19. It convenes an emerging interdisciplinary community that seeks transparency and oversight of data-driven algorithmic systems through empirical research.

Despite the short notice of the workshop’s announcement (about six weeks before the submission deadline), we were pleasantly surprised by the number and quality of the submissions that we received. We ended up accepting 15 papers, more than we’d originally planned to, and still had to turn away good papers. The program includes both previously published work and original papers submitted to the workshop, and has just the kind of multidisciplinary mix we were looking for.

We settled on a format that’s different from the norm but probably familiar to many of you. We have five panels, one on each of the five main themes that emerged from the papers. The panels will begin with brief presentations, with the majority of the time devoted to in-depth discussions led by one or two commenters who will have read the papers beforehand and will engage with the authors. Audience participation is welcome; to enable productive discussion, we encourage you to read or skim the papers beforehand. The previously published papers are available to read; the original papers will be made available in a few days.

I’m very grateful to everyone on our program committee for their hard work in reviewing and selecting papers. We received very positive feedback from authors on the quality of reviews of the original papers, and I was impressed by the work that the committee put in.

Finally, note that the workshop will take place at NYU rather than Columbia as originally announced. We learnt some lessons on the difficulty of finding optimal venues in New York City on a limited budget. Thanks to Solon Barocas and Augustin Chaintreau for their efforts in helping us find a suitable venue!

See you in three weeks, and don’t forget the related and colocated DTL and FAT-ML events.