June 23, 2017

Web Census Notebook: A new tool for studying web privacy

As part of the Web Transparency and Accountability Project, we’ve been visiting the web’s top 1 million sites every month using our open-source privacy measurement tool OpenWPM. This has led to numerous worrying findings such as the systematic abuse of newly introduced web features for fingerprinting, leading to better privacy tools and occasionally strong responses from browser vendors.

Enabling research is great — OpenWPM has led to 14 papers so far — but research is slow and requires expertise. To make our work more directly useful, today we’re announcing a new tool to study web privacy: a Jupyter notebook interface and a set of libraries to quickly answer most questions about web tracking by querying the the 500 GB of data we collect every month.

Jupyter notebook is an intuitive tool for data analysis using Python, and it’s what we use here internally for much of our own research. Notebooks are accessible with a simple web interface; yet the code, data, and memory persists on the server if you close the browser and return to it later (even from a different device). Notebooks combine code with visualizations, making them ideal for data exploration and analysis.

Who could benefit from this tool? We envision uses such as these:

  • Publishers could use our data to understand third-party tracking on their own websites.
  • Journalists could use our data to investigate and expose privacy-infringing practices.
  • Regulators and enforcement agencies could use our tool in investigations.
  • Creators of browser privacy tools could use our data to test their effectiveness.

Let’s look at an example that shows the feel of the interface. The code below computes the average number of embedded trackers on the top 100 websites in various categories such as “news” and “shopping”. It is intuitive and succinct. Without our interface, not only would the SQL version of this query be much more cumbersome, but it would require a ton of legwork and setup to even get to a point where you can write the query. Now you just need to point your browser at our notebook.

    for category, domains in census.first_parties.alexa_categories.items():
        avg = sum(1 for first_party in domains[:100]
                    for third_party in first_party.third_party_resources
                    if third_party.is_tracker) / 100
        print("Average number of trackers on %s sites: %.1f" % (category, avg))

The results confirm our finding that news sites have the most trackers, and adult sites the least. [1]

Here’s what happens behind the scenes:

  • census is a Python object that exposes all the relationships between websites and third parties as object attributes, hiding the messy details of the underlying database schema. Each first party is represented by a FirstParty object that gives access to each third-party resource (URI object) on the first party, and the ThirdParty that the URI belongs to. When the objects are accessed, they are instantiated automatically by querying the database.
  • census.first_parties is a container of FirstParty objects ordered by Alexa traffic rank, so you can easily analyze the top sites, or sites in the long tail, or specific sites. You can also easily slice the sites by category: in the example above, we iterate through each category of census.first_parties.alexa_categories.
  • There’s a fair bit of logic that goes into analyzing the crawl data which third parties are embedded on which websites, and cross-referencing that with tracking-protection lists to figure out which of those are trackers. This work is already done for you, and exposed via attributes such as ThirdParty.is_tracker.

Since the notebooks run on our server, we expect to be able to support only a limited number (a few dozen) at this point, so you need to apply for access. The tool is currently in beta as we smooth out rough edges and add features, but it is usable and useful. Of course, you’re welcome to run the notebook on your own server — the underlying crawl datasets are public, and we’ll release the code behind the notebooks soon. We hope you find this of use to you, and we welcome your feedback.


[1] The linked graph from our paper measures the number of distinct domains whereas the query above counts every instance of every tracker. The trends are the same in both cases, but the numbers are different. Here’s the output of the query:


Average number of third party trackers on computers sites: 41.0
Average number of third party trackers on regional sites: 68.8
Average number of third party trackers on recreation sites: 58.2
Average number of third party trackers on health sites: 38.4
Average number of third party trackers on news sites: 151.2
Average number of third party trackers on business sites: 55.0
Average number of third party trackers on kids_and_teens sites: 74.8
Average number of third party trackers on home sites: 94.5
Average number of third party trackers on arts sites: 108.6
Average number of third party trackers on sports sites: 86.6
Average number of third party trackers on reference sites: 43.8
Average number of third party trackers on science sites: 43.1
Average number of third party trackers on society sites: 73.5
Average number of third party trackers on shopping sites: 53.1
Average number of third party trackers on adult sites: 16.8
Average number of third party trackers on games sites: 70.5

Can Facebook really make ads unblockable?

[This is a joint post with Grant Storey, a Princeton undergraduate who is working with me on a tool to help users understand Facebook’s targeted advertising.]

Facebook announced two days ago that it would make its ads indistinguishable from regular posts, and hence impossible to block. But within hours, the developers of Adblock Plus released an update which enabled the tool to continue blocking Facebook ads. The ball is now back in Facebook’s court. So far, all it’s done is issue a rather petulant statement. The burning question is this: can Facebook really make ads indistinguishable from content? Who ultimately has the upper hand in the ad blocking wars?

There are two reasons — one technical, one legal — why we don’t think Facebook will succeed in making its ads unblockable, if a user really wants to block them.

The technical reason is that the web is an open platform. When you visit facebook.com, Facebook’s server sends your browser the page content along with instructions on how to render them on the screen, but it is entirely up to your browser to follow those instructions. The browser ultimately acts on behalf of the user, and gives you — through extensions — an extraordinary degree of control over its behavior, and in particular, over what gets displayed on the screen. This is what enables the ecosystem of ad-blocking and tracker-blocking extensions to exist, along with extensions for customizing web pages in various other interesting ways.

Indeed, the change that Adblock Plus made in order to block the new, supposedly unblockable ads is just a single line in the tool’s default blocklist:

facebook.com##div[id^="substream_"] div[id^="hyperfeed_story_id_"][data-xt]

What’s happening here is that Facebook’s HTML code for ads has slight differences from the code for regular posts, so that Facebook can keep things straight for its own internal purposes. But because of the open nature of the web, Facebook is forced to expose these differences to the browser and to extensions such as Adblock Plus. The line of code above allows Adblock Plus to distinguish the two categories by exploiting those differences.

Facebook engineers could try harder to obfuscate the differences. For example, they could use non-human-readable element IDs to make it harder to figure out what’s going on, or even randomize the IDs on every page load. We’re surprised they’re not already doing this, given the grandiose announcement of the company’s intent to bypass ad blockers. But there’s a limit to what Facebook can do. Ultimately, Facebook’s human users have to be able to tell ads apart, because failure to clearly distinguish ads from regular posts would run headlong into the Federal Trade Commission’s rules against misleading advertising — rules that the commission enforces vigorously. [1, 2] And that’s the second reason why we think Facebook is barking up the wrong tree.

Facebook does allow human users to easily recognize ads: currently, ads say “Sponsored” and have a drop-down with various ad-related functions, including a link to the Ad Preferences page. And that means someone could create an ad-blocking tool that looks at exactly the information that a human user would look at. Such a tool would be mostly immune to Facebook’s attempts to make the HTML code of ads and non-ads indistinguishable. Again, the open nature of the web means that blocking tools will always have the ability to scan posts for text suggestive of ads, links to Ad Preferences pages, and other markers.

But don’t take our word for it: take our code for it instead. We’ve created a prototype tool that detects Facebook ads without relying on hidden HTML code to distinguish them. [Update: the source code is here.] The extension examines each post in the user’s news feed and marks those with the “Sponsored” link as ads. This is a simple proof of concept, but the detection method could easily be made much more robust without incurring a performance penalty. Since our tool is for demonstration purposes, it doesn’t block ads but instead marks them as shown in the image below.  

All of this must be utterly obvious to the smart engineers at Facebook, so the whole “unblockable ads” PR push seems likely to be a big bluff. But why? One possibility is that it’s part of a plan to make ad blockers look like the bad guys. Hand in hand, the company seems to be making a good-faith effort to make ads more relevant and give users more control over them. Facebook also points out, correctly, that its ads don’t contain active code and aren’t delivered from third-party servers, and therefore aren’t as susceptible to malware.

Facebook does deserve kudos for trying to clean up and improve the ad experience. If there is any hope for a peaceful resolution to the ad blocking wars, it is that ads won’t be so annoying as to push people to install ad blockers, and will be actually useful at least some of the time. If anyone can pull this off, it is Facebook, with the depth of data it has about its users. But is Facebook’s move too little, too late? On most of the rest of the web, ads continue to be creepy malware-ridden performance hogs, which means people will continue to install ad blockers, and as long as it is technically feasible for ad blockers to block Facebook ads, they’re going to continue to do so. Let’s hope there’s a way out of this spiral.

[1] Obligatory disclaimer: we’re not lawyers.

[2] Facebook claims that Adblock Plus’s updates “don’t just block ads but also posts from friends and Pages”. What they’re most likely referring to that Adblock Plus blocks ads that are triggered by one of your friends Liking the advertiser’s page. But these are still ads: somebody paid for them to appear in your feed. Facebook is trying to blur the distinction in its press statement, but it can’t do that in its user interface, because that is exactly what the FTC prohibits.

The workshop on Data and Algorithmic Transparency

From online advertising to Uber to predictive policing, algorithmic systems powered by personal data affect more and more of our lives. As our society begins to grapple with the consequences of this shift, empirical investigation of these systems has proved vital to understand the potential for discrimination, privacy breaches, and vulnerability to manipulation.

This emerging field of research, which we’re calling Data and Algorithmic Transparency, seems poised to grow dramatically. But it faces a number of methodological challenges which can only be solved by bringing together expertise from a variety of disciplines. That is why Alan Mislove and I are organizing the first workshop on Data and Algorithmic Transparency at Columbia University on Nov 19, 2016.

Here are three reasons you should participate in this workshop.

  1. Start of a new, interdisciplinary community. The set of disciplines represented on the Program Committee is strikingly diverse: Internet measurement, information privacy/security, computer systems, human-computer interaction, law, and media studies. Industrial research and government are also represented. We expect the workshop itself to have a similar mix of participants, and that is exactly what is needed to make transparency research a success. Alan and I (and others including Nikolaos Laoutaris) are committed to growing and nurturing this community over the next several years.
  1. Co-located with two other exciting events: the Data Transparency Lab conference (DTL ‘16) and the Fairness, Accountability, and Transparency in Machine Learning workshop (FAT-ML ‘16). DTL shares many of the goals of the DAT workshop, but is non-academic. FAT-ML has a complementary relationship with the goals of DAT: it seeks to develop machine learning techniques for developers of algorithmic systems to improve fairness and accountability, whereas DAT seeks to analyze existing systems, typically “from the outside”. The events are consecutive and non-overlapping, and participants of each event are encouraged to attend the others.
  1. A format that makes the most of everyone’s time. At most computer science conferences, each speaker mumbles through their slides while the audience is a sea of laptops, awaiting their turn. DAT will be the opposite. We plan to have paper discussions instead of paper presentations, with commenters and participants, rather than authors, doing most of the speaking about each paper. This first edition of DAT will be non-archival (but peer-reviewed), and one goal of the discussions is to help authors improve their papers for later publication. We are also soliciting talk proposals about already published work; groups of accepted talks will be organized into panels.

See you in New York City!