June 28, 2017

Web Census Notebook: A new tool for studying web privacy

As part of the Web Transparency and Accountability Project, we’ve been visiting the web’s top 1 million sites every month using our open-source privacy measurement tool OpenWPM. This has produced numerous worrying findings, such as the systematic abuse of newly introduced web features for fingerprinting, and has led to better privacy tools and occasionally strong responses from browser vendors.

Enabling research is great — OpenWPM has led to 14 papers so far — but research is slow and requires expertise. To make our work more directly useful, today we’re announcing a new tool to study web privacy: a Jupyter notebook interface and a set of libraries to quickly answer most questions about web tracking by querying the 500 GB of data we collect every month.

Jupyter notebook is an intuitive tool for data analysis using Python, and it’s what we use internally for much of our own research. Notebooks are accessible through a simple web interface, yet the code, data, and memory persist on the server if you close the browser and return later (even from a different device). Notebooks combine code with visualizations, making them ideal for data exploration and analysis.

Who could benefit from this tool? We envision uses such as these:

  • Publishers could use our data to understand third-party tracking on their own websites.
  • Journalists could use our data to investigate and expose privacy-infringing practices.
  • Regulators and enforcement agencies could use our tool in investigations.
  • Creators of browser privacy tools could use our data to test their effectiveness.

Let’s look at an example that shows the feel of the interface. The code below computes the average number of embedded trackers on the top 100 websites in various categories such as “news” and “shopping”. It is intuitive and succinct. Without our interface, not only would the SQL version of this query be much more cumbersome, but it would require a ton of legwork and setup to even get to a point where you can write the query. Now you just need to point your browser at our notebook.

    # For each Alexa category, look at its top 100 sites and count every
    # embedded third-party resource that is a known tracker.
    for category, domains in census.first_parties.alexa_categories.items():
        avg = sum(1 for first_party in domains[:100]
                    for third_party in first_party.third_party_resources
                    if third_party.is_tracker) / 100
        print("Average number of third party trackers on %s sites: %.1f" % (category, avg))

The results confirm our finding that news sites have the most trackers, and adult sites the least. [1]

Here’s what happens behind the scenes:

  • census is a Python object that exposes all the relationships between websites and third parties as object attributes, hiding the messy details of the underlying database schema. Each first party is represented by a FirstParty object that gives access to each third-party resource (URI object) on the first party, and the ThirdParty that the URI belongs to. When the objects are accessed, they are instantiated automatically by querying the database.
  • census.first_parties is a container of FirstParty objects ordered by Alexa traffic rank, so you can easily analyze the top sites, or sites in the long tail, or specific sites. You can also easily slice the sites by category: in the example above, we iterate through each category of census.first_parties.alexa_categories.
  • There’s a fair bit of logic that goes into analyzing the crawl data to determine which third parties are embedded on which websites, and into cross-referencing that with tracking-protection lists to figure out which of those are trackers. This work is already done for you and exposed via attributes such as ThirdParty.is_tracker; the sketch below shows these pieces in action.
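As a concrete illustration, here is a minimal sketch of navigating this object model, assuming (as in the example above) that the census object is preloaded in the notebook. It uses only the attributes that appear in the example above; indexing first_parties by rank is our assumption, extrapolated from the slicing shown earlier:

    # Minimal sketch: count the trackers embedded on the top-ranked site.
    # `census` is preloaded in the notebook environment; indexing
    # first_parties by rank is assumed to work like the slicing above.
    top_site = census.first_parties[0]   # FirstParty objects, ordered by Alexa rank
    trackers = [resource                 # each resource is a URI object
                for resource in top_site.third_party_resources
                if resource.is_tracker]  # matched against tracking-protection lists
    print("Trackers on the top-ranked site: %d" % len(trackers))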

Since the notebooks run on our server, we expect to be able to support only a limited number of users (a few dozen) at this point, so you need to apply for access. The tool is currently in beta as we smooth out rough edges and add features, but it is usable and useful. Of course, you’re welcome to run the notebook on your own server — the underlying crawl datasets are public, and we’ll release the code behind the notebooks soon. We hope you find it useful, and we welcome your feedback.


[1] The linked graph from our paper measures the number of distinct domains, whereas the query above counts every instance of every tracker. The trends are the same in both cases, but the numbers are different. Here’s the output of the query:


    Average number of third party trackers on computers sites: 41.0
    Average number of third party trackers on regional sites: 68.8
    Average number of third party trackers on recreation sites: 58.2
    Average number of third party trackers on health sites: 38.4
    Average number of third party trackers on news sites: 151.2
    Average number of third party trackers on business sites: 55.0
    Average number of third party trackers on kids_and_teens sites: 74.8
    Average number of third party trackers on home sites: 94.5
    Average number of third party trackers on arts sites: 108.6
    Average number of third party trackers on sports sites: 86.6
    Average number of third party trackers on reference sites: 43.8
    Average number of third party trackers on science sites: 43.1
    Average number of third party trackers on society sites: 73.5
    Average number of third party trackers on shopping sites: 53.1
    Average number of third party trackers on adult sites: 16.8
    Average number of third party trackers on games sites: 70.5

Breaking, fixing, and extending zero-knowledge contingent payments

The problem of fair exchange arises often in business transactions — especially when those transactions are conducted anonymously over the internet. Alice would like to buy a widget from Bob, but there’s a circular problem: Alice refuses to pay Bob until she receives the widget, whereas Bob refuses to send Alice the widget until he receives payment.

In a previous post, we described the fair-exchange problem and solutions for buying physical goods using Bitcoin. Today’s post is the first of two in which we focus on purchasing digital goods and services. In a new paper together with researchers from City University of New York and IMDEA Software Institute, Madrid, we show that zero-knowledge contingent payments (ZKCP), a well-known protocol for the fair exchange of digital goods over the blockchain, is insecure in its current form. We show how to fix ZKCP, and also extend it to a new class of problems. [Read more…]

What does it mean to ask for an “explainable” algorithm?

One of the standard critiques of using algorithms for decision-making about people, and especially for consequential decisions about access to housing, credit, education, and so on, is that the algorithms don’t provide an “explanation” for their results or the results aren’t “interpretable.” This is a serious issue, but discussions of it are often frustrating. The reason, I think, is that different people mean different things when they ask for an explanation of an algorithm’s results.

Before unpacking the different flavors of explainability, let’s stop for a moment to consider that the alternative to algorithmic decision-making is human decision-making. And let’s consider the drawbacks of relying on the human brain, a mechanism that is notoriously complex, difficult to understand, and prone to bias. Surely an algorithm is more knowable than a brain. After all, with an algorithm it is possible to examine exactly which inputs factored into a decision, and every detailed step of how these inputs were used to get to a final result. Brains are inherently less transparent, and no less biased. So why might we still complain that the algorithm does not provide an explanation?

We should also dispense with cases where the algorithm is just inaccurate: a well-informed analyst can understand the algorithm but will see it as producing answers that are wrong. That is a problem, but it is not a problem of explainability.

So what are people asking for when they say they want an explanation? I can think of at least four types of explainability problems.

The first type of explainability problem is a claim of confidentiality. Somebody knows relevant information about how a decision was made, but they choose to withhold it because they claim it is a trade secret, or that disclosing it would undermine security somehow, or that they simply prefer not to reveal it. This is not a problem with the algorithm, it’s an institutional/legal problem.

The second type of explainability problem is complexity. Here everything about the algorithm is known, but somebody feels that the algorithm is so complex that they cannot understand it. It will always be possible to answer what-if questions, such as how the algorithm’s result would have been different had the person been one year older, or had an extra $1000 of annual income, or had one fewer prior misdemeanor conviction, or whatever. So complexity can only be a barrier to big-picture understanding, not to understanding which factors might have changed a particular person’s outcome.
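To make this concrete, here is a minimal sketch of such what-if probing. The scoring function is a toy stand-in of our own invention, not any real credit model; the point is only that counterfactual questions can be answered by re-running an opaque algorithm on perturbed inputs, with no need to understand its internals:

    # A toy, entirely hypothetical scoring function standing in for a complex,
    # opaque decision algorithm. The probe below treats it as a black box.
    def score(applicant):
        return (0.3 * applicant["age"] + 0.0001 * applicant["income"]
                - 5.0 * applicant["priors"])

    def what_if(applicant, field, delta):
        """How would the result change if one input were different?"""
        modified = dict(applicant)
        modified[field] += delta
        return score(modified) - score(applicant)

    applicant = {"age": 34, "income": 52000, "priors": 1}
    print(what_if(applicant, "age", 1))        # one year older
    print(what_if(applicant, "income", 1000))  # an extra $1000 of annual income
    print(what_if(applicant, "priors", -1))    # one fewer prior misdemeanor conviction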

The third type of explainability problem is unreasonableness. Here the workings of the algorithm are clear, and are justified by statistical evidence, but the result doesn’t seem to make sense. For example, imagine that an algorithm for making credit decisions considers the color of a person’s socks, and this is supported by unimpeachable scientific studies showing that sock color correlates with defaulting on credit, even when controlling for other factors. So the decision to factor in sock color may be justified on a rational basis, but many would find it unreasonable, even if it is not discriminatory in any way. Perhaps this is not a complaint about the algorithm but a complaint about the world: the algorithm is using a fact about the world, but nobody understands why the world is that way. What is difficult to explain in this case is not the algorithm, but the world that it is predicting.

The fourth type of explainability problem is injustice. Here the workings of the algorithm are understood but we think they are unfair, unjust, or morally wrong. In this case, when we say we have not received an explanation, what we really mean is that we have not received an adequate justification for the algorithm’s design. The problem is not that nobody has explained how the algorithm works or how it arrived at the result it did. Instead, the problem is that it seems impossible to explain how the algorithm is consistent with law or ethics.

It seems useful, when discussing the explanation problem for algorithms, to distinguish these four cases (and any others that people might come up with) so that we can zero in on what the problem is. In the long run, all of these types of complaints are addressable, so perhaps explainability is not a unique problem for algorithms but rather a set of commonsense principles that any system, algorithmic or not, must attend to.