September 12, 2024

Do privacy studies help? A Retrospective look at Canvas Fingerprinting

It seems like every month we hear of some new online privacy violation in the news, on topics such as fingerprinting or web tracking. Many of these news stories highlight academic research. What we don’t see is whether these studies and the subsequent news stories have any impact on privacy.

Our 2014 canvas fingerprinting measurement offers an opportunity for me to provide that insight, as we ended up receiving a surprising amount of press coverage after releasing the paper. In this post I’ll examine the reaction to the paper and explore which aspects contributed to its positive impact on privacy. I’ll also explore how we can use this knowledge when designing future studies to maximize their impact.

What we found in 2014

The 2014 measurement paper, The Web Never Forgets, is a collaboration with researchers at KU Leuven. In it, we measured the prevalence of three persistent tracking techniques online: canvas fingerprinting, cookie respawning, and cookie syncing [1]. They are persistent in that are hard to control, hard to detect, and resilient to blocking or removing.

We found that 5% of the top 100,000 sites were utilizing the HTML5 Canvas API as a fingerprinting vector. The overwhelming majority of which, 97%, was caused by the top two providers. The ability to use the HTML5 Canvas as a fingerprinting vector was first introduced in a 2012 paper by Mowery and Shacham. In the time between that 2012 paper and our 2014 measurement, approximately 20 sites and trackers started using canvas to fingerprint their visitors.

Several examples of the text written to the canvas for fingerprinting purposes. Each of these images would be converted to strings and then hashed to create an identifier.

The reaction to our study

Shortly after we released our paper publicly, we saw a significant amount of press coverage, including articles on ProPublica, BBC, Der Spiegel, and more. The amount of coverage our paper received was a surprise for us; we weren’t the creators of the method, and we certainly weren’t the first to report on the fingerprintability of browsers [2]. Just two days later, AddThis stopped using canvas fingerprinting. The second largest provider at the time, Ligatus, also stopped using the technique.

As can be expected, we saw many users take their frustrations to Twitter. There are users who wondered why publishers would fingerprint them:

complained about AddThis:

and expressed their dislike for canvas fingerprinting in general:

We even saw a user question as to why Mozilla does not protect against canvas fingerprinting in Firefox:

However a general technical solution which preserves the API’s usefulness and usability doesn’t exist [3]. Instead the best solutions are either blocking the feature or blocking the trackers which use it.

The developer community responded by releasing canvas blocking extensions for Firefox and Chrome, tools which are used by over 18,000 users in total. AdBlockPlus and Disconnect both commented that the large trackers are already on their block lists, with Disconnect mentioning that the additional, lesser-known parties from our study would be added to their lists.

Why was our study so impactful?

Much of the online privacy problem is actually a transparency problem. By default, users have very little information on the privacy practices of the websites they visit, and of the trackers included on those sites. Without this information users are unable to differentiate between sites which take steps to protect their privacy and sites which don’t. This leads to less of an incentive for site owners to protect the privacy of their users, as online privacy often comes at the expense of additional ad revenue or third-party features.

With our study, we were not only able to remove this information asymmetry [4], but were able to do so in a way that was relatable to users. The visual representation of canvas fingerprinting proved particularly helpful in removing that asymmetry of information; it was very intuitive to see how the shapes drawn to a canvas could produce a unique image. The ProPublica article even included a demo where users could see their fingerprint built in real time.

While writing the paper we made it a point to include not only the trackers responsible for fingerprinting, but to also include the sites on which the fingerprinting was taking place. Instead of reading that tracker A was responsible for fingerprinting, they could understand that it occurs when they visit publishers X, Y and Z. If a user is frustrated by a technique, and is only familiar with the tracker responsible, there isn’t much they can do. By knowing the publishers on which the technique is used, they can voice their frustrations or choose to visit alternative sites. Publishes, which have in interest in keeping users, will then have an incentive to change their practices.

The technique wasn’t only news to users, even some site owners were unaware that it was being used on their sites. ProPublica updated their original story with a message from YouPorn stating, “[the website was] completely unaware that AddThis contained a tracking software…”, and had since removed it. This shows that measurement work can even help remove the information asymmetry between trackers and the sites upon which they track.

How are things now?

In a re-run of the measurements in January 2016 [5], I’ve observed that the number of distinct trackers utilizing canvas fingerprinting has more than doubled since our 2014 measurement. While the overall number of publisher sites on which the tracking occurs is still below that of our previous measurement, the use of the technique has at least partially revived since AddThis and Ligatus stopped the practice.

This made me curious if we see similar trends for other tracking techniques. In our 2014 paper we also studied cookie respawning [6]. This technique was well studied in the past, both in 2009 and 2011, making it a good candidate to analyze the longitudinal effects of measurement.  As is the case with our measurement, these studies also received a bit of press coverage when released.

The 2009 study, which found HTTP cookie respawning on 6 of the top 100 sites, resulted in a $2.4 million settlement. The 2011 follow-up study found that the use of respawning decreased to just 2 sites in the top 100, and likewise resulted in a $500 thousand settlement. In 2014 we observed respawning on 7 of the top 100 sites, however none of these sites or trackers were US-based entities. This suggests that lawsuits can have an impact, but that impact may be limited by the global nature of the web.

What we’ve learned

Providing transparency into privacy violations online has the potential for huge impact. We saw users unhappy with the trackers that use canvas fingerprinting, with the sites that include those trackers, and even with the browsers they use to visit those sites. It is important that studies visit a large number of sites, and list those on which the privacy violation occurs.

The pressure of transparency affects the larger players more than the long tail. A tracker which is present on a large number of sites, or is present on sites which receive more traffic is more likely to be the focus of news articles or subject to lawsuits. Indeed, our 2016 measurements support it: we’ve seen a large increase in the number of parties involved, but the increase is limited to parties with a much smaller presence.

In the absence of a lawsuit, policy change, or technical solution, we see that canvas fingerprinting use is beginning to grow again. Without constant monitoring and transparency, level of privacy violations can easily creep back to where they were. A single, well-connected tracker can re-introduce a tracking technique to a large number of first-parties.

The developer community will help, we just need to provide them with the data they need. Our detection methodology served as the foundation for blocking tools, which intercept the same calls we used for detection. The script lists we included in our paper and on our website were incorporated into current blocklists.

In a follow-up post, I’ll discuss the work we’re doing to make regular, large scale measurements of web tracking a reality. I’ll show how the tools we’ve built make it possible to run automated, million site crawls can run every month, and I’ll introduce some new results we’re planning to release.

 

[1] The paper’s website provides a short description of each of these techniques.

[2] See: the EFF’s Panopticlick, and academic papers Cookieless Monster and FPDetective.

[3] For example, adding noise to canvas readouts has the potential to cause problems for non-tracking use cases and can still be defeated by a determined tracker. The Tor Browser’s solution of prompting the user on certain canvas calls does work, however it requires a user’s understanding that the technique can be used for tracking and provides for a less than optimal user experience.

[4] For a nice discussion of information asymmetry and the web: Privacy and the Market for Lemons, or How Websites Are Like Used Cars

[5] These measurements were run using the canvas instrumentation portion of OpenWPM.

[6] For a detailed description of cookie respawning, I suggest reading through Ashkan Soltani’s blog post on the subject.

Thanks to Arvind Narayanan for his helpful comments.

Comments

  1. First, good work on discovering, investigating, and following up on these tracking mechanisms. I don’t think there’s anywhere near enough people working on these issues, especially compared to the number of developers at companies laser-focused on finding better ways to track users in order to “monetize” them for advertising purposes.

    Second, you wrote “However a general technical solution which preserves the API’s usefulness and usability doesn’t exist”. I find this inaccurate; it’s more that any exploration of the possible technical solutions has been limited at best, and both the browser developers and standards bodies are not particularly interested in reversing course on matters of this nature. Or in short, features take priority over privacy and security.

    It’s my understanding that the Canvas tracking relies principally on observing differences in two rendering processes, that of fonts and rasterized 3D scenes.

    Font rendering differs due to the combination of a few factors: (1) scalable vector fonts are used, (2) browsers outsource the entire font rendering algorithm to a platform/OS dependent library, (3) said libraries have a lot of configuration parameters that are sometimes changed by software distributors, users, or both. It’s feels pretty clear that if you invalidate any one of these premises, you can drastically reduce or eliminate the tracking potential of font rendering. Either use raster fonts, or use the same font renderer across all platforms, or decide at the browser/application level what fonts really should look like by default instead of leaving it to diverse user reconfiguration.

    As for the 3D information leak, consider me a skeptic of including any 3D rendering/processing in web browsers. There have been multiple serious security breaches in WebGL and closely related APIs in the few years that they’ve existed already. Giving remote code running in a web browser pretty much any access to the extremely complicated GPU hardware is going to result in security flaws and information leaks. By analogy, it’s not far off from giving web browsers the ability to execute native machine code, except worse, because CPUs are simpler and more regular in design than GPUs. That’s not even getting into the massive size of the DirectX/OpenGL stacks and the complexity of the display driver itself.