August 6, 2020

Archives for July 2020

Can the exfiltration of personal data by web trackers be stopped?

by Günes Acar, Steven Englehardt, and Arvind Narayanan.

In a series of posts on this blog in 2017/18, we revealed how web trackers exfiltrate personal information from web pages, browser password managers, form inputs, and the Facebook Login API. Our findings resulted in many fixes and privacy improvements to browsers, websites, third parties, and privacy protection tools. However, the root causes of these privacy failures remain unchanged, because they stem from the design of the web itself. In a paper at the 2020 Privacy Enhancing Technologies Symposium, we recap our findings and propose two potential paths forward.

Root causes of failures

In the two years since our research, no fundamental fixes have been implemented that will stop third parties from exfiltrating personal information. Thus, it remains possible that similar privacy vulnerabilities have been introduced even as the specific ones we identified have mostly been fixed. Here’s why.

The web’s security model allows a website to either fully trust a third party (by including the third-party script in a first party context) or not at all (by isolating the third party resource in an iframe). Unfortunately, this model does not capture the range of trust relationships that we see on the web. Since many third parties cannot provide their services (such as analytics) if they are isolated, the first parties have no choice but to give them full privileges even though they do not trust them fully.

The model also assumes transitivity of trust. But in reality, a user may trust a website and the website may trust a third party, but the user may not trust the third party. 

Another root cause is the economic reality of the web that drives publishers to unthinkingly adopt code from third parties. Here’s a representative quote from the marketing materials of FullStory, one of the third parties we found exfiltrating personal information:

Set-up is a thing of beauty. To get started, place one small snippet of code on your site. That’s it. Forever. Seriously.

A key reason for the popularity of these third parties is that the small publishers that rely on them lack the budget to hire technical experts internally to replicate their functionality. Unfortunately, while it’s possible to effortlessly embed a third party, auditing or even understanding the privacy implications requires time and expertise.

This may also help explain the limited adoption of technical solutions such as Caja or ConScript that better capture the partial trust relationship between first and third parties. Unfortunately, such solutions require the publisher to reason carefully about the privileges and capabilities provided to each individual third party. 

Two potential paths forward

In the absence of a fundamental fix, there are two potential approaches to minimize the risk of large-scale exploitation of these vulnerabilities. One approach is to regularly scan the web. After all, our work suggests that this kind of large-scale detection is possible and that vulnerabilities will be fixed when identified.

The catch is that it’s not clear who will do the thankless, time-consuming, and technically complex work of running regular scans, detecting vulnerabilities and leaks, and informing thousands of affected parties. As researchers, our role is to develop the methods and tools to do so, but running them on a regular basis does not constitute publishable research. While we have undertaken significant efforts to maintain our scanning tool OpenWPM as a resource for the community and make scan data available, regular and comprehensive vulnerability detection requires far more resources than we can afford.

In fact, in addition to researchers, many other groups have missions that are aligned with the goal of large-scale web privacy investigations: makers of privacy-focused browsers and mobile apps, blocklist maintainers, advocacy groups, and investigative journalists. Some examples of efforts that we’re aware of include Mozilla’s maintenance of OpenWPM since 2018, Brave’s research on tracking protection, and DuckDuckGo’s dataset about trackers. However, so far, these efforts fall short of addressing the full range of web privacy vulnerabilities, and data exfiltration in particular. Perhaps a coalition of these groups can create the necessary momentum.

The second approach is legal, namely, regulation that incentivizes first parties to take responsibility for the privacy violations on their websites. This is in contrast to the reactive approach above which essentially shifts the cost of privacy to public-interest groups or other external entities.

Indeed, our findings represent potential violations of existing laws, notably the GDPR in the EU and sectoral laws such as HIPAA (healthcare) and FERPA (education) in the United States. However, it is unclear whether the first parties or the third parties would be liable. According to several session replay companies, they are merely data processors and not joint controllers under the GDPR. In addition, since there are a large number of first parties exposing user data and each violation is relatively small in scope, regulators have not paid much attention to these types of privacy failures. In fact, they risk effectively normalizing such privacy violations due to the lack of enforcement. Thus, either stepped-up enforcement of existing laws or new laws that establish stricter rules could shift incentives, resulting in stronger preventive measures and regular vulnerability scanning by web developers themselves. 

Either approach will be an uphill battle. Regularly scanning the web requires a source of funding, whereas stronger regulation and enforcement will raise the cost of running websites and will face opposition from third-party service providers reliant on the current lax approach. But fighting uphill battles is nothing new for privacy advocates and researchers.

Update: Here’s the video of Günes Acar’s talk about this research at PETS 2020.