May 20, 2018

Website operators are in the dark about privacy violations by third-party scripts

by Steven Englehardt, Gunes Acar, and Arvind Narayanan.

Recently we revealed that “session replay” scripts on websites record everything you do, like someone looking over your shoulder, and send it to third-party servers. This en-masse data exfiltration inevitably scoops up sensitive, personal information — in real time, as you type it. We released the data behind our findings, including a list of 8,000 sites on which we observed session-replay scripts recording user data.

As one case study of these 8,000 sites, we found health conditions and prescription data being exfiltrated from walgreens.com. These are considered Protected Health Information under HIPAA. The number of affected sites is immense; contacting all of them and quantifying the severity of the privacy problems is beyond our means. We encourage you to check out our data release and hold your favorite websites accountable.

Student data exfiltration on Gradescope

As one example, a pair of researchers at UC San Diego read our study and then noticed that Gradescope, a website they used for grading assignments, embeds FullStory, one of the session replay scripts we analyzed. We investigated, and sure enough, we found that student names and emails, student grades, and instructor comments on students were being sent to FullStory’s servers. This is considered Student Data under FERPA (US educational privacy law). Ironically, Princeton’s own Information Security course was also affected. We notified Gradescope of our findings, and they removed FullStory from their website within a few hours.

You might wonder how the companies’ privacy policies square with our finding. As best as we can tell, Gradescope’s Terms of Service actually permit this data exfiltration [1], which is a telling comment about the ineffectiveness of Terms of Service as a way of regulating privacy.

FullStory’s Terms are a different matter, and include a clause stating: “Customer agrees that it will not provide any Sensitive Data to FullStory.” We argued previously that this repudiation of responsibility by session-replay scripts puts website operators in an impossible position, because preventing data leaks might require re-engineering the site substantially, negating the core value proposition of these services, which is drag-and-drop deployment. Interestingly, Gradescope’s CEO told us that they were not aware of this requirement in FullStory’s Terms, that the clause had not existed when they first signed up for FullStory, and that they (Gradescope) had not been notified when the Terms changed. [2]

Web publishers kept in the dark

Of the four websites we highlighted in our previous post and this one (Bonobos, Walgreens, Lenovo, and Gradescope), three have removed the third-party scripts in question (all except Lenovo). As far as we can tell, no publisher (website operator) was aware of the exfiltration of sensitive data on their own sites until our study. Further, as mentioned above, Gradescope was unaware of key provisions in FullStory’s Terms of Service. This is a pattern we’ve noticed over and over again in our six years of doing web privacy research.

Worse, in many cases the publisher has no direct relationship with the offending third-party script. In Part 2 of our study we examined two third-party scripts which exploit a vulnerability in browsers’ built-in password managers to exfiltrate user identities. One web developer was unable to determine how the script was loaded and asked us for help. We pointed out that their site loaded an ad network (media-clic.com), which in turn loaded “themoneytizer.com”, which finally loaded the offending script from Audience Insights. These chains of redirects are ubiquitous on the web, and might involve half a dozen third parties. On some websites the majority of third parties have no direct relationship with the publisher.
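To illustrate how such a chain can be traced, here is a minimal sketch in Python. It assumes a crawler has already recorded, for each loaded script, the URL that caused it to load. The record format, the function name `inclusion_chain`, the script paths, and the `.example` hostnames (including the stand-in for the Audience Insights script) are hypothetical and are not taken from our measurement tools.

```python
from urllib.parse import urlsplit


def host(url):
    """Return the hostname of a URL, or an empty string if it has none."""
    return urlsplit(url).hostname or ""


def inclusion_chain(initiators, target_url):
    """Walk initiator links back from target_url to the top-level page.

    `initiators` maps each loaded URL to the URL that caused it to load
    (None for the top-level page). Returns hostnames ordered from the
    page down to the target.
    """
    chain = []
    seen = set()  # guards against cycles in malformed logs
    url = target_url
    while url is not None and url not in seen:
        seen.add(url)
        chain.append(host(url))
        url = initiators.get(url)
    chain.reverse()
    return chain


# Hypothetical crawl log mirroring the chain described above.
initiators = {
    "https://publisher.example/": None,
    "https://media-clic.com/ad.js": "https://publisher.example/",
    "https://themoneytizer.com/tag.js": "https://media-clic.com/ad.js",
    "https://audience-insights.example/collect.js": "https://themoneytizer.com/tag.js",
}

chain = inclusion_chain(initiators, "https://audience-insights.example/collect.js")
print(" -> ".join(chain))
```

In this sketch, every host after the second entry in the chain is a third party that the publisher never chose directly.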

Most of the advertising and analytics industry is premised on keeping not just users but also website operators in the dark about privacy violations. Indeed, the effort required by website operators to fully audit third parties would negate much of the benefit of offloading tasks to them. The ad tech industry creates a tremendous negative externality in terms of the privacy cost to users.

Can we turn the tables?

The silver lining is that if we can explain to web developers what third parties are doing on their sites, and empower them to take control, that might be one of the most effective ways to improve web privacy. But any such endeavor should keep in mind that web publishers everywhere are on tight budgets and may not have much privacy expertise.

To make things concrete, here’s a proposal for how to achieve this kind of impact:

  • Create a 1-pager summarizing the bare minimum that website operators need to know about web security, privacy, and third parties, with pointers to more information.
  • Create a tailored privacy report for each website based on data that is already publicly available through various sources including our own data releases.
  • Build open-source tools for website operators to scan their own sites [3]. Ideally, the tool should make recommendations for privacy-protecting changes based on the known behavior of third parties.
  • Reach out to website operators to provide information and help make changes. This step doesn’t scale, but is crucial.
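As a rough illustration of the scanning tool in the third item above, the following Python sketch lists the third-party hosts that a page's markup loads scripts from. It is only a starting point under stated assumptions: the class and function names, the sample page, and the `.example` hostnames are hypothetical; the first-party check is a crude hostname suffix comparison rather than a proper registrable-domain lookup; and a static scan like this cannot see scripts that other scripts load at runtime, which is exactly how indirect third parties arrive.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class ScriptCollector(HTMLParser):
    """Collect absolute URLs of external scripts referenced in a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.script_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:  # inline scripts have no src and are skipped
                self.script_urls.append(urljoin(self.page_url, src))


def third_party_hosts(page_url, html):
    """Return sorted script hosts that are neither the page's host nor a subdomain of it."""
    collector = ScriptCollector(page_url)
    collector.feed(html)
    page_host = urlsplit(page_url).hostname
    hosts = {urlsplit(u).hostname for u in collector.script_urls}
    return sorted(
        h for h in hosts
        if h and h != page_host and not h.endswith("." + page_host)
    )


# Hypothetical page used only to exercise the sketch.
PAGE_URL = "https://site.example/grades"
SAMPLE_HTML = """
<html><head>
<script src="/static/app.js"></script>
<script src="https://cdn.site.example/lib.js"></script>
<script src="//tracker.example/replay.js"></script>
<script>var inline = 1;</script>
</head><body></body></html>
"""

print(third_party_hosts(PAGE_URL, SAMPLE_HTML))
```

A real tool would more likely drive a browser and observe network requests, then compare the hosts it finds against published data on known third-party behavior before making recommendations.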

If you’re interested in working with us on this, we’d love to hear from you!

Endnotes

We are grateful to UCSD researchers Dimitar Bounov and Sorin Lerner for bringing the vulnerabilities on Gradescope.com to our attention.

[1] Gradescope’s terms of use state: “By submitting Student Data to Gradescope, you consent to allow Gradescope to provide access to Student Data to its employees and to certain third party service providers which have a legitimate need to access such information in connection with their responsibilities in providing the Service.”

[2] The Wayback Machine does not archive FullStory’s Terms page far enough back in time for us to independently verify Gradescope’s statement, nor does FullStory appear in ToSBack, the EFF’s terms-of-service tracker.

[3] Privacyscore.org is one example of a nascent attempt at such a tool.

Comments

  1. Browsers usually have a master JavaScript toggle, and sometimes a way to whitelist sites. But I don’t see any way to turn off just third-party JavaScript (though there is usually an option to block third-party cookies). Is there a reason for this?

    • Andrew McConachie says:

      I use the Noscript plugin for Firefox which allows this. I also still run Firefox 56 specifically because Noscript on FF 57 isn’t fully featured yet.

      On another note I am continually amazed at how crappy web development practices are. I am not a full time web dev and never have been, mostly because I find the act of developing for the modern web miserable and unfulfilling. However, when I have developed web pages, either for myself or because someone paid me to do so, I have never drawn scripts from other domains. I always make sure to copy any needed JS scripts from their hosted websites to my own website and serve them from there. This I do if only for the simple reason of version control. As a developer, if you’re pulling code from 20 different sites how do you know what any of them will do? How can you possibly control that or know when something will break?

      The privacy violations mentioned in this article are just the tip of the iceberg when it comes to the loading of 3rd party scripts. There are more basic quality problems with this approach to development that result in web pages just not working sometimes. Put simply, web development is just crap, it keeps getting worse, and there is very little we can do about it.

    • I do not see any technical or other reason why browsers should not provide an “execute third-party JS” option, the same way they do with cookies. Good website programming practice should be to load all scripts essential for a page to work from the original website. As the authors suggest elsewhere, website admins lack awareness of non-technical issues such as privacy and focus on “getting things to work”, sacrificing everything else. Unfortunately the whole Internet is based on whether things can be done, ignoring the future repercussions.

      In the end, it becomes an issue of national policies and enforcement. If one needs to make a credit card transaction and must enable JavaScript execution to do so, the choice is between getting things done and having one’s session recorded, or paying a late fee, for example.

      Given that we live in the digital Wild Wild Wild West, I cannot help but be pessimistic about national and international policy enforcement.

      The Princeton TAP approach and civil society lobbying against bad practices are the most effective tools for now. Institutions will need to conform to ethical standards if they want to keep customers.

      Martin

  2. Seems like it would be a useful feature, although there would need to be a whitelist for popular CDNs like ajax.googleapis.com, code.jquery.com, and cdnjs.com. I’ll note that this very freedom-to-tinker page loads New Relic which is a super invasive drag on performance, although many ad blockers and privacy filters are astute enough to remove it. This page also loads Google Analytics but that is a super useful tool for web authors to use in understanding their audience so I have no problem with it being used.

    • Arvind Narayanan says:

      Ha! Thank you for pointing this out. We don’t run Freedom to Tinker ourselves but we’ll get in touch with the people who do.