June 21, 2018

No boundaries: Exfiltration of personal data by session-replay scripts

This is the first post in our “No Boundaries” series, in which we reveal how third-party scripts on websites have been extracting personal information in increasingly intrusive ways. [0]
by Steven Englehardt, Gunes Acar, and Arvind Narayanan

Update: we’ve released our data — the list of sites with session-replay scripts, and the sites where we’ve confirmed recording by third parties.

You may know that most websites have third-party analytics scripts that record which pages you visit and the searches you make. But lately, more and more sites use “session replay” scripts. These scripts record your keystrokes, mouse movements, and scrolling behavior, along with the entire contents of the pages you visit, and send them to third-party servers. Unlike typical analytics services that provide aggregate statistics, these scripts are intended for the recording and playback of individual browsing sessions, as if someone were looking over your shoulder.

The stated purpose of this data collection includes gathering insights into how users interact with websites and discovering broken or confusing pages. However, the extent of data collected by these services far exceeds user expectations [1]: text typed into forms is collected before the user submits the form, and precise mouse movements are saved, all without any visual indication to the user. This data can’t reasonably be expected to be kept anonymous. In fact, some companies allow publishers to explicitly link recordings to a user’s real identity.

For this study we analyzed seven of the top session replay companies (based on their relative popularity in our measurements [2]). The services studied are Yandex, FullStory, Hotjar, UserReplay, Smartlook, Clicktale, and SessionCam. We found these services in use on 482 of the Alexa top 50,000 sites.


This video shows the “co-browse” feature of one company, where the publisher can watch user sessions live.

What can go wrong? In short, a lot.

Collection of page content by third-party replay scripts may cause sensitive information such as medical conditions, credit card details and other personal information displayed on a page to leak to the third-party as part of the recording. This may expose users to identity theft, online scams, and other unwanted behavior. The same is true for the collection of user inputs during checkout and registration processes.

The replay services offer a combination of manual and automatic redaction tools that allow publishers to exclude sensitive information from recordings. However, in order for leaks to be avoided, publishers would need to diligently check and scrub all pages which display or accept user information. For dynamically generated sites, this process would involve inspecting the underlying web application’s server-side code. Further, this process would need to be repeated every time a site is updated or the web application that powers the site is changed.

A thorough redaction process is actually a requirement for several of the recording services, which explicitly forbid the collection of user data. This negates the core premise of these session replay scripts, which market themselves as plug and play. For example, Hotjar’s homepage advertises: “Set up Hotjar with one script in a matter of seconds” and Smartlook’s sign-up procedure features their script tag next to a timer with the tagline “every minute you lose is a lot of video”.

To better understand the effectiveness of these redaction practices, we set up test pages and installed replay scripts from six of the seven companies [3]. From the results of these tests, as well as an analysis of a number of live sites, we highlight four types of vulnerabilities below:

1. Passwords are included in session recordings. All of the services studied attempt to prevent password leaks by automatically excluding password input fields from recordings. However, mobile-friendly login boxes that use text inputs to store unmasked passwords are not redacted by this rule, unless the publisher manually adds redaction tags to exclude them. We found at least one website where the password entered into a registration form leaked to SessionCam, even if the form is never submitted.

2. Sensitive user inputs are redacted in a partial and imperfect way. As users interact with a site they will provide sensitive data during account creation, while making a purchase, or while searching the site. Session recording scripts can use keystroke or input element loggers to collect this data.

All of the companies studied offer some mitigation through automated redaction, but the coverage offered varies greatly by provider. UserReplay and SessionCam replace all user input with an equivalent length masking text, while FullStory, Hotjar, and Smartlook exclude specific input fields by type. We summarize the redaction of other fields in the table below.

summary of automated redaction features offered by each service

Summary of the automated redaction features for form inputs enabled by default from each company.
Filled circle: Data is excluded; Half-filled circle: equivalent length masking; Empty circle: Data is sent in the clear
* UserReplay sends the last 4 digits of the credit card field in plain text
† Hotjar masks the street address portion of the address field.

 

Automated redaction is imperfect; fields are redacted by input element type or heuristics, which may not always match the implementation used by publishers. For example, FullStory redacts credit card fields with the `autocomplete` attribute set to `cc-number`, but will collect any credit card numbers included in forms without this attribute.
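To make the failure mode concrete, here is a minimal Python sketch of attribute-based redaction. It is not FullStory’s or any other vendor’s actual logic: the two rule sets, the `audit` helper, and the form field names are all assumptions chosen for illustration. It only shows how a field that lacks the expected `type` or `autocomplete` hint falls through such rules.

```python
from html.parser import HTMLParser

# Hypothetical rule sets for illustration only; not any vendor's real rules.
REDACTED_TYPES = {"password"}
REDACTED_AUTOCOMPLETE = {"cc-number"}


class InputAudit(HTMLParser):
    """Collects each <input> and whether the attribute rules would redact it."""

    def __init__(self):
        super().__init__()
        self.results = {}  # field name -> True if the rules redact it

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        a = dict(attrs)
        input_type = (a.get("type") or "text").lower()
        autocomplete = (a.get("autocomplete") or "").lower()
        redacted = (
            input_type in REDACTED_TYPES
            or autocomplete in REDACTED_AUTOCOMPLETE
        )
        self.results[a.get("name") or "?"] = redacted


def audit(html):
    """Return {field name: redacted?} for every <input> in the markup."""
    parser = InputAudit()
    parser.feed(html)
    return parser.results


# Hypothetical form: two fields carry the expected hints, two do not.
FORM = """
<form>
  <input type="password" name="pw">
  <input type="text" name="pw_visible">
  <input type="text" name="card" autocomplete="cc-number">
  <input type="text" name="card_no_hint">
</form>
"""

for name, redacted in audit(FORM).items():
    print(name, "redacted" if redacted else "SENT IN THE CLEAR")
```

In this sketch the unmasked password box and the credit card field without the `autocomplete` hint are both recorded, even though they hold the same data as the fields that are redacted.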

Credit card data leaking on Bonobos checkout page

The account page of the clothing store Bonobos leaks full credit card details to FullStory. The screenshot of Chrome’s network inspector shows the leaked data being sent letter-by-letter as it is typed. The user’s full credit card number, expiration date, CVV number, name, and billing address are leaked on this page. Email addresses and gift card numbers are among the other types of data leaked on the Bonobos site.

To supplement automated redaction, several of the session recording companies, including Smartlook, Yandex, FullStory, SessionCam, and Hotjar, allow sites to further specify input elements to be excluded from the recording. To effectively deploy these mitigations, a publisher will need to actively audit every input element to determine if it contains personal data. This is complicated, error-prone, and costly, especially as a site or the underlying web application code changes over time. For instance, the financial service site fidelity.com has several redaction rules for Clicktale that involve nested tables and child elements referenced by their index. In the next section we further explore these challenges.

A safer approach would be to mask or redact all inputs by default, as is done by UserReplay and SessionCam, and allow whitelisting of known-safe values. Even fully masked inputs provide imperfect protection. For example, the masking used by UserReplay and Smartlook leaks the length of the user’s password.
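The mask-by-default idea, and the length leak that remains with equivalent-length masking, can be sketched in a few lines of Python. The function names, the allowlist parameter, and the sample values are hypothetical; none of this is taken from a recording service’s code.

```python
def record_field(name, value, allowlist=frozenset(), mask_char="*"):
    """Return what a recorder would transmit for one form field.

    Every field is masked unless its name is explicitly allowlisted.
    The mask has the same length as the input, which is what leaks.
    """
    if name in allowlist:
        return value
    return mask_char * len(value)


def record_field_fixed(name, value, allowlist=frozenset()):
    """Variant that sends a constant placeholder, hiding the length too."""
    if name in allowlist:
        return value
    return "[redacted]"


masked = record_field("password", "hunter2")
print(masked, len(masked))                       # characters hidden, length still visible
print(record_field_fixed("password", "hunter2"))  # nothing about the value is sent
print(record_field("search", "shoes", allowlist={"search"}))
```

The fixed placeholder costs the publisher some replay fidelity (the playback can no longer show how much was typed), which is presumably why equivalent-length masking is attractive to these services.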

3. Manual redaction of personally identifying information displayed on a page is a fundamentally insecure model. In addition to collecting user inputs, the session recording companies also collect rendered page content. Unlike user input recording, none of the companies appear to provide automated redaction of displayed content by default; all displayed content in our tests ended up leaking.

Instead, session recording companies expect sites to manually label all personally identifying information included in a rendered page. Sensitive user data has a number of avenues to end up in recordings, and small leaks over several pages can lead to a large accumulation of personal data in a single session recording.

For recordings to be completely free of personal information, a site’s web application developers would need to work with the site’s marketing and analytics teams to iteratively scrub personally identifying information from recordings as it’s discovered. Any change to the site design, such as a change in the class attribute of an element containing sensitive information or a decision to load private data into a different type of element requires a review of the redaction rules.
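The fragility of class-based manual redaction can be illustrated with a small Python sketch. The recorder below is a toy stand-in, not any vendor’s implementation: the `RedactingRecorder` class, the `patient-name` rule, and the sample markup are assumptions made up for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class RedactingRecorder(HTMLParser):
    """Collects visible text, hiding text inside elements with an excluded class."""

    def __init__(self, excluded_classes):
        super().__init__()
        self.excluded = set(excluded_classes)
        self.stack = []     # one bool per open element: is it (or an ancestor) excluded?
        self.captured = []  # what would end up in the recording

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        inherited = bool(self.stack and self.stack[-1])
        self.stack.append(inherited or bool(classes & self.excluded))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            hidden = bool(self.stack and self.stack[-1])
            self.captured.append("[redacted]" if hidden else text)


def record(html, excluded_classes):
    recorder = RedactingRecorder(excluded_classes)
    recorder.feed(html)
    return recorder.captured


RULES = {"patient-name"}  # hypothetical redaction rule written by the publisher
before = '<div><span class="patient-name">Jane Doe</span><span>Refill ready</span></div>'
after = '<div><span class="customer-name">Jane Doe</span><span>Refill ready</span></div>'

print(record(before, RULES))  # the name is redacted
print(record(after, RULES))   # same rules after a class rename: the name is captured
```

Nothing in the second page is "wrong" from a design standpoint; a routine class rename is enough to silently void the redaction rule.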

As a case study, we examine the pharmacy section of Walgreens.com, which embeds FullStory. Walgreens makes extensive use of manual redaction for both displayed and input data. Despite this, we find that sensitive information, including medical conditions and prescriptions, is leaked to FullStory alongside the names of users.

Walgreens prescription request page leaks prescription information

The above image shows a prescription request for the anti-depressant drug, Zoloft. During the process of creating the request, the name of the prescribed drug is leaked to FullStory [4]. Manual redaction was used to exclude the user’s name, their doctor’s name, and the quantity of medicine from the recording (marked in the image by a striped overlay). However, the user’s full name was leaked earlier in the process (not shown in this image), which allows anyone with access to the recording to associate this prescription with the user’s real identity.

Walgreens health history page leaks health conditions

Walgreens allows users to enter their “Health History”, which can include other prescriptions and health conditions that may be relevant to prescription requests. During this process, most of the user’s personal and health information is excluded from FullStory’s recording through manual redaction. However, the process leaks the selected medicine and health conditions, the latter of which is shown above.

Walgreens identity verification page leaks answers to questions

During account signup, Walgreens requires a user to verify their identity by asking a standard set of identity verification questions. The selection options for these questions, which may reveal the user’s personal information, are displayed on the page and are transferred to FullStory. Additionally, the mouse tracking feature of FullStory will likely reveal the user’s selection, even though the radio button selection is redacted. The inclusion of this data in recordings directly contradicts the statement at the top of the page: “Walgreens does not retain this data and cannot access or view your answers”.

We do not present the above examples to point fingers at a particular website. Instead, we aim to show that the redaction process can fail even for a large publisher with a strong legal incentive to protect user data. We observed similar personal information leaks on other websites, including on the checkout pages of Lenovo [5]. Sites with fewer resources or less expertise are even more likely to fail.

4. Recording services may fail to protect user data. Recording services increase the exposure to data breaches, as personal data will inevitably end up in recordings. These services must handle recording data with the same security practices with which a publisher would be expected to handle user data.

We provide a specific example of how recording services can fail to do so. Once a session recording is complete, publishers can review it using a dashboard provided by the recording service. The publisher dashboards for Yandex, Hotjar, and Smartlook all deliver playbacks within an HTTP page, even for recordings which take place on HTTPS pages. This allows an active man-in-the-middle to inject a script into the playback page and extract all of the recording data. Worse yet, Yandex and Hotjar deliver the publisher page content over HTTP, so data that was previously protected by HTTPS is now vulnerable to passive network surveillance.

The vulnerabilities we highlight above are inherent to full-page session recording. That’s not to say the specific examples can’t be fixed — indeed, the publishers we examined can patch their leaks of user data and passwords. The recording services can all use HTTPS during playbacks. But as long as the security of user data relies on publishers fully redacting their sites, these underlying vulnerabilities will continue to exist.

Does tracking protection help?

Two commonly used ad-blocking lists, EasyList and EasyPrivacy, do not block FullStory, Smartlook, or UserReplay scripts. EasyPrivacy has filter rules that block Yandex, Hotjar, Clicktale, and SessionCam.

At least one of the seven companies we studied (UserReplay) allows publishers to disable data collection from users who have Do Not Track (DNT) set in their browsers. We scanned the configuration settings of the Alexa top 1 million publishers using UserReplay on their homepages, and found that none of them chose to honor the DNT signal.
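For readers unfamiliar with the signal: a browser with Do Not Track enabled sends the request header `DNT: 1`. Honoring it amounts to a check like the one in this Python sketch. The `should_record` function and its `honor_dnt` switch are hypothetical and do not reflect UserReplay’s actual configuration format.

```python
def should_record(request_headers, honor_dnt=True):
    """Decide whether to start a session recording for this request.

    request_headers: mapping of HTTP header names to values.
    honor_dnt: hypothetical publisher-side switch; when False the
               DNT header is ignored entirely.
    """
    # Header names are case-insensitive, so normalize before the lookup.
    headers = {k.lower(): v.strip() for k, v in request_headers.items()}
    if honor_dnt and headers.get("dnt") == "1":
        return False
    return True


print(should_record({"DNT": "1"}))                   # user opted out, setting honored
print(should_record({"DNT": "1"}, honor_dnt=False))  # user opted out, setting ignored
print(should_record({}))                             # no preference expressed
```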

Improving user experience is a critical task for publishers. However, it shouldn’t come at the expense of user privacy.


End notes:

[0] We use the term ‘exfiltrate’ in this series to refer to the third-party data collection that we study. The term ‘leakage’ is sometimes used, but we eschew it, because it suggests an accidental collection resulting from a bug. Rather, our research suggests that while not necessarily malicious, the collection of sensitive personal data by the third parties that we study is inherent in their operation and is well known to most if not all of these entities. Further, there is an element of furtiveness; these data flows are not public knowledge and neither publishers nor third parties are transparent about them.

[1] A recent analysis of the company Navistone, completed by Hill and Mattu for Gizmodo, explores how data collection prior to form submission exceeds user expectations. In this study, we show how analytics companies collect far more user data with minimal disclosure to the user. In fact, some services suggest the first party sites simply include a disclaimer in their site’s privacy policy or terms of service.

[2] We used OpenWPM to crawl the Alexa top 50,000 sites, visiting the homepage and 5 additional internal pages on each site. We use a two-step approach to detect analytics services which collect page content.

First, we inject a unique value into the HTML of the page and search for evidence of that value being sent to a third party in the page traffic. To detect values that may be encoded or hashed we use a detection methodology similar to previous work on email tracking. After filtering out leak recipients, we isolate pages on which at least one third party receives a large amount of data during the visit, but for which we do not detect a unique ID. On these sites, we perform a follow-up crawl which injects a 200KB chunk of data into the page and check if we observe a corresponding bump in the size of the data sent to the third party.
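The first step can be sketched roughly as follows in Python. This is a simplified illustration, not our crawler’s code: it checks only a single layer of a few common encodings and hashes, and the `candidate_forms` and `find_marker` names, the marker string, and the example URL are all made up for the example.

```python
import base64
import hashlib
import urllib.parse


def candidate_forms(marker):
    """Map an encoding name to the marker as it would appear under that encoding."""
    raw = marker.encode("utf-8")
    return {
        "plain": marker,
        "urlencoded": urllib.parse.quote(marker, safe=""),
        "base64": base64.b64encode(raw).decode("ascii"),
        "hex": raw.hex(),
        "md5": hashlib.md5(raw).hexdigest(),
        "sha1": hashlib.sha1(raw).hexdigest(),
        "sha256": hashlib.sha256(raw).hexdigest(),
    }


def find_marker(marker, payload):
    """Return the names of the encodings under which marker occurs in payload."""
    forms = candidate_forms(marker)
    return sorted(name for name, form in forms.items() if form in payload)


# Hypothetical marker injected into the page, and a hypothetical outgoing request.
marker = "m-7f3a@example.test"
request = "https://collector.example/c?d=" + base64.b64encode(marker.encode()).decode()

print(find_marker(marker, request))           # found under base64
print(find_marker(marker, "unrelated data"))  # not found
```

A substring search like this misses markers that were encoded as part of a larger string (base64 output depends on byte alignment), compressed, or wrapped in several layers of encoding, which is one reason a second, size-based step is needed.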

We found 482 sites on which either the unique marker was leaked to a collection endpoint from one of the services or on which we observed a data collection increase roughly equivalent to the compressed length of the injected chunk. We believe this value is a lower bound since many of the recording services offer the ability to sample page visits, which is compounded by our two-step methodology.
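The size comparison in the second step can be approximated like this in Python. It is a sketch under stated assumptions: we assume zlib-style compression on the third party’s side, and the `tolerance` parameter and the byte counts in the example are invented for illustration, not values from our measurement.

```python
import zlib


def compressed_len(data):
    """Size in bytes of data after zlib compression at the default level."""
    return len(zlib.compress(data))


def looks_like_page_capture(baseline_bytes, followup_bytes, injected_chunk,
                            tolerance=0.25):
    """True if the extra bytes sent roughly match the chunk's compressed size.

    baseline_bytes: bytes sent to the third party on the original visit.
    followup_bytes: bytes sent on the visit with the chunk injected.
    tolerance: hypothetical allowed relative deviation from the expected bump.
    """
    expected = compressed_len(injected_chunk)
    increase = followup_bytes - baseline_bytes
    return abs(increase - expected) <= tolerance * expected


# Hypothetical chunk of roughly 200KB and hypothetical traffic sizes.
chunk = bytes(range(256)) * 800
expected = compressed_len(chunk)

print(looks_like_page_capture(50_000, 50_000 + expected, chunk))  # bump matches
print(looks_like_page_capture(50_000, 50_010, chunk))             # no real bump
```

Because the expected bump depends on how compressible the injected chunk is, the chunk’s content matters as much as its raw size when choosing a threshold.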

[3] One company (Clicktale) was excluded because we were unable to make the practical arrangements to analyze the script’s functionality at scale.

[4] FullStory’s terms and conditions explicitly classify health or medical information, or any other information covered by HIPAA, as sensitive data and ask customers to “not provide any Sensitive Data to FullStory.”

[5] Lenovo.com is another example of a site which leaks user data in session recordings.

Lenovo's checkout process leaks shipping and payment information.

On the final page of Lenovo’s checkout procedure, the user’s billing, shipping, and payment information is included in the text of the page. This information is thus included in the page source collected by FullStory as part of the recording process.

[6] We used the default scripts available to new accounts for 5 of the 6 providers. For UserReplay, we used a script taken from a live site and verified that the configuration options match the most common options found on the web.

Comments

  1. Just great!

    Definitely want to look into this more. But since you’ve done that already, can a consumer block this?

  2. You probably already know this, but ForeSee does this too. If you ever get one of those “Thanks for visiting” popups on a web store that asks you if you want to take a survey, there’s a good chance that your entire session has been recorded. Choosing to take the survey is considered “consent” for that recording to be retained, even though it’s not mentioned anywhere in the popup. The data is sent to their servers in compressed binary form (zlib?), so it’s not as obvious as with some frameworks.

    • Arvind Narayanan says:

      Thanks for the suggestion! Our methods should be able to catch exfiltration even if the data is compressed, but for the present study we only tested the seven most popular providers and ForeSee wasn’t one of them.

  3. Ford Fisher says:

    This must stop! Now!

  4. Thoughts on Quantum Metric? They seem to be a new startup in the space.

  5. OK.

    What do I do to block all of these? I do my own DNS blocking and would consider writing code to process data passing through a machine.

    Related: What is a good way to identify sites and pages guilty of this?

  6. Jeremy L. Gaddis says:

    Great article, thank you!

    You state in footnote 0:

    “The term ‘leakage’ is sometimes used, but we eschew it, because it suggests an accidental collection resulting from a bug.”

    Yet, unfortunately, you continue to use the word “leak” or “leaked” several times throughout the article. Please replace all occurrences with “exfiltrate”, as used at the beginning of the article, in order to accurately reflect the true nature of what is happening here.

    The data is not leaking. It really is being exfiltrated and sent to third-parties. Using “leaked” minimizes this and makes it appear less severe or serious.

    • Arvind Narayanan says:

      Yes, we agree. We’ll be more careful with the terminology in future posts in this series.

  7. Is there a published paper that goes with this that has additional detail of the process and findings? If so please add a link for where others may download and view it. Especially if your research is publicly funded this would seem a reasonable request. Thank you for considering.

    Your opening line suggests there may only be a series of articles, in trickle feed format that may or may not share all your results; I will watch for those but wanted to ask the above anyway.

    Thanks for helping others understand the extent of their exposure on the web.

  8. Gary Coryer says:

    So, why aren’t these companies being prosecuted for sharing passwords? Isn’t that a violation of Federal Data Protection laws??

  9. So, all my passwords, which the company stores only the hash so it can’t be stolen, are openly stored at some 3rd company tracking DB.
    Awesome (not)

  10. It’s for this sort of reason that I use Ghostery, a browser plugin that detects and optionally blocks third party website scripts. It’s detected as many as 40 third party scripts running on some websites. (Not affiliated with them, just a happy user.)

  11. Justin Jackson says:

    That’s a well-known problem in the EU.
    For example, at the INCIBE Security conference (in Spanish), you can see a demo from 2015 about this here:
    https://youtu.be/X2HE44m8u4A?t=2m26s

    The demo uses Open Web Analytics (open source, also well known, and not in your report).

  12. Who sponsored the study?

    • Steven Englehardt says:

      We are funded by an NSF grant (CNS 1526353). For a portion of this work I was funded by a fellowship from Princeton University. Some of our measurements were funded by an Amazon AWS Cloud Credits for Research grant.

  13. hmoobgolian says:

    Can you provide a list of all known session replay companies and root domain names? I’d like to block all of them through a browser add-on since I can’t opt-out.

  14. IaMaNoNyMoUs says:

    Have you come across IBM Tealeaf, it’s a session replay tool I’ve used in the past but I am wondering why it’s not in your list? I believe it was one of the first and the biggest supplier… I’d love to hear about them or are you not allowed to mention a large industry player like IBM??

    • Steven Englehardt says:

      Thanks for your question. The list of scripts included in the blog post or data release should not be considered comprehensive. There are a number of technical reasons why a script may fail to be included: obfuscated data flows (i.e. data flows in a format we don’t support), page collection size limits, or user sampling. Likewise, our methodology detects page source collection by third parties regardless of the intention, meaning we also discovered a number of non-session-replay scripts using page source for other purposes.

      IBM’s Tealeaf service was detected during our measurement. It, and a number of other parties, will be further analyzed in our upcoming paper.

  15. ManInTheCorner says:

    What about the possible leakage of data from one tab to another? Say you are on a work web site and type in a password while you have a tab open to Walgreens. What are the chances that that data is also being captured?

    • Steven Englehardt says:

      This is not possible. A recording script will only be able to record within the tab in which it’s embedded.

      • By design that may be, but could someone with malicious intent modify a script to watch other tabs?

        • Steven Englehardt says:

          The browser prevents scripts from being able to do this. A malicious extension could do cross-tab monitoring, but that’s out of the scope of what we examined.

  16. 1 – Thanks for watching the big brother(s) who is/are watching us.
    2 – “Speak Your Mind,” you ask. I read all above and … my mind is frozen – I am speechless. I was aware of collecting my personal data but not to such extent.

  17. Scripts from Hell says:

    You put a lot of effort into your report. Thank you! Unfortunately, you do not show any details about how the monitoring is being conducted. Your method of selecting web sites is biased. And details like “200KB” look dilettantish (that should read either “200 kB” or “200 KiB”).

  18. What is the best way for your simple computer user to prevent this? Ghostery has been mentioned above. Would no script do the same?
    Thanks

    • Steven Englehardt says:

      NoScript configured to block all scripts would do the same, since the recording scripts will not run.

      A blocker which includes the EasyPrivacy blocklist, like uBlock Origin, will also block most of the parties mentioned in this post.

  19. Willy Luegenpresse says:

    I find it interesting to see that there are so many “news makers” who are monitoring their readers.

  20. Study sponsored by Google?

    Whilst the free Google Analytics – running on this page – may be relatively benign… Google sells a full-fat version that’s more invasive – https://www.google.com/analytics/analytics/features/

    Anybody worried by such things can install Ghostery or equivalent.

    No experience of others but Yandex Metrica is an excellent analytics tool which gives website owners ability to not record personal data.

    “To prevent information from specified fields from being recorded, set the CSS class -metrika-nokeys for them. This class can be used for marking fields with private information.”

    • Tim Schäfer says:

      “Anybody worried by such things can install Ghostery or equivalent.”

      And I am sure a lot of people who read this will do exactly that. Nevertheless, this service is obviously morally flawed. You can see it by the fact that websites stopped using it after this study. Even the people who spy, not only those who are spied on, feel that it is wrong to do this.

  21. If these applications transfer credit card data together with the CVV in plaintext, it’s time to inform the big credit card companies like Mastercard and Visa about the implementations on big websites.

    I think they will stop this spook faster than you think, on both sides: the websites and the companies behind this “spyware”.
    For example, retention of the CVV is illegal for payment service providers, and these f.c.ers send it in clear text over the internet.

    The keyword is PCI

  22. The fix for this isn’t people loading and using specific software to block this sort of behaviour.

    Companies and their websites should not be allowed to do this by law. It’s plain wrong…

  23. Besides NoScript, uBlock Origin, uMatrix either in aggregate or not in different Firefox profiles, I’m using this addon on good faith:
    NoProfile by Dennis M. Heine

    Blocks functions used for psychological profiling and tracking by ad networks.

    Companies have started profiling user behaviour, like scrolling or moving the mouse; NoProfile blocks this.
    Blocked functions are:
    -mouseover/out/enter/leave
    -scroll

  24. How about Brave web browser. It has a set of shields that I had not seen on other browsers…

  25. Why worry about sites, which you can avoid revisiting when your phone and computer is doing it non stop with everything? https://apple.stackexchange.com/questions/157424/what-are-api-smoot-apple-com-and-other-hosts-my-iphone-is-secretly-talking-to

  26. On your list of sites one finds: 23069 amnesty.org hotjar.com analytics script exists

    amnesty.org is the Amnesty International website. Surely this is an error. Why would an organisation that defends privacy use software to gather detailed information about persons visiting its site?

    • Steven Englehardt says:

      For the site you reference we only find evidence that a hotjar script is loaded on the page (hence the “analytics script exists” tag). We did not find evidence of session recording. For more information, check out the description at the top of the data release page.

  27. Interesting people mention Ghostery when their parent company, Evidon, sells your data to advertisers. It also contains two trackers owned by Google and Yahoo (see https://reports.exodus-privacy.eu.org/reports/178/). I stopped using it for that reason.

  28. George Capehart says:

    When Ghostery is set up, the user is asked if she is willing to share usage and crash data with Ghostery. It is an opt-in process. The two trackers you mention are used to capture the usage and crash data. I haven’t looked at the Ghostery code to see when those trackers are used, but I would assume that if the user opts not to share their data with Ghostery, they are probably not used . . .

  29. You mentioned fidelity.com and walgreens.com in the above report, but I can’t find those in the zipped CSV file for the full list of the released data. Have they stopped doing that since your reporting?

    • Steven Englehardt says:

      Thanks for reporting this issue! Due to a technical error, we incorrectly excluded some sites while generating the output list, which included fidelity.com and walgreens.com. We’ve corrected the bug and posted the updated list.

  30. Arthur Edelstein says:

    Thanks for this fascinating work! I tried to think of some mitigations browsers could implement:

    1. Perhaps browsers could hide passwords from (third-party) scripts. This would involve hiding the contents of password fields and also censoring KeyboardEvents when the password field has focus.

    2. Maybe Content-Security-Policy could be made finer-grained to prevent unwanted exfiltration. For example, being able to restrict which parts of the dom a third-party script has access to might help.

    Do these make sense? Are there other browser mitigations you would propose?

    • Steven Englehardt says:

      Thanks Arthur! I like your suggestions, and would love to see browsers implement mitigations.

      I’m not sure that it’s possible to differentiate between first and third party scripts after a script has been loaded into a context? At that point it has all the privileges of first-party code, and I don’t know of any way to reliably trace the source of subsequent javascript calls which occur in that context. The call stack will help, but inline code makes that unreliable. I struggled with this when trying to determine whether or not a tracking protection list could be used to selectively restrict API access for scripts loaded from tracking origins (https://bugzilla.mozilla.org/show_bug.cgi?id=1298207). There are proposals that could support this type of access control (http://www.scs.stanford.edu/~deian/pubs/stefan:2014:protecting.pdf), but they haven’t been adopted.

      So without a way to differentiate first and third party within a single context, it might be possible to restrict all scripts from being able to access an element (something like the `writeonly` attribute proposed in: https://mikewest.github.io/credentialmanagement/writeonly/). This might break some sites which handle passwords in an unexpected way.

      As a first step, I think it might be helpful to simply provide a more visible indicator to interested users (and developers) of what’s happening on a page that contains sensitive inputs. For example, display script URLs in a Ghostery-style notification for scripts which do things like register event listeners for keypresses and mouse movements on the top elements, or register blur or change handlers on password inputs. The URL displayed (first vs third party) might be imperfect for the reasons described above, but that information can still be helpful in making it easier to audit a page.

  31. I was contracted to develop this type of activity tracking / session reproduction back in 2010, so far from being a new thing this has been going on for quite a long time. I can’t (or won’t) say who I did it for, as it’s pretty irrelevant. One thing I did put in when I coded it (though it would have been trivial for someone to remove) was to disable the keyboard tracking when a password field was active – though obviously that wouldn’t have helped if someone accidentally typed their password when not in the correct field.

    In a totally unrelated note (as if), I’ve been running ad-blockers and IP blockers for over seven years. I have almost no social media presence, and am very careful not to let anything link between the real world and online. FB for instance has my gender and name correct, but every other detail that it’s come up with is completely wrong, and I have the annoying habit (apparently) of moving the mouse around the screen and clicking randomly when browsing normally 😛

    This is something that has got valid uses (such as education for teachers to see what’s going on with multiple screens – and not using bandwidth hogging screen sharing), but it’s generally a bad thing built to support the fraud that is the advertising market…

  32. Thank you for the informative article and all the work behind it.