November 21, 2024

The Traceability of an Anonymous Online Comment

Yesterday, I described a simple scenario where a plaintiff, who is having difficulty identifying an alleged online defamer, could benefit from subpoenaing data held by a third party web service provider. Some third parties—like Facebook in yesterday’s example—know exactly who I am and know whenever I visit or post on other sites. But even when no third party has the whole picture, it may still be possible to identify me indirectly, by combining data from different third parties. This is possible because loading one webpage can potentially trigger dozens of nearly simultaneous web connections to various third party service providers, whose records can then be subpoenaed and correlated.

Suppose that I post an anonymous and potentially defamatory comment on a Boing Boing article, but Boing Boing for some reason is unable to supply the plaintiff with any hints about who I am—not even my IP address. The plaintiff will only know that my comment was posted publicly at “9:42am on Fri. Feb 5.” But as I mentioned yesterday, Boing Boing—like almost every other site on the web—takes advantage of a handful of useful third party web services.

For example, one of these services—for an article that happens to feature video—is an embedded streaming media service that hosts the video that the article refers to. The plaintiff could issue a subpoena to the video service and ask for information about any user that loaded that particular embedded video via Boing Boing around “9:42am on Fri. Feb 5.” There might be one user match or a few user matches, depending on the site’s traffic at the time, but for simplicity, say there is only one match—me. Because the video service tracks each user with a unique persistent cookie, the service can and probably does keep a log of all videos that I have ever loaded from their service, whether or not I actually watched them. The subpoena could give the plaintiff a copy of this log.

In perusing my video logs, the plaintiff may see that I loaded a different video, earlier that week, embedded into an article on TechCrunch. He may notice further that TechCrunch uses Google Analytics. With two more subpoenas—one to TechCrunch and one to Google—and some simple matching up of dates and times from the different logs, the plaintiff can likely rebuild a list of all the other Analytics-enabled websites that I’ve visited, since these will likely be noted in the records tied to my Analytics cookie.
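The "simple matching up of dates and times" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log rows, cookie IDs, hostnames, the year, and the five-second window are all made up, and real subpoenaed logs would be messier and in whatever format each provider uses.

```python
from collections import Counter
from datetime import datetime, timedelta

# Made-up log rows: (timestamp, cookie ID, page that embedded the service).
video_log = [
    (datetime(2010, 2, 5, 9, 42, 1), "vid-cookie-A", "boingboing.example/post-1"),
    (datetime(2010, 2, 2, 14, 10, 30), "vid-cookie-A", "techcrunch.example/story-7"),
    (datetime(2010, 2, 2, 16, 0, 0), "vid-cookie-B", "techcrunch.example/story-7"),
]
analytics_log = [
    (datetime(2010, 2, 2, 14, 10, 31), "ga-cookie-X", "techcrunch.example/story-7"),
    (datetime(2010, 2, 2, 16, 0, 2), "ga-cookie-Y", "techcrunch.example/story-7"),
    (datetime(2010, 2, 3, 8, 15, 0), "ga-cookie-X", "news.example/home"),
]

def correlate(log_a, log_b, window=timedelta(seconds=5)):
    """Count how often a cookie in log_a and a cookie in log_b hit the
    same page within `window` of each other."""
    pairs = Counter()
    for ts_a, id_a, page_a in log_a:
        for ts_b, id_b, page_b in log_b:
            if page_a == page_b and abs(ts_a - ts_b) <= window:
                pairs[(id_a, id_b)] += 1
    return dict(pairs)

def pages_seen(log, cookie_id):
    """List every page recorded for one cookie ID."""
    return sorted({page for _, cid, page in log if cid == cookie_id})

links = correlate(video_log, analytics_log)
history = pages_seen(analytics_log, "ga-cookie-X")
print(links)
print(history)
```

In this toy data, the video cookie behind the Boing Boing visit pairs with one analytics cookie, and that cookie's other rows then reveal a site the video log never saw. With heavier traffic a single coincidence would be weak evidence; repeated co-occurrences of the same pair would be more convincing.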

The bottom line: From the moment I first load that video on Boing Boing, the plaintiff gains the power to traverse multiple silos of data, held by independent third party entities, to trace my activities and link my anonymous comment to my web browsing history. Given how heavily I use the web, my browsing history will tell the plaintiff a lot about me, and it will probably be enough to uniquely identify who I am.

But this is just one example of many potential paths that a plaintiff could take to identify me. Recall from yesterday that when I visit Boing Boing, the site quietly forwards my information to the servers of at least 17 other parties. Each one of these 17 is a potential subpoena target in the first round of discovery. The information culled from this first round—most importantly, what other websites I’ve visited and at what times—could inform a second round of subpoenas, targeted to these other now-relevant websites and third parties. From there, as you might already be able to tell, the plaintiff can repeat this data linking process and expand the circle of potentially identifying information.
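The repeated linking process is essentially a breadth-first traversal over sites and the third parties they embed. The sketch below assumes two hypothetical lookup tables (which third parties each site loads, and on which sites each third party's logs show my cookie); every hostname is invented. As a simplification, it models only subpoenas to third parties and leaves out subpoenas to the websites themselves.

```python
# Hypothetical: which third parties each site loads on a page view.
embeds = {
    "boingboing.example": ["video.example", "stats.example"],
    "techcrunch.example": ["video.example", "analytics.example"],
    "news.example": ["analytics.example"],
}
# Hypothetical: sites on which each third party's logs show the target's cookie.
sightings = {
    "video.example": ["boingboing.example", "techcrunch.example"],
    "stats.example": ["boingboing.example"],
    "analytics.example": ["techcrunch.example", "news.example"],
}

def subpoena_rounds(start_site):
    """Return the third parties subpoenaed in each round, plus every site
    the target is known to have visited once no new targets remain."""
    known_sites = {start_site}
    asked = set()
    rounds = []
    frontier = [start_site]
    while frontier:
        # Third parties embedded on newly discovered sites, not yet subpoenaed.
        targets = sorted({tp for s in frontier for tp in embeds.get(s, [])} - asked)
        if not targets:
            break
        asked.update(targets)
        rounds.append(targets)
        # Their logs reveal further sites the same user visited.
        new_sites = {s for tp in targets for s in sightings.get(tp, [])} - known_sites
        known_sites |= new_sites
        frontier = sorted(new_sites)
    return rounds, known_sites

rounds, sites = subpoena_rounds("boingboing.example")
print(rounds)
print(sorted(sites))
```

Each pass through the loop corresponds to one round of discovery: the first round targets the services embedded on the original page, and each later round targets services found only on the sites that the previous round uncovered.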

A recent privacy study from Berkeley shows how far such a strategy might reach. The Berkeley researchers found that nearly all of the top 100 sites on the web contain some sort of “web bug,” another term for the hidden web connection that allows a third party to automatically track a user on the site. Some of these sites will load dozens of web bugs on each page visit, which will litter user data far and wide on third party servers. Moreover, the study found that Google Analytics—by far the most popular website statistics service—was used by more than 70% of all sites surveyed in March 2009. Once other Google-run services like DoubleClick and AdSense are added into the calculation, the figure rises to 88% of all sites using some Google service—an astonishingly broad and dominant ability to follow users as they browse the web. Even smaller, but still popular, third parties have significant reach, spanning thousands of sites across the web.
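To get a rough feel for how many third parties a single page contacts, one can list the external hosts its HTML pulls resources from. The following is a minimal sketch using only Python's standard library; the sample page, the hostnames, and the choice of tags to inspect are assumptions for illustration, and it is not the method the Berkeley study used. It also misses anything loaded later by scripts.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class ThirdPartyFinder(HTMLParser):
    """Collect hosts, other than the first party, that a page loads from."""

    def __init__(self, first_party_host):
        super().__init__()
        self.first_party_host = first_party_host
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        # Only tags that trigger an automatic fetch via their src attribute.
        if tag not in {"img", "script", "iframe", "embed"}:
            return
        for name, value in attrs:
            if name != "src" or not value:
                continue
            host = urlparse(value).hostname  # None for relative URLs
            if host is None or host == self.first_party_host:
                continue
            if host.endswith("." + self.first_party_host):
                continue  # treat subdomains as first party
            self.hosts.add(host)

# Made-up page on a hypothetical site, blog.example.
sample_page = """
<html><body>
<img src="/images/logo.png">
<script src="https://stats.example.net/track.js"></script>
<img src="https://pixel.example.org/bug.gif?page=post-1" width="1" height="1">
<iframe src="https://video.example.com/embed/123"></iframe>
<img src="https://cdn.blog.example/photo.jpg">
</body></html>
"""

finder = ThirdPartyFinder("blog.example")
finder.feed(sample_page)
third_parties = sorted(finder.hosts)
print(third_parties)
```

Every host in the resulting list receives a request, and typically a cookie, each time the page loads, whether or not the visitor ever notices the embedded resource.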

The traceability of any given site visitor will still depend on context: the number of third party services used by the site, the popularity of each third party service across the web, the types of identifying data that these parties collect and store, whether the speaker used any online anonymity tools, and many other site-specific factors.

Despite the variability in third party tracing capabilities, the nearly simultaneous connections to a few third party services mean that the results of tracing can be combined. By sleuthing through information held in third party dossiers, logs and databases, plaintiffs in John Doe lawsuits will have many more discovery options than they had ever previously imagined.

Comments

  1. Thanks for writing this.

    I always enjoy reading how other people are imagining we surfers are tracked online. I enjoyed this post extra-much because you didn’t try to create any controversy to sell it- something I see all the time on the Web.

    Thanks again!
    jafraldo

  2. All the methods described so far center on tracking IP numbers and/or browser cookies, and methods to obfuscate or engineer around those.

    One hypothetical thing that will be much harder to address that applies to e.g. comments and other written content is content analysis and classification of patterns in writing and grammatical style, idiosyncrasies (characteristic spelling errors, preferential use of words/phrases, metaphors, …), and the content itself – e.g. description of specific events even if partially anonymized or abstracted, as one may find on gossip/whistle blower sites.

    But here we are probably leaving the legal domain. For example, a corporate entity with access to its employees’ and business partners’ writing samples (from business correspondence) may employ hypothetical statistical analysis and data mining methods to correlate this corpus with e.g. posts or comments on blogs criticizing the firm or airing dirty laundry, in order to identify leaks.

    Likewise, other entities can correlate material with known origin to material with unknown origin to generate promising leads.

    Of course, none of this will be court admissible (under current standards), but it doesn’t necessarily have to be. In the whistle blower case, somebody deemed a leak need not be sued but can be cut off from the information flow or “managed out”.

    • On a marginally related note, I was once part of an internal investigation at a company where a former employee was suspected of having incorporated proprietary source code into his own open source product. The trigger was not any specific information about the source code, but the way the product was described and positioned, and of course the known authorship.

      The guy had done a fairly good job of obfuscating the superficial appearance and even rearranged large parts of the control flow at the level of conditional statements etc., but of course he could not obfuscate the bigger-picture operation of the software, and a lot of algorithmic patterns and identifiers complete with characteristic spelling mistakes were still in place.

      I don’t know whether what we unearthed would have been sufficient for a courtroom, but in the end the company decided not to pursue the matter as no financial damages or incentives were involved and it would have pissed off an important client that was at the time employing the guy.

      • This comment together with some other material could probably be used to identify me, by people who know about the above.

  3. There are services like ixquick.com which provide anonymizing proxy service. The result is that all a web site will see is the IP address of the proxy. Yes, there are things this breaks, but that’s a feature, not a bug.

    • As the recent posts on the topic explained, this is not foolproof, as your ISP can be expected to log your proxy communication metadata (DNS lookup, connection times (?)), and the proxy may have logging of its own to which it may be legally obliged or not, and which it of course won’t advertise.

  4. Jon Garfunkel says

    Danielle Citron pointed me to this discussion.

    Two years ago, I tried this very experiment with Concurring Opinions, a well-known legal blog that regularly covers online privacy; its editor, Dan Solove, has written a few books on it.
    Its SiteMeter account shows the IP addresses of every visitor.
    Dan conceded the risk and asked his readers if they minded… and they didn’t mind enough to ask him to change the configuration.

    Read all about it:
    http://www.concurringopinions.com/archives/2007/12/blogs_and_priva.html

    Jon

    • The Thirteenth Commenter says

      Jon,

      Just for the record, I do not comment at Concurring Opinions.

      This is a voluntary choice. To be clear: No one has ever asked me not to comment there. But the technical configuration of that blog is unfriendly towards me.

    • The way this usually works is that the original page links to an image or another considered-essential resource (Javascript, CSS, …) that the browser will download to render the page, without having a material impact on the page layout, or on the display of the content you are interested in. You can run a local proxy (e.g. privoxy) and block such sites.

      A side effect on the blog operator that is probably not so cool is that it will distort the “page hit” etc. metrics, which will affect the attention economy ranking of the site, and possibly affect any ad/referral income the site generates (if applicable). And of course your IP address will probably still show up in a variety of other logs.

  5. Aren’t methods like this simply defeated by using say Adblock or Noscript or a combination of the two? For my own particular reasons, I don’t allow ANY web analytics sites to connect to my PC (especially Google’s)

    And I suppose if you are REALLY paranoid about being caught… what’s to keep someone from using an open access point using a random MAC address, running a browser in a virtual machine using a linux live-cd, hopping through say 10 different proxy servers?
    (extreme example… I know) …or simply use someone else’s computer while they’re not looking.

    • I agree it’s possible to defeat tracing with the appropriate tools. And I don’t mean to imply that it’s impossible to browse the web anonymously. The problem is that only experts have the know-how to use all of the right tools to browse anonymously, and even with the right tools, managing them all correctly can be nearly a full-time job. The vast majority of users have no idea that this kind of tracing is even possible. They just use their vanilla web browsers that subject them by default to expansive third party tracking.

      On a more technical level, using Adblock and Noscript is probably still insufficient. For instance, a page with an embedded streaming video will still redirect the browser to fetch content from YouTube.com. I doubt there are any Adblock filter lists that block YouTube.com.

      The best way to defeat this kind of tracing is simply to block all third party connections, but you’ll lose a lot of useful functionality if you do.

      • “The best way to defeat this kind of tracing is simply to block all third party connections, but you’ll lose a lot of useful functionality if you do.”

        That clearly depends on the point of view. I originally started running a local filter proxy because websites started overloading their pages with so much stuff that page load time was noticeably affected, even with first world “Silicon Valley” broadband internet, and even without intrusive ads (in those days). The problem is not just bandwidth but connection delay. Privacy is a welcome aspect but not the original motivation. Fortunately privacy enhancement (by blocking) correlates with browser performance improvement.

        I have a four-pronged exclusion approach: Use Mozilla’s “load images only from originating server”, javascript off by default, blocking of “known offenders” on top of the proxy’s default blocked list, plus disabling all the silly plugin download popups. It is remarkably effective. Of course I have to update the list from time to time, which is usually triggered by new slowdowns I notice.

        For “essential” sites that won’t work through the proxy I use a separate browser configuration.

        I would agree that’s probably something most people wouldn’t want to do, but it seems to work for me. But maybe I’m an internet luddite.

  6. Jesse Weinstein says

    If the comment was posted while using a proxy service (e.g. Tor) (or even merely a dynamic IP), and a fresh browser, wouldn’t that break the rest of the chain? Sure, you could subpoena Google (or the video hosting site) to find an IP address, but if it was a Tor node, the list of other places it was used wouldn’t tell you anything useful. Am I missing something here?

    • Tor won’t help prevent this kind of tracing. As you point out, IP addresses can be dynamic or shared (think NAT), so third parties almost always track users with non-IP-based methods—using cookies, fingerprinting, etc. Even if my IP changes all the time, my browser can still be identified uniquely by the server unless I take special extra precautions to purge cookies, “standardize” my browser, and so on.

      Also, regardless of whether I’m using my actual IP or the proxy’s IP, all of Boing Boing’s third party services will receive an HTTP connection from the same IP address at about the same time. This is enough to correlate these connections and in turn my various third party profiles.

    • Don’t forget that people have been traced through search queries!

      Remember that AOL thingy some years ago? They released “anonymized” info about their clients’ search history. A news team managed to identify somebody.

  7. Ambiguously Anonymous says

    This is an amusing and thought-provoking worst-case scenario, but is it actually likely to happen to anyone? Would a judge really grant a string of subpoenas like this just to identify an anonymous commenter, involving a bunch of third parties and resulting in evidence that seems awfully circumstantial? And would any individual or corporation actually go to this much effort, rather than just deleting the comment?

    I’m doubtful, but also genuinely curious b/c this isn’t a legal issue I’m familiar with. Has this happened? Are there cases where this strategy could have been used but wasn’t? Hope someone can fill me in.

    • It’s a good question, and we don’t have all of the answers. We’ll try to address some of these issues later this week.

  8. Your article sure blows away any sense of security provided by the “allow cookies only from sites I visit” settings that some browsers offer. You don’t need cookies at all to accomplish what you’re talking about. A site you visit embeds your personal information into the URLs that your browser requests from analytics engines and other sites (such as Facebook in your example). It’s interesting that fetching URLs from multiple sites is such a common function of a piece of HTML that I’m at a loss to think of a way to avoid leaving your data all over the place.