April 25, 2024

The Traceability of an Anonymous Online Comment

Yesterday, I described a simple scenario where a plaintiff, who is having difficulty identifying an alleged online defamer, could benefit from subpoenaing data held by a third party web service provider. Some third parties—like Facebook in yesterday’s example—know exactly who I am and know whenever I visit or post on other sites. But even when no third party has the whole picture, it may still be possible to identify me indirectly, by combining data from different third parties. This is possible because loading one webpage can potentially trigger dozens of nearly simultaneous web connections to various third party service providers, whose records can then be subpoenaed and correlated.

Suppose that I post an anonymous and potentially defamatory comment on a Boing Boing article, but Boing Boing for some reason is unable to supply the plaintiff with any hints about who I am—not even my IP address. The plaintiff will only know that my comment was posted publicly at “9:42am on Fri. Feb 5.” But as I mentioned yesterday, Boing Boing—like almost every other site on the web—takes advantage of a handful of useful third party web services.

For example, one of these services—for an article that happens to feature video—is an embedded streaming media service that hosts the video that the article refers to. The plaintiff could issue a subpoena to the video service and ask for information about any user that loaded that particular embedded video via Boing Boing around “9:42am on Fri. Feb 5.” There might be one user match or a few user matches, depending on the site’s traffic at the time, but for simplicity, say there is only one match—me. Because the video service tracks each user with a unique persistent cookie, the service can and probably does keep a log of all videos that I have ever loaded from their service, whether or not I actually watched them. The subpoena could give the plaintiff a copy of this log.

In perusing my video logs, the plaintiff may see that I loaded a different video, earlier that week, embedded into an article on TechCrunch. He may notice further that TechCrunch uses Google Analytics. With two more subpoenas—one to TechCrunch and one to Google—and some simple matching up of dates and times from the different logs, the plaintiff can likely rebuild a list of all the other Analytics-enabled websites that I’ve visited, since these will likely be noted in the records tied to my Analytics cookie.
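The correlation step the plaintiff performs is essentially a timestamp join across third party logs. Below is a minimal sketch of that matching; the log entries and field names are invented for illustration and imply no real service’s schema:

```python
from datetime import datetime, timedelta

# Hypothetical excerpt of the video service's access log, keyed by its
# persistent cookie. All values here are invented.
video_log = [
    {"cookie": "vid-7f3a", "page": "boingboing.net/article", "ts": datetime(2010, 2, 5, 9, 42, 3)},
    {"cookie": "vid-91bc", "page": "boingboing.net/article", "ts": datetime(2010, 2, 5, 11, 5, 0)},
]

def users_near(log, page, when, window=timedelta(minutes=2)):
    """Persistent cookie IDs that loaded `page` within `window` of `when`."""
    return {entry["cookie"] for entry in log
            if entry["page"] == page and abs(entry["ts"] - when) <= window}

# The only public fact: the comment appeared at 9:42am on Fri. Feb 5.
comment_time = datetime(2010, 2, 5, 9, 42)
suspects = users_near(video_log, "boingboing.net/article", comment_time)
# On a low-traffic page this can narrow the match to a single cookie, and
# that cookie then keys into the service's full viewing history.
```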

The bottom line: From the moment I first load that video on Boing Boing, the plaintiff gains the power to traverse multiple silos of data, held by independent third party entities, to trace my activities and link my anonymous comment to my web browsing history. Given how heavily I use the web, my browsing history will tell the plaintiff a lot about me, and it will probably be enough to uniquely identify who I am.

But this is just one example of many potential paths that a plaintiff could take to identify me. Recall from yesterday that when I visit Boing Boing, the site quietly forwards my information to the servers of at least 17 other parties. Each one of these 17 is a potential subpoena target in the first round of discovery. The information culled from this first round—most importantly, what other websites I’ve visited and at what times—could inform a second round of subpoenas, targeted to these other now-relevant websites and third parties. From there, as you might already be able to tell, the plaintiff can repeat this data linking process and expand the circle of potentially identifying information.
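This round-by-round expansion behaves like a breadth-first traversal over data silos: each subpoena round reveals new sites, and those sites reveal new third parties to subpoena. A toy sketch, with entirely invented site and service names:

```python
from collections import deque

# Invented data: which third parties are embedded on each site, and which
# other sites each third party's logs place our user on.
presence = {
    "boingboing.net": {"videohost", "analytics"},
    "techcrunch.com": {"analytics", "adnet"},
}
sightings = {
    "videohost": {"techcrunch.com"},
    "analytics": {"techcrunch.com", "example-blog.com"},
    "adnet": set(),
}

def expand(start_site):
    """Breadth-first expansion of the circle of identifying information."""
    known_sites, frontier = {start_site}, deque([start_site])
    while frontier:
        site = frontier.popleft()
        for party in presence.get(site, ()):        # one round of subpoenas
            for other in sightings.get(party, ()):  # sites found in their logs
                if other not in known_sites:
                    known_sites.add(other)
                    frontier.append(other)
    return known_sites

print(sorted(expand("boingboing.net")))
```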

A recent privacy study from Berkeley shows how far such a strategy might reach. The Berkeley researchers found that nearly all of the top 100 sites on the web contain some sort of “web bug,” another term for the hidden web connection that allows a third party to automatically track a user on the site. Some of these sites load dozens of web bugs on each page visit, littering user data far and wide on third party servers. Moreover, the study found that Google Analytics—by far the most popular website statistics service—was used by more than 70% of the sites surveyed in March 2009. When other Google-run services like DoubleClick and AdSense are added to the calculation, the figure rises to 88% of surveyed sites using at least one Google service—an astonishingly broad and dominant ability to follow users as they browse the web. And even smaller, but still popular, third party entities have significant reach across thousands of sites on the web.

The traceability of any given site visitor will still depend on context: the number of third party services used by the site, the popularity of each third party service across the web, the types of identifying data that these parties collect and store, whether the speaker used any online anonymity tools, and many other site-specific factors.

Despite this variability in third party tracing capabilities, the nearly simultaneous connections to a handful of third party services mean that the results of tracing can be combined. By sleuthing through information held in third party dossiers, logs and databases, plaintiffs in John Doe lawsuits will have many more discovery options than they had ever previously imagined.

What Third Parties Know About John Doe

As David mentioned in his previous post, plaintiffs’ lawyers in online defamation suits will typically issue a sequence of two “John Doe” subpoenas to try to unmask the identity of anonymous online speakers. The first subpoena goes to the website or content provider where the allegedly defamatory remarks were posted, and the second subpoena is sent to the speaker’s ISP. Both entities—the content provider and the ISP—are natural targets for civil discovery. Their logs together will often contain enough information to trace the remarks back to the speaker’s real identity. But when this isn’t enough to identify the speaker, the discovery process traditionally fails.

Are plaintiffs in these cases out of luck? Not if their lawyers know where else to look.

There are numerous third party web services that may hold just enough clues to reidentify the speaker, even without the help of the content provider or the ISP. The vast majority of websites today depend on third parties to deliver valuable services that would otherwise be too expensive or time-consuming to develop in-house. Services such as online advertising, content distribution and web analytics are almost always handled by specialized servers run by third party businesses. Because a third party can embed its service into a wide variety of sites across the web, it can track users across all the sites where it maintains a presence.

Take, for example, the popular blog Boing Boing. Loading its main page while recording the HTTP session, I noticed that my browser was automatically redirected to domains owned by no fewer than 17 distinct third party entities: ten services that engage in advertising or marketing, five that embed media or integrate social networking functionality, and two that provide web analytics. By visiting this single webpage, my digital footprints were scattered to and collected by at least 17 other online entities that I made no deliberate attempt to contact. And each of these entities will likely have stored a cookie on my web browser, allowing it to identify me uniquely later when I browse to one of its other partner sites. I don’t mean to pick on Boing Boing specifically—taking advantage of third party services is a nearly universal practice on the web today. But it is exactly this pervasiveness that makes it so likely, if not certain, that all of my digital footprints together could link much of my online activity back to my actual identity.
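The triage I did on the recorded session can be sketched as a small script: given the URLs the browser requested while the page loaded, list the hosts that don’t belong to the first party. The captured URLs below are invented stand-ins for a real session log:

```python
from urllib.parse import urlparse

def third_party_hosts(requested_urls, first_party="boingboing.net"):
    """Hosts contacted during a page load that aren't the site itself
    or one of its own subdomains."""
    hosts = set()
    for url in requested_urls:
        host = urlparse(url).hostname or ""
        if host != first_party and not host.endswith("." + first_party):
            hosts.add(host)
    return hosts

# Illustrative requests captured while the page loaded.
session = [
    "https://boingboing.net/article",
    "https://static.boingboing.net/style.css",
    "https://www.google-analytics.com/ga.js",
    "https://api.facebook.com/restserver.php",
]
print(sorted(third_party_hosts(session)))
```

A real capture would of course list many more hosts; grouping them by owning entity is what yields the count of distinct third parties.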

To make this point concrete, let’s say I post a potentially defamatory remark about someone using a pseudonym in the comments section of a Boing Boing article. It happens that for each article, Boing Boing displays the number of times that the article has been shared on Facebook. In order to fetch the current number, Boing Boing redirects my browser to api.facebook.com to make a real-time query to the Facebook API. Since I happen to be logged in to Facebook at the time of the request, my browser forwards with the query my unique Facebook cookie, which includes information that explicitly identifies me—namely, my e-mail address that doubles as my Facebook username.
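On the wire, that embedded query looks roughly like the request below. The API path, cookie names, and values are all invented for illustration; the point is simply that the browser attaches the stored facebook.com cookie on its own:

```python
# Roughly what the embedded cross-site request looks like. Every header
# value here is a made-up placeholder.
request = (
    "GET /method/links.getStats?urls=boingboing.net/article HTTP/1.1\r\n"
    "Host: api.facebook.com\r\n"
    "Referer: https://boingboing.net/article\r\n"
    # The browser automatically attaches any stored facebook.com cookies,
    # even though the user never deliberately contacted Facebook here.
    "Cookie: login=user%40example.com; session=abc123\r\n"
    "\r\n"
)
print(request)
```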

In order to integrate a bit of useful social networking functionality, Boing Boing enables Facebook, a third party in this situation, to learn which articles I visit on Boing Boing and the dates and times of my visits. The same is true for Tweetmeme, which can now positively link my Twitter account—which I’m also logged in to—with my Boing Boing visits. Even without an authenticated login, the 15 other third parties present on Boing Boing could track me using any number of different methods, including browser fingerprinting, to build detailed dossiers that slowly begin to piece together who I am.

From the perspective of a plaintiff’s lawyer, even if Boing Boing is unwilling or unable to produce any useful information, these third parties might be able to uniquely identify me as the likely defamer, or at least narrow the list of possible speakers down to a handful of users. But tracing speech is not always this easy. Tomorrow, I’ll discuss more complicated discovery strategies and the extent to which they are technically feasible.

Introducing RECAP: Turning PACER Around

With today’s technologies, government transparency means much more than the chance to read one document at a time. Citizens today expect to be able to download comprehensive government datasets that are machine-processable, open and free. Unfortunately, government is much slower than industry when it comes to adopting new technologies. In recent years, private efforts have helped push government, the legislative and executive branches in particular, toward greater transparency. Thus far, the judiciary has seen relatively little action.

Today, we are excited to announce the public beta release of RECAP, a tool that will help bring an unprecedented level of transparency to the U.S. federal court system. RECAP is a plug-in for the Firefox web browser that makes it easier for users to share documents they have purchased from PACER, the court’s pay-to-play access system. With the plug-in installed, users still have to pay each time they use PACER, but whenever they do retrieve a PACER document, RECAP automatically and effortlessly donates a copy of that document to a public repository hosted at the Internet Archive. The documents in this repository are, in turn, shared with other RECAP users, who will be notified whenever documents they are looking for can be downloaded from the free public repository. RECAP helps users exercise their rights under copyright law, which expressly places government works in the public domain. It also helps users advance the public good by contributing to an extensive and freely available archive of public court documents.

The project’s website, https://www.recapthelaw.org, has all of the details: how to install RECAP, a screencast of the plug-in in action, more discussion of why this issue matters, and a host of other goodies.

The repository already has over one million documents available for free download. Together, with the help of RECAP users, we can recapture truly public access to the court proceedings that give our laws their practical meaning.

Lessons from the Fall of NebuAd

With three Congressional hearings held within the past four months, U.S. legislators have expressed increased concern about the handling of private online information. As Paul Ohm mentioned yesterday, the recent scrutiny has focused mainly on the ability of ISPs to intercept and analyze the online traffic of their users: in a word, surveillance. For ISPs, one goal of surveillance is to yield new sources of revenue; so when a Silicon Valley startup called NebuAd approached ISPs last spring with its behavioral advertising technology, many were quick to sign on. But by summer’s end, the company had lost all of its ISP partners, its CEO had resigned, and it had announced its intention to pursue “more traditional” advertising channels.

How did this happen and what can we learn from this episode?

The trio of high-profile hearings in Congress brought the issue of ISP surveillance into the public spotlight. Even though no new privacy legislation was so much as proposed in the area, the firm sentiment among the Committees’ members, particularly Rep. Edward “When did you stop beating the consumer?” Markey (D-MA), was enough to spawn more negative PR than the partner ISPs could handle. The lesson here, as it often is, is that regulation is not the only way, and rarely even the best way, of dealing with bad actors, especially in highly innovative sectors like Internet technology. Proposed regulation of third-party online advertising by the New York State Assembly last year, for example, would have placed an undue compliance burden on legitimate online businesses while providing few tangible privacy benefits. Proponents of net neutrality legislation may want to heed this episode as a cautionary tale, especially in light of Comcast’s recent shift to more reasonable traffic management techniques.

Behind the scenes, the work of investigative technologists was key in substantiating the extent of consumer harm that, I presume, caught the eye of Congress members and their staffers. A damaging report by technologist Robb Topolski, published a month before the first hearing, exposed many of NebuAd’s most egregious practices, such as IP packet forgery. Such technical efforts are critical in unveiling opaque consumer harms that may be difficult for lay users to detect themselves. To return to net neutrality, ISP monitoring projects such as EFF’s Switzerland testing tool and others will be essential in keeping network management practices in check. (Incidentally and perhaps not coincidentally, Topolski was also the first to reveal Comcast’s use of TCP reset packets to kill BitTorrent connections.)

ISPs and other online service providers are pushing for industry self-regulation in behavioral advertising, but it is not at all clear whether self-regulation will be sufficient to protect consumer privacy. The FTC, too, currently favors self-regulatory principles, but the question of what “opt-in” actually means will determine the extent of consumer protection. Self-regulation seems unlikely in any case to protect consumers from unwittingly “opting in” to traffic monitoring. ISPs have a monetary incentive to enroll their customers into monitoring, and standard tricks will probably get the job done. We all have experience signing fine-print contracts without reading them, clicking blindly through browser-based security warnings, or otherwise sacrificing our privacy for trivial rewards and discounts (or even just a bar of chocolate).

Interestingly enough, a parallel fight is being waged in Europe over the exact same issue, but with starkly contrasting results. Although Phorm develops online surveillance technologies for targeted advertising similar to NebuAd’s, a UK regulator recently declared that Phorm’s technologies could be introduced “in a lawful, appropriate and transparent fashion” given “the knowledge and agreement of the customer.” As a result, Phorm has continued trials of its Internet surveillance technology on British Telecom subscribers.

Why these two storylines have diverged so significantly is not apparent to me. One thought is that Phorm got out in front of the issue of business legitimacy: whereas U.S. regulators saw NebuAd as a rogue business from the start, Phorm has been an active participant on the IAB’s Behavioural Advertising Task Force to develop industry best practices. Another thought is that the fight over Phorm is far from over, since the European Commission is continuing its own investigation under EU law. I hope readers here, who are more informed than I am about the regulatory landscape in the EU and UK, can provide additional hypotheses about why Phorm has, thus far, not suffered the same fate as NebuAd.

Phorm's Harms Extend Beyond Privacy

Last week, I wrote about the privacy concerns surrounding Phorm, an online advertising company that has teamed up with British ISPs to track users’ Web behavior from within their networks. New technical details about its Webwise system have since emerged, and it’s not just privacy that now seems to be at risk. The new report exposes a system that actively degrades the user experience and alters interactions between users and content providers. Even more important, the Webwise system is a clear violation of the sacred end-to-end principle that guides the core architectural design of the Internet.

Phorm’s system does more than just passively gain “access to customers’ browsing records,” as previously suggested. Instead, Phorm plans to install a network switch at each participating ISP that actively interferes with the user’s browsing session, injecting multiple URL redirections before the user can retrieve the requested content. Sparing you most of the nitty-gritty technical details: the switch intercepts the initial HTTP request to the content server to check whether a Webwise cookie, containing the user’s randomly assigned identifier (UID), exists in the browser. It then impersonates the requested server to trick the browser into accepting a spoofed cookie (which I will explain later) that contains the same UID. Only then will the switch forward the request and return the actual content to the user. Basically, this amounts to a big technical hack by Phorm to set the cookies that track users as they browse the Web.

In all, a user’s initial request is redirected three times for each domain contacted. Though this may not seem like much, the extra layer of indirection degrades the overall browsing experience, imposing an unnecessary delay that will likely be noticeable to users.

The spoofed cookie that Phorm stores on the user’s browser during this process is also a highly questionable practice. Generally speaking, a cookie is specific to a particular domain and the browser typically ensures that a cookie can only be read and written by the domain it belongs to. For example, data in a yahoo.com cookie is only sent when you contact a yahoo.com server, and only a yahoo.com server can put data into that cookie.

But since Phorm controls the switch at the ISP, it can bypass this usual guarantee by impersonating the server to add cookies for other domains. To continue the example, the switch (1) intercepts the user’s request, (2) pretends to be a yahoo.com server, and (3) injects a new yahoo.com cookie that contains the Phorm UID. The browser, believing the cookie to actually be from yahoo.com, happily accepts and stores it. This cookie is used later by Phorm to identify the user whenever the user visits any page on yahoo.com.
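The scoping rule, and how an in-network impersonator defeats it, can be modeled with a toy cookie jar. Real browsers implement the much richer rules of RFC 6265; this sketch keeps only the basic guarantee that a cookie is stored under whatever domain the response appears to come from:

```python
# A toy cookie jar. The one property modeled: cookies are filed under the
# domain of the (apparent) responding server, and sent back only to it.
class CookieJar:
    def __init__(self):
        self.jar = {}  # domain -> {cookie name: value}

    def set_from_response(self, response_domain, name, value):
        self.jar.setdefault(response_domain, {})[name] = value

    def for_request(self, domain):
        return dict(self.jar.get(domain, {}))

browser = CookieJar()
# Normal case: yahoo.com sets its own cookie.
browser.set_from_response("yahoo.com", "session", "abc123")
# Phorm's switch, impersonating yahoo.com from inside the network, injects
# its UID under the same domain; the browser cannot tell the difference.
browser.set_from_response("yahoo.com", "phorm_uid", "UID-42")
# Both cookies now accompany every request to yahoo.com.
print(browser.for_request("yahoo.com"))
```

The "UID-42" value is a stand-in; in the real system the UID is randomly assigned, as described above.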

Cookie spoofing is problematic because it can change the interaction between the user and the content-providing site. Suppose a site’s privacy policy promises the user that it does not use tracking cookies. But because of Phorm’s spoofing, the browser will store a cookie that (to the user) looks exactly like a tracking cookie from the site. Now, the switch typically strips out this tracking cookie before it reaches the site, but if the user moves to a non-Phorm ISP (say at work), the cookie will actually reach the site in violation of its stated privacy policy. The cookie can also cause other problems, such as a cookie collision if the site cookie inadvertently has the same name as the Phorm cookie.

Disruptive activities inside the network often create these sorts of unexpected problems for both users and websites, which is why computer scientists are skeptical of ideas that violate the end-to-end principle. For the uninitiated: the principle, in short, states that system functionality should almost always be implemented at the end hosts of the network, with only a few justifiable exceptions. For instance, almost all security functionality (such as data encryption and decryption) is handled at the end hosts and only rarely by machines inside the network.

The Webwise system has no business being inside the network and has no role in transporting packets from one end of the network to the other. The technical Internet community has been worried for years about the slow erosion of the end-to-end principle, particularly by ISPs who are looking to further monetize their networks. This principle is the one upon which the Internet is built and one which the ISPs must uphold. Phorm’s system, nearly in production, is a cogent realization of this erosion, and ISPs should keep Phorm outside the gate.