October 12, 2024

Identifying John Doe: It might be easier than you think

Imagine that you want to sue someone for what they wrote, anonymously, in a web-based online forum. To succeed, you’ll first have to figure out who they really are. How hard is that task? It’s a question that Harlan Yu, Ed Felten, and I have been kicking around for several months. We’ve come to some tentative answers that surprised us, and that may surprise you.

Until recently, I thought the picture was very grim for would-be plaintiffs, writing that it should be simple for “even a non-technical Internet user to engage in effectively untraceable speech online.” I still think it’s feasible for most users, if they make enough effort, to remain anonymous despite any level of scrutiny they are practically likely to face. But in recent months, as Harlan, Ed, and I have discussed this issue, we’ve started to see a flip side to the coin: In many situations, it may be far easier to unmask apparently anonymous online speakers than they, I, or many others in the policy community have appreciated. Today, I’ll tell a story that helps explain what I mean.

Anonymous online speech is a mixed bag: it includes some high-value speech such as political dissent in repressive regimes, some dreck we happily tolerate on First Amendment grounds, and some material that violates the laws of many jurisdictions, including child pornography and defamatory speech. For purposes of this discussion, let’s focus on cases like the recent AutoAdmit controversy, in which a plaintiff wishes to bring a defamation suit against an anonymous or pseudonymous poster to a web-based discussion forum. I’ll assume, as in the AutoAdmit suit, that the plaintiff has at least a facially plausible legal claim, so that if everyone’s identity were clear, it would also be clear that the plaintiff would have the legal option to bring a defamation suit. In the online context, these are usually what’s called “John Doe” suits, because the plaintiff’s lawyer does not know the name of the defendant in the suit, and must use “John Doe” as a stand-in name for the defendant. After filing a John Doe suit, the plaintiff’s lawyer can use subpoenas to force third parties to reveal information that might help identify the John Doe defendant.

In situations like these, if a plaintiff’s lawyer cannot otherwise determine who the poster is, the lawyer will typically subpoena the forum web site, seeking the IP address of the anonymous poster. Many widely used web-based discussion systems, including for example the popular WordPress blogging platform, routinely log the IP addresses of commenters. If the web site is able to provide an IP address for the source of the allegedly defamatory comment, the lawyer will do a reverse lookup, a WHOIS search, or both, on that IP address, hoping to discover that the IP address belongs to a residential ISP or another organization that maintains detailed information about its individual users. If the IP address does turn out to correspond to a residential ISP (rather than, say, to an open Wi-Fi hub at a coffee shop or library), then the lawyer will issue a second subpoena, asking the ISP to reveal the account details of the user who was using that IP address at the time it was used to transmit the potentially defamatory comment. This is known as a “subpoena chain” because it involves two subpoenas (one to the web site, and a second one, based on the results of the first, to the ISP).
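The reverse-lookup step can be sketched in a few lines of Python. This is a minimal sketch, not a forensic tool: `socket.gethostbyaddr` is the standard-library reverse DNS call, while the helper names, the injectable `resolver` parameter, and the example ISP suffix are assumptions made purely for illustration. A hostname that ends in a residential ISP's domain is only a hint that a second subpoena might be worthwhile, and a WHOIS query would be a separate step that is not shown here.

```python
import socket


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the reverse-DNS hostname for ip, or None if none is registered.

    resolver is injectable (an assumption of this sketch) so the function
    can be exercised without touching the network.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return None
    return hostname


def looks_like_isp_customer(hostname, isp_suffixes):
    """Guess whether a hostname falls under one of the given ISP domains.

    isp_suffixes is a hypothetical, hand-maintained list such as
    [".example-isp.net"]; it is not derived from any real registry.
    """
    if hostname is None:
        return False
    name = hostname.lower()
    return any(name.endswith(suffix) for suffix in isp_suffixes)
```

With the default resolver, `reverse_lookup` performs a live DNS query, so a real inquiry would pass in the address taken from the forum's logs. Many addresses have no reverse record at all, which is one reason the lawyer in the story may need a WHOIS search as well.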

Of course, in many cases, this method won’t work. The forum web site may not have logged the commenter’s IP address. Or, even if an address is available, it might not be readily traceable back to an ISP account: the anonymous commenter may have been using an anonymization tool like Tor to hide his address. Or he may have been coming online from a coffee shop or similarly public place (which typically will not have logged information about its transient users). Or, even if he reached the web forum directly from his own ISP, that ISP might be located in a foreign jurisdiction, beyond the reach of an American lawyer’s usual legal tools.

Is this a dead end for the plaintiff’s lawyer, who wants to identify John Doe? Probably not. There are a range of other parties, not yet part of our story, who might have information that could help identify John Doe. When it comes to the AutoAdmit site, one of these parties is StatCounter.com, a web traffic measurement service that AutoAdmit uses to keep track of trends in its traffic over time.

At the moment I am writing this post, anyone can verify that AutoAdmit uses StatCounter by visiting AutoAdmit.com and choosing “View Source” from the web browser menu. The first screenful of web page code that comes up includes a block of text helpfully labeled “StatCounter Code,” which in turn runs a small piece of JavaScript that places a personalized StatCounter cookie on the machine of every user who visits AutoAdmit, or else (if one is already present) detects and records exactly which cookie it is. That’s how StatCounter can tell which visitors to AutoAdmit.com are new, which ones are returning, and which pages on the site are of greatest interest to new and returning users. StatCounter is in a position to track not only each user, but also each page, and each visit by a user to a certain page, over time. This includes not only the home page, but also the particular web page for each discussion “thread” on the site. Moreover, each post (even if anonymous) is marked with the time it was posted, down to the minute. So the plaintiff’s lawyer in our story could go to StatCounter, and ask only about visits to the particular thread where the relevant message was posted. If the post went up at 6:03 p.m. on a certain date, the lawyer could ask StatCounter, “What if anything do you know about the person who visited this web page at 6:03 p.m. on this date?” Of course, if John Doe’s browser is configured to refuse cookies, he wouldn’t be trackable. But most web-based discussion sites, including AutoAdmit, rely on cookies to let people log in to their pseudonymous accounts in order to post comments in the first place. In any case, the web is a much less convenient place without cookies, and as a practical matter most users do allow them.

In fact, the lawyer may be able to do better still: The anonymous commenter will have accessed the page at least twice — once to view the discussion as it stood before he took part, and again after clicking the button to add his own post to the mix. If StatCounter recorded both visits, as it very likely would have, then it becomes even easier to tie the anonymous commenter to his StatCounter cookie (and to whatever browsing history StatCounter has associated with that cookie).
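To make the matching concrete, here is a hedged sketch of how visit records could be narrowed down to a likely cookie. Everything about the data layout is an assumption: the `Visit` record, its field names, and the one-minute window are invented for illustration, and nothing here reflects how StatCounter actually stores or exposes its data.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import NamedTuple


class Visit(NamedTuple):
    """One logged page view (hypothetical record layout)."""
    cookie_id: str
    url: str
    when: datetime


def candidate_cookies(visits, thread_url, post_time,
                      window=timedelta(minutes=1)):
    """Rank cookies by how many times they hit thread_url near post_time.

    Returns (cookie_id, hit_count) pairs, most hits first. A cookie that
    appears twice (once to read, once to post) ranks above one that
    appears only once.
    """
    hits = Counter(
        v.cookie_id
        for v in visits
        if v.url == thread_url and abs(v.when - post_time) <= window
    )
    return hits.most_common()
```

Because posts are timestamped only to the minute, the window has to be at least that wide; on a busy thread several cookies may fall inside it, which is why the repeated visit described above is useful for telling them apart.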

There are a huge number of things to discuss here, and we’ll tackle several in the coming days. What would a web analytics provider like StatCounter know? Likely answers include IP addresses, times, and durations for the anonymous commenter’s previous visits to AutoAdmit. What about other, similar services, used by other sites? What about “beacons” that simply and silently collect data about users, and pay webmasters for the privilege? What about behavioral advertisers, whose business model involves tracking users across multiple sites and developing knowledge of their browsing habits and interests? What about content distribution networks? How would this picture change if John Doe were taking affirmative steps, such as using Tor, to obfuscate his identity?

These are some of the questions that we’ll try to address in future posts.

CITP Seeks Visiting Faculty, Scholars or Policy Experts for 2010-2011

The Center for Information Technology Policy (CITP) at Princeton University seeks candidates for positions as visiting faculty members or researchers, or postdoctoral research associates for the 2010-2011 academic year.

About CITP

Digital technologies and public life are constantly reshaping each other—from net neutrality and broadband adoption, to copyright and file sharing, to electronic voting and beyond.

Realizing digital technology’s promise requires a constant sharing of ideas, competencies and norms among the technical, social, economic and political domains.

The Center for Information Technology Policy is Princeton University’s effort to meet this challenge. Its new home, which opened in September 2008, is a state of the art facility designed from the ground up for openness and collaboration. Located at the intellectual and physical crossroads of Princeton’s engineering and social science communities, the Center’s research, teaching and public programs are building the intellectual and human capital that our technological future demands.

To see what this mission can mean in practice, take a look at our website, at http://citp.princeton.edu.

About the Search

The Center has secured limited resources from a range of sources to support visiting faculty, scholars or policy experts for up to one-year appointments during the 2010-2011 academic year. We are interested in applications from academic faculty and researchers as well as from individuals who have practical experience in the policy arena. The rank and status of the successful applicant(s) will be determined on a case-by-case basis. We are particularly interested in hearing from faculty members at other universities and from individuals who have first-hand experience in public service in the technology policy area.

The successful applicant(s) will conduct research, engage in public programs, and may teach a seminar during their appointment subject to review and approval by the Dean of the Faculty. They’ll play an important role at a pivotal time in the development of this new center. They may be appointed to a visiting faculty or visiting fellow position, a term-limited research position, or a postdoctoral appointment, depending on qualifications.

We are happy to hear from anyone who works at the intersection of digital technology and public life. In addition to our existing strengths in computer science and sociology, we are particularly interested in identifying engineers, economists, lawyers, civil servants and policy analysts whose research interests are complementary to our existing activities.

If you are interested, please submit a CV and cover letter, stating background, intended research, and salary requirements, to https://jobs.princeton.edu.

Princeton University is an equal opportunity employer and complies with applicable EEO and affirmative action regulations. For information about applying to Princeton and voluntarily self-identifying, please see http://www.princeton.edu/dof/about_us/dof_job_openings/

Deadline: March 1, 2010.

iPad to Test Zittrain's "Future of the Internet" Thesis

Jonathan Zittrain famously argued in his book “The Future of the Internet, and How to Stop It” that we were headed for a future in which general purpose computers would be replaced by locked-down computing appliances.

Apple’s new iPad will put Zittrain’s thesis to the test. The iPad, as announced, has aspects of both an appliance and a general purpose computer. (Zittrain would say “generative”, but I’ll stick with the standard computer science term “general purpose”.) Will the appliance side kill the general-purpose side?

The iPad is an appliance in the sense that it runs applications from Apple’s App Store. The App Store is a “walled garden” containing only apps that have been approved by Apple. Apple has systematically refused to approve certain types of apps, and it has subjected apps to a vetting process that can be slow and mystifying. To the extent that Apple refuses broad categories of apps, this is an appliance approach to computing.

On the other hand, the iPad has a web browser. Modern browsers have become general-purpose platforms for delivering a broad class of applications. Pair a Bluetooth keyboard to your iPad, fire up the browser, and you have a fancy netbook — a general-purpose device that can run applications of any type.

For the iPad to become a Zittrain-type appliance, two things must happen. First, Apple must remain picky about which apps are available in the App Store. Second, Apple must limit the device’s browser so that it lacks the features that make today’s browsers viable application platforms. Will Apple be able to limit its product in this way, despite competition from other, more general-purpose tablets? I doubt it.

But even this (even an appliance-style iPad) would not be enough to prove Zittrain’s thesis. Zittrain argued not just that appliances would exist, but that they would replace general-purpose computers. Amazon’s Kindle is an appliance, but it doesn’t prove Zittrain’s thesis because nobody is ditching their laptop in favor of a Kindle. Instead, the Kindle is an extra device which is used for its purpose, while the general-purpose device is used for everything else. If the iPad ends up like the Kindle, a complement to the laptop or netbook rather than a replacement for it, this will not prove Zittrain’s thesis.

It seems unlikely, then, that the iPad, even if it succeeds, will provide strong support for Zittrain’s thesis. General-purpose computers are so useful that we’re not likely to abandon them.

UPDATE: A few minutes after posting this, I saw that Zittrain had published his own take on this question.

Census of Files Available via BitTorrent

BitTorrent is popular because it lets anyone distribute large files at low cost. Which kinds of files are available on BitTorrent? Sauhard Sahi, a Princeton senior, decided to find out. Sauhard’s independent work last semester, under my supervision, set out to measure what was available on BitTorrent. This post, summarizing his results, was co-written by Sauhard and me.

Sauhard chose a (uniform) random sample of files available via the trackerless variant of BitTorrent, using the Mainline DHT. The sample comprised 1021 files. He classified the files in the sample by file type, language, and apparent copyright status.

Before describing the results, we need to offer two caveats. First, the results apply only to the Mainline trackerless BitTorrent system that we surveyed. Other parts of the BitTorrent ecosystem might be different. Second, all files that were available were equally likely to appear in the sample — the sample was not weighted by number of downloads, and it probably contains files that were never downloaded at all. So we can’t say anything about the characteristics of BitTorrent downloads, or even of files that are downloaded via BitTorrent, only about files that are available on BitTorrent.

With that out of the way, here’s what Sauhard found.

File types

46% movies and shows (non-pornographic)
14% games and software
14% pornography
10% music
1% books and guides
1% images
14% could not classify

Movies/Shows

For the movies and shows category, the predominant file format was AVI, and other formats included RMVB (a proprietary format for RealPlayer), MPEG, raw DVD, and some multi-part RAR archives. Interestingly, this section was heavily biased towards recent movies, instead of being spread out evenly over a number of years. In descending order of frequency, we found that 60% of the randomly selected movies and shows were in English, 8% were in Spanish, 7% were in Russian, 5% were in Polish, 5% were in Japanese, 4% were in Chinese, 4% could not be determined, 3% were in French, 1% were in Italian, and other infrequent languages accounted for 2% of the distribution.

Games/Software

For the games and software category, there was no clearly dominant file type, but common file types for software included ISO disc images, multi-part RAR archives, and EXE (Windows executables). The games were targeted at different platforms, such as the Xbox 360, Nintendo Wii, and Windows PCs. In descending order, we found that 74% of games and software in the sample were in English, 12% were in Japanese, 5% were in Spanish, 4% were in Chinese, 2% were in Polish, and 1% each were in Russian and French.

Pornography

For the pornography category, the predominant encoding format was AVI, as in the movies category. However, there were significantly more MPG and WMV (Windows Media Video) files available. Most pornography torrents included the full pornographic video, a sample of the video (a 1-5 minute extract), and posters or images of the porn stars in JPEG format. Because these videos are not typically dated the way movies are, it is difficult to say anything about recency bias for pornographic torrents. Our assumption would be that demand for pornography is not as time-sensitive as demand for movies, so it is likely that these pornographic videos span a broader range of years than the movies do. In descending order, we found that 53% of pornography in our sample was in English, 16% was in Chinese, 15% was in Japanese, 6% was in Russian, 3% was in German, 2% was in French, 2% was unclassifiable, and Italian, Hindi, and Spanish appeared infrequently (1% each).

Music

For the music category, the predominant encoding format was MP3; there were some albums ripped to WMA (Windows Media Audio, a Microsoft codec), and there were also ISO images and multi-part RAR archives. There is still a bias towards recent albums and songs, but it is not as strongly evident as it is for movies, perhaps because people are more willing to continue seeding music even after it is no longer new, so these torrents are able to stay alive longer in the DHT. In descending order, we found that 78% of music torrents in our sample were in English, 6% were in Russian, 4% were in Spanish, 2% each were in Japanese and Chinese, and other infrequent languages appeared at 1% each.

Books/Guides

The books/guides and images categories were fairly minor. We classified 15 torrents under books and guides: 13 were in English, 1 was in French, and 1 was in Russian. We classified 3 image torrents: one was a set of national park wallpapers, one was a set of pictures of BMW cars (both of these were in English), and one was a Japanese comic strip.

Apparent Copyright Infringement

Our final assessment involved determining whether or not each file seemed likely to be copyright-infringing. We classified a file as likely non-infringing if it appeared to be (1) in the public domain, (2) freely available through legitimate channels, or (3) user-generated content. These were judgment calls on our part, based on the contents of the files, together with some external research.

By this definition, all of the 476 movies or TV shows in the sample were found to be likely infringing. We found seven of the 148 files in the games and software category to be likely non-infringing—including two Linux distributions, free plug-in packs for games, as well as free and beta software. In the pornography category, one of the 145 files claimed to be an amateur video, and we gave it the benefit of the doubt as likely non-infringing. All of the 98 music torrents were likely infringing. Two of the fifteen files in the books/guides category seemed to be likely non-infringing.

Overall, we classified ten of the 1021 files, or approximately 1%, as likely non-infringing. This result should be interpreted with caution, as we may have missed some non-infringing files, and our sample is of files available, not files actually downloaded. Still, the result suggests strongly that copyright infringement is widespread among BitTorrent users.
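For readers who want to check the arithmetic, the short Python sketch below re-adds the per-category counts reported above. The dictionary layout and variable names are ours, chosen for illustration. The image torrents and the unclassified files are left out because no separate likely non-infringing count is reported for them.

```python
# (files in category, files judged likely non-infringing), as reported above
counts = {
    "movies and shows": (476, 0),
    "games and software": (148, 7),
    "pornography": (145, 1),
    "music": (98, 0),
    "books and guides": (15, 2),
}
SAMPLE_SIZE = 1021  # total files in the random sample

likely_non_infringing = sum(ok for _total, ok in counts.values())
share = likely_non_infringing / SAMPLE_SIZE

# prints: 10 of 1021 files = 0.98%
print(f"{likely_non_infringing} of {SAMPLE_SIZE} files = {share:.2%}")
```

The seven, one, and two files from the three categories with any likely non-infringing content add up to the ten reported overall, which is just under one percent of the sample.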

A Free Internet, If We Can Keep It

“We stand for a single internet where all of humanity has equal access to knowledge and ideas. And we recognize that the world’s information infrastructure will become what we and others make of it.”

These two sentences, from Secretary of State Clinton’s groundbreaking speech on Internet freedom, sum up beautifully the challenge facing our Internet policy. An open Internet can advance our values and support our interests; but we will only get there if we make some difficult choices now.

One of these choices relates to anonymity. Will it be easy to speak anonymously on the Internet, or not? This was the subject of the first question in the post-speech Q&A:

QUESTION: You talked about anonymity on line and how we have to prevent that. But you also talk about censorship by governments. And I’m struck by – having a veil of anonymity in certain situations is actually quite beneficial. So are you looking to strike a balance between that and this emphasis on censorship?

SECRETARY CLINTON: Absolutely. I mean, this is one of the challenges we face. On the one hand, anonymity protects the exploitation of children. And on the other hand, anonymity protects the free expression of opposition to repressive governments. Anonymity allows the theft of intellectual property, but anonymity also permits people to come together in settings that gives them some basis for free expression without identifying themselves.

None of this will be easy. I think that’s a fair statement. I think, as I said, we all have varying needs and rights and responsibilities. But I think these overriding principles should be our guiding light. We should err on the side of openness and do everything possible to create that, recognizing, as with any rule or any statement of principle, there are going to be exceptions.

So how we go after this, I think, is now what we’re requesting many of you who are experts in this area to lend your help to us in doing. We need the guidance of technology experts. In my experience, most of them are younger than 40, but not all are younger than 40. And we need the companies that do this, and we need the dissident voices who have actually lived on the front lines so that we can try to work through the best way to make that balance you referred to.

Secretary Clinton’s answer is trying to balance competing interests, which is what good politicians do. If we want A, and we want B, and A is in tension with B, can we have some A and some B together? Is there some way to give up a little A in exchange for a lot of B? That’s a useful way to start the discussion.

But sometimes you have to choose — sometimes A and B are profoundly incompatible. That seems to be the case here. Consider the position of a repressive government that wants to spy on a citizen’s political speech, as compared to the position of the U.S. government when it wants to eavesdrop on a suspect’s conversations under a valid search warrant. The two positions are very different morally, but they are pretty much the same technologically. Which means that either both governments can eavesdrop, or neither can. We have to choose.

Secretary Clinton saw this tension, and, being a lawyer, she saw that law could not resolve it. So she expressed the hope that technology, the aspect she understood least, would offer a solution. This is a common pattern: Given a difficult technology policy problem, lawyers will tend to seek technology solutions and technologists will tend to seek legal solutions. (Paul Ohm calls this “Felten’s Third Law”.) It’s easy to reject non-solutions in your own area because you have the knowledge to recognize why they will fail; but there must be a solution lurking somewhere in the unexplored wilderness of the other area.

If we’re forced to choose — and we will be — what kind of Internet will we have? In Secretary Clinton’s words, “the world’s information infrastructure will become what we and others make of it.” We’ll have a free Internet, if we can keep it.