November 21, 2024

Identifying John Doe: It might be easier than you think

Imagine that you want to sue someone for what they wrote, anonymously, in a web-based online forum. To succeed, you’ll first have to figure out who they really are. How hard is that task? It’s a question that Harlan Yu, Ed Felten, and I have been kicking around for several months. We’ve come to some tentative answers that surprised us, and that may surprise you.

Until recently, I thought the picture was very grim for would-be plaintiffs, writing that it should be simple for “even a non-technical Internet user to engage in effectively untraceable speech online.” I still think it’s feasible for most users, if they make enough effort, to remain anonymous despite any level of scrutiny they are practically likely to face. But in recent months, as Harlan, Ed, and I have discussed this issue, we’ve started to see a flip side to the coin: In many situations, it may be far easier to unmask apparently anonymous online speakers than they, I, or many others in the policy community have appreciated. Today, I’ll tell a story that helps explain what I mean.

Anonymous online speech is a mixed bag: it includes some high-value speech such as political dissent in repressive regimes, some dreck we happily tolerate on First Amendment grounds, and some material that violates the laws of many jurisdictions, including child pornography and defamatory speech. For purposes of this discussion, let’s focus on cases like the recent AutoAdmit controversy, in which a plaintiff wishes to bring a defamation suit against an anonymous or pseudonymous poster to a web-based discussion forum. I’ll assume, as in the AutoAdmit suit, that the plaintiff has at least a facially plausible legal claim, so that if everyone’s identity were clear, it would also be clear that the plaintiff could bring a defamation suit. In the online context, these are usually called “John Doe” suits, because the plaintiff’s lawyer does not know the name of the defendant and must use “John Doe” as a stand-in. After filing a John Doe suit, the plaintiff’s lawyer can use subpoenas to force third parties to reveal information that might help identify the John Doe defendant.

In situations like these, if a plaintiff’s lawyer cannot otherwise determine who the poster is, the lawyer will typically subpoena the forum web site, seeking the IP address of the anonymous poster. Many widely used web-based discussion systems, including the popular WordPress blogging platform, routinely log the IP addresses of commenters. If the web site is able to provide an IP address for the source of the allegedly defamatory comment, the lawyer will do a reverse lookup, a WHOIS search, or both, on that IP address, hoping to discover that it belongs to a residential ISP or another organization that maintains detailed information about its individual users. If the IP address does turn out to correspond to a residential ISP — rather than, say, to an open wifi hub at a coffee shop or library — then the lawyer will issue a second subpoena, asking the ISP to reveal the account details of the user who held that IP address at the time it was used to transmit the potentially defamatory comment. This is known as a “subpoena chain” because it involves two subpoenas (one to the web site, and a second one, based on the results of the first, to the ISP).
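To make the first step of that chain concrete, here is a minimal sketch, in TypeScript for Node, of the kind of reverse lookup a technical consultant might run on a logged IP address. The sample address is drawn from a documentation range and is purely hypothetical; the point is only that the hostname returned for a residential connection often embeds the ISP’s domain, telling the lawyer whom to subpoena next.

```typescript
// Sketch: reverse-DNS lookup on a logged IP address.
// A residential ISP's hostname usually reveals the ISP itself
// (e.g. something.comcast.net). The sample address below is from a
// documentation range and is purely illustrative.
import { promises as dns } from "dns";

async function describeIp(ip: string): Promise<void> {
  try {
    const hostnames = await dns.reverse(ip); // PTR record lookup
    console.log(`${ip} resolves to: ${hostnames.join(", ")}`);
  } catch (err) {
    // Many addresses (some coffee-shop or mobile gateways, for instance)
    // have no PTR record at all, which is itself useful information.
    console.log(`${ip} has no reverse-DNS entry (${(err as Error).message})`);
  }
}

describeIp("203.0.113.42"); // hypothetical address, documentation range
```

A WHOIS query on the same address (for example, with the standard command-line whois tool or a regional registry’s web interface) would separately report the organization to which the address block is registered.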

Of course, in many cases, this method won’t work. The forum web site may not have logged the commenter’s IP address. Or, even if an address is available, it might not be readily traceable back to an ISP account: the anonymous commenter may have been using an anonymization tool like Tor to hide his address. Or he may have been coming online from a coffee shop or similarly public place (which typically will not have logged information about its transient users). Or, even if he reached the web forum directly from his own ISP, that ISP might be located in a foreign jurisdiction, beyond the reach of an American lawyer’s usual legal tools.

Is this a dead end for the plaintiff’s lawyer, who wants to identify John Doe? Probably not. A range of other parties, not yet part of our story, might have information that could help identify John Doe. When it comes to the AutoAdmit site, one of these parties is StatCounter.com, a web traffic measurement service that AutoAdmit uses to keep track of trends in its traffic over time.

At the moment I am writing this post, anyone can verify that AutoAdmit uses StatCounter by visiting AutoAdmit.com and choosing “View Source” from the web browser menu. The first screenful of web page code that comes up includes a block of text helpfully labeled “StatCounter Code,” which in turn runs a small piece of JavaScript that places a personalized StatCounter cookie on the machine of every user who visits AutoAdmit, or else (if one is already present) detects and records exactly which cookie it is. That’s how StatCounter can tell which visitors to AutoAdmit.com are new, which ones are returning, and which pages on the site are of greatest interest to new and returning users. StatCounter is in a position to track not only each user, but also each page, and each visit by a user to a certain page, over time. This includes not only the home page, but also the particular web page for each discussion “thread” on the site. Moreover, each post (even if anonymous) is marked with the time it was posted, down to the minute.

So the plaintiff’s lawyer in our story could go to StatCounter and ask only about visits to the particular thread where the relevant message was posted. If the post went up at 6:03 p.m. on a certain date, the lawyer could ask StatCounter, “What, if anything, do you know about the person who visited this web page at 6:03 p.m. on this date?” Of course, if John Doe’s browser is configured to refuse cookies, he wouldn’t be trackable. But most web-based discussion sites, including AutoAdmit, rely on cookies to let people log in to their pseudonymous accounts in order to post comments in the first place. In any case, the web is a much less convenient place without cookies, and as a practical matter most users do allow them.
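To see the mechanics, here is a minimal sketch, in TypeScript, of what a third-party analytics snippet of this general kind does in the browser. This is not StatCounter’s actual code; the cookie name, endpoint URL, and parameters are hypothetical. But the pattern — set or read a persistent identifier, then report it along with the page and time to the analytics server — is the standard one.

```typescript
// Illustrative sketch of a generic third-party analytics tag.
// Cookie name and endpoint are hypothetical, not StatCounter's real code.
const COOKIE_NAME = "visitor_id";

function getOrSetVisitorId(): string {
  // Look for an existing identifier in the cookie jar.
  const match = document.cookie.match(new RegExp(`(?:^|; )${COOKIE_NAME}=([^;]*)`));
  if (match) return match[1];

  // First visit: mint a random identifier and persist it for two years.
  const id = Math.random().toString(36).slice(2) + Date.now().toString(36);
  document.cookie = `${COOKIE_NAME}=${id}; max-age=${60 * 60 * 24 * 730}; path=/`;
  return id;
}

function reportVisit(): void {
  const params = new URLSearchParams({
    id: getOrSetVisitorId(),      // persistent per-browser identifier
    page: location.href,          // which thread or page was viewed
    ref: document.referrer,       // where the visitor came from
    t: new Date().toISOString(),  // when the visit happened
  });
  // A tiny "tracking pixel" request delivers the data to the analytics host.
  new Image().src = `https://analytics.example.com/hit?${params.toString()}`;
}

reportVisit();
```

Because the same identifier is sent on every page view, the analytics provider can tie all of a browser’s visits to the site together over time.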

In fact, the lawyer may be able to do better still: The anonymous commenter will have accessed the page at least twice — once to view the discussion as it stood before he took part, and again after clicking the button to add his own post to the mix. If StatCounter recorded both visits, as it very likely would have, then it becomes even easier to tie the anonymous commenter to his StatCounter cookie (and to whatever browsing history StatCounter has associated with that cookie).
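Here is a sketch, again in TypeScript, of the correlation that pair of visits makes possible. The record format and the five-minute window are assumptions for illustration; the idea is simply that a cookie seen on the thread’s page both shortly before and shortly after the post time is a very strong candidate for the poster’s.

```typescript
// Sketch: narrowing down which analytics cookie belongs to the poster.
// VisitRecord and the five-minute window are illustrative assumptions.
interface VisitRecord {
  cookieId: string;   // persistent identifier set by the analytics service
  page: string;       // URL of the page viewed
  time: Date;         // time of the page view
}

function likelyPosterCookies(
  visits: VisitRecord[],
  threadUrl: string,
  postTime: Date,
  windowMs: number = 5 * 60 * 1000, // look five minutes either side of the post
): string[] {
  const before = new Set<string>();
  const after = new Set<string>();

  for (const v of visits) {
    if (v.page !== threadUrl) continue;
    const delta = v.time.getTime() - postTime.getTime();
    if (delta < 0 && delta >= -windowMs) before.add(v.cookieId);
    if (delta >= 0 && delta <= windowMs) after.add(v.cookieId);
  }

  // A cookie seen both just before and just after the post is the best candidate.
  return [...before].filter((id) => after.has(id));
}
```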

There are a huge number of things to discuss here, and we’ll tackle several in the coming days. What would a web analytics provider like StatCounter know? Likely answers include IP addresses, times, and durations for the anonymous commenter’s previous visits to AutoAdmit. What about other, similar services, used by other sites? What about “beacons” that simply and silently collect data about users, and pay webmasters for the privilege? What about behavioral advertisers, whose business model involves tracking users across multiple sites and developing knowledge of their browsing habits and interests? What about content distribution networks? How would this picture change if John Doe were taking affirmative steps, such as using Tor, to obfuscate his identity?

These are some of the questions that we’ll try to address in future posts.

The Markey Net Neutrality Bill: Least Restrictive Network Management?

It’s an exciting time in the net neutrality debate. FCC Chairman Julius Genachowski’s speech on Monday promised a new FCC proceeding that will aim to create a formal rule to replace the Commission’s existing policy statement.

Meanwhile, net neutrality advocates in Congress are pondering new legislation for two reasons: First, there is a debate about whether the FCC currently has enough authority to enforce a net neutrality rule. Second, regardless of whether the Commission has such authority today or doesn’t, some would rather see net neutrality rules etched into statute than leave them to the uncertainties of the rulemaking process under this and future Commissions.

One legislative proposal comes from Rep. Ed Markey and colleagues. The bill, called the Internet Freedom Preservation Act of 2009, is available in its current draft on the Free Press web site.

I favor the broad goals that motivate this bill — an Internet that remains friendly to innovation and broadly available. But I personally believe the current draft of this bill would be a mistake, because it embodies a very optimistic view of the FCC’s ability to wield regulatory authority and avoid regulatory capture, not only under the current administration but also over the long-run future. It puts a huge amount of statutory weight behind the vague-till-now idea of “reasonable network management” — something that the FCC’s policy statement (and many participants in the debate) have said ISPs should be permitted to do, but whose meaning remains unsettled. Indeed, Ed raised questions back in 2006 about just how hard it might be to decide what this phrase should mean.

The section of the Markey bill that would be codified as section 12(d) says that a network management practice

. . . is a reasonable practice only if it furthers a critically important interest, is narrowly tailored to further that interest, and is the means of furthering that interest that is the least restrictive, least discriminatory, and least constricting of consumer choice available.

This language — particularly the trio of “leasts” — puts the FCC in a position to intervene if, in the Commission’s judgment, any alternative course of action would have been better for consumers than the one an ISP actually took. Normally, to call something “reasonable” means that it is within the broad range of possibilities that might make sense to an imagined “reasonable person.” This bill’s definition of “reasonable” is very different, since on its terms there is no scope for discretion within reasonableness — the single best option is the only one deemed reasonable by the statute.

The bill’s language may sound familiar — it is a modified form of the judicial “strict scrutiny” standard that courts use to review government action when the state employs a suspect classification (such as race) or burdens a fundamental right (such as free speech in certain contexts). In those cases, the question is whether a “compelling governmental interest” justifies the policy under review. Here, however, it’s not at all clear whose interest, in what, must be compelling in order for a given network management practice to count as reasonable. We are discussing the actions of ISPs, which are generally public companies. Do their interests in profit maximization count as compelling? Shareholders certainly think so. What about their interests in R&D? Or does the statute mean to single out the public’s interest in the general goods outlined in section 12(a), such as “protect[ing] the open and interconnected nature of broadband networks”?

I fear the bill would spur a food fight among ISPs, each of whom could complain about what the others were doing. Such a battle would raise the probability that those ISPs with the most effective lobbying shops will prevail over those with the most attractive offerings for consumers, if and when the two diverge.

Why use the phrase “reasonable network management” to describe this exacting standard? I think the most likely answer is simply that many participants in the net neutrality debate use the phrase as a shorthand term for whatever should be allowed — so that “reasonable” turns out to mean “permitted.”

There is also an interesting secondary conversation to be had here about whether it’s smart to bar in statute, as the Markey bill would, “. . .any offering that. . . prioritizes traffic over that of other such providers,” which could be read to bar evenhanded offers of prioritized packet routing to any customer who wants to pay a premium, something many net neutrality advocates (including, e.g., Prof. Lessig) have said they think is fine.

My bottom line is that we ought to speak clearly. It might or might not make sense to let the FCC intervene whenever it finds ISPs’ network management to be less than perfect (I think it would not, but recognize the question is debatable). But whatever its merits, a standard like that — removing ISP discretion — deserves a name of its own. Perhaps “least restrictive network management”?

Cross-posted at the Yale ISP Blog.

Open Government Data: Starting to Judge the Results

Like many others who read this blog, I’ve spent some time over the last year trying to get more civic data online. I’ve argued that government’s failure to put machine-readable data online is the key roadblock that separates us from a world in which exciting, Web 2.0 style technologies enrich nearly every aspect of civic life. This is an empirical claim, and as more government data comes online, it is being tested.

Jay Nath is the “manager of innovation” for the City and County of San Francisco, working to put municipal data online and build a community of developers who can make the most of it. In a couple of recent blog posts, he has considered the empirical state of government data publishing efforts. Drawing on data from Washington, DC, where officials led by then-city CTO Vivek Kundra have put a huge catalog of government data online, he analyzed usage statistics and found an 80/20 pattern of public use of online government data — enormous interest in crime statistics and 311-style service requests, but relatively little in housing code enforcement and almost none in city workers’ use of purchasing credit cards. Here’s the chart he made:

Note that this chart measures downloads, not traffic to downstream sites that may be reusing the data.
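For readers who want to check whether their own city’s catalog shows a similar concentration, here is a small sketch, in TypeScript, of the underlying calculation. The dataset names and download counts are invented placeholders, not DC’s actual figures.

```typescript
// Sketch: how concentrated are downloads across a data catalog?
// The figures below are invented placeholders, not real DC statistics.
const downloads: Record<string, number> = {
  "crime-incidents": 52000,
  "311-service-requests": 31000,
  "building-permits": 4200,
  "housing-code-enforcement": 900,
  "purchase-card-transactions": 40,
};

const counts = Object.values(downloads).sort((a, b) => b - a);
const total = counts.reduce((sum, n) => sum + n, 0);

// Share of all downloads accounted for by the top 20% of datasets.
const topN = Math.max(1, Math.ceil(counts.length * 0.2));
const topShare = counts.slice(0, topN).reduce((s, n) => s + n, 0) / total;

console.log(`Top ${topN} dataset(s) account for ${(topShare * 100).toFixed(1)}% of downloads`);
```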

This analysis was part of a broader effort in San Francisco to begin measuring the return on investments in open government data. One simple measure, as many have remarked before, is the IT expenditure a government avoids when third-party innovators make it unnecessary for it to provide certain services or make certain investments. But this misses what seems, intuitively, to be the lion’s share of the benefit: new value that didn’t exist before, created by functionality that third-party innovators deliver but government would not have built. Another approach is to measure government responsiveness before and after effectiveness data begin to be published. Unfortunately, such measures are unlikely to be controlled — if services get worse, for example, it may have more to do with budget cuts than with any victory, or failure, of citizen monitoring.

Open government data advocates and activists have allies on the inside in a growing number of governmental contexts, from city hall to the White House. But for these allies to be successful, they will need to be able to point to concrete results — sooner and more urgently in the current economic climate than they might have otherwise. This holds a clear lesson for the activists: small, tangible steps that turn published government data into cost savings, measurable service improvements, or other concrete goods will “punch above their weight”: not only are they valuable in their own right, but they also help favorably disposed civil servants make the case internally for more transparency and disclosure. Beyond aiming for perfection and thinking about the long run, the volunteer community would benefit from seeking low-hanging fruit that will prove the concept of open government data and justify further investment.

The rise of the “nanostory”

In today’s Wall Street Journal, I offer a review of Bill Wasik’s excellent new book, And Then There’s This: How Stories Live and Die in Viral Culture. Cliff’s notes version: This is a great new take on the little cultural boomlets and cryptic fads that seem to swarm all over the Internet. The author draws on his personal experience, including his creation of the still-hilarious Right Wing New York Times. Here’s a taste from the book itself—Wasik describing his decision to create the first flash mob:

It was out of the question to create a project that might last, some new institution or some great work of art, for these would take time, exact cost, require risk, even as their odds of success hovered at nearly zero. Meanwhile, the odds of creating a short-lived sensation, of attracting incredible attention for a very brief period of time, were far more promising indeed… I wanted my new project to be what someone would call “The X of the Summer” before I even contemplated exactly what X might be.

Recovery Act Spending: Getting to the Bottom Line

Under most circumstances, government spending is slow and deliberate—a key fact that helps reduce the chances of waste and fraud. But the recently passed Recovery Act is a special case: spending the money quickly is understood to be essential to the success of the Act. We all know that shoppers in a hurry tend to get less value for their money. But, ironically, the overall macroeconomic impact of the stimulus (and hence the average stimulative effect per dollar spent) may be maximized by quick spending, even if the speed premium does increase the total amount of waste and abuse.

This situation creates a paradox for transparency and oversight efforts. On the one hand, the quicker pace of spending makes it all the more important to provide for public scrutiny, and to provide information in ways that will rapidly enable as many people as possible to take advantage of the stimulus opportunities available to them. On the other, the same rush that makes transparency important also reduces the time available for those within government to design and build an infrastructure for stimulus transparency.

One of the troubling tradeoffs that has been made thus far involves information about stimulus funds that flow from the federal government to states and then from states to localities. This pattern is rarer than you might think, since much of the Recovery Act spending flows more directly from federal agencies to end recipients. But for funds that do follow a path from federal to state to local officials, recent guidance issued on April 3 by the Office of Management and Budget (OMB) makes clear that the federal reporting infrastructure being created for Recovery.gov will not collect information about what the localities ultimately do with the funds.

OMB says that it does have the legal authority to require detailed reporting on “all levels of subawards,” reaching end recipients (Acme Concrete, or whoever gets a contract or grant from the municipality at the end of the governmental chain). But in the context of its sprint to get at least some system in place as soon as possible (with the debut date for the Recovery.gov system already pushed back to October), OMB has left this deep-level reporting out of its immediate plans. The office says that it “plans to expand the reporting model in the future to also obtain this information, once the system capabilities and processes have been established.”
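To see what “all levels of subawards” means as a data-collection problem, here is a minimal sketch, in TypeScript, of a recursive award record. The field names and sample figures are hypothetical and are not OMB’s actual reporting schema; the point is simply that each additional reporting level reveals another link in the federal-to-state-to-local-to-vendor chain.

```typescript
// Illustrative sketch only: hypothetical fields, not OMB's real schema.
interface AwardRecord {
  recipient: string;          // e.g. a state agency, a city, or a contractor
  amount: number;             // dollars obligated at this level
  description: string;        // what the money is for
  subawards: AwardRecord[];   // the next level down the chain, if reported
}

// Recipients that a reporting system capturing `maxDepth` levels would reveal.
function visibleRecipients(award: AwardRecord, maxDepth: number): string[] {
  if (maxDepth <= 0) return [];
  return [
    award.recipient,
    ...award.subawards.flatMap((sub) => visibleRecipients(sub, maxDepth - 1)),
  ];
}

// Hypothetical example of a federal-to-state-to-local-to-vendor chain.
const grant: AwardRecord = {
  recipient: "State Department of Transportation",
  amount: 10_000_000,
  description: "Highway resurfacing grant",
  subawards: [{
    recipient: "City Public Works Department",
    amount: 2_500_000,
    description: "Local road repairs",
    subawards: [{
      recipient: "Acme Concrete",
      amount: 1_800_000,
      description: "Paving contract",
      subawards: [],
    }],
  }],
};

console.log(visibleRecipients(grant, 1)); // first-hop reporting: the state only
console.log(visibleRecipients(grant, 3)); // deep reporting: down to Acme Concrete
```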

On Monday, ten congressmen sent a letter to OMB urging it to collect this detailed information “as early as possible.” One reason for OMB to formulate detailed operational plans in this area, as I argued in recent testimony before the House Committee on Oversight and Government Reform, is that clarity from the top will help states make competent choices about what, if anything, they should do to support or supplement the federal reporting. As the members of Congress write:

While it is positive that OMB goes on to reserve the right in the guidance to expand this reporting model in the future, it would seem exercising this right and requiring this level of reporting as early as possible would help entities prepare for the disclosures before projects begin and provide clarification for states as they begin investing in new infrastructure to track ARRA funds.

In the end, everyone agrees that this detailed information about subawards is important to have—OMB “plans to collect” it and the signatories to yesterday’s letter want collection to start “as early as possible.” But how soon is that? We don’t really know. The details of the hard choices facing OMB as it races to implement the Recovery.gov reporting system are themselves not public, and making them public might (or might not) itself slow down the development of the site. If no system were permitted to launch without fully detailed reporting of subawards, we might wait longer for the web site’s launch. How much longer? OMB might not itself be sure, since software development times are notoriously difficult to forecast, and OMB has never before been asked to build a system of this kind. OMB asserts that it’s moving as fast as it can to collect as much information as possible, and without slowing it down to ask for explanations, we can’t really check that assertion.

Transparency often reduces the degree to which citizens must trust public officials. But in this case, ironically, it seems most reasonable to operate on the optimistic but realistic assumption that the people working on Recovery Act transparency are doing their jobs well, and to hope for good results.