April 24, 2014


Web Certification Fail: Bad Assumptions Lead to Bad Technology

It should be abundantly clear, from two recent posts here, that the current model for certifying the identity of web sites is deeply flawed. When you connect to a web site, and your browser displays an https URL and a happy lock or key icon indicating a secure connection, the odds that you’re connecting to an impostor site, despite your browser’s best efforts, are uncomfortably high.

How did this happen? The last two posts unpacked some of the detailed problems with the current system. Today I want to explore the root cause: today’s system is based on wildly unrealistic assumptions about organizations and trust.

The theory behind the system is simple. Browser vendors will identify a set of Certificate Authorities (CAs) who are trusted to certify identities. Browsers will automatically accept any identity certificate issued by any of the trusted CAs.

The first step in making this system work is identifying some CA who is trusted by everybody in the world.

If that last sentence didn’t strike you as odd, go back and read it again. That’s right, the system assumes that there is some party who is trusted by everyone in the world — a spectacularly naive assumption.

Network engineers like to joke about the “evil bit”, a hypothetical label put on each network packet, indicating whether the packet is evil. (See RFC 3514, Steve Bellovin’s classic parody standards document codifying the evil bit. I’ve always loved that the official Internet standards series accepts parody standards.) Well, the “trusted bit” for certificate authorities is pretty much the same as the evil bit, only applied to organizations rather than network packets. Yet somehow we ended up with a design that relies on this “trusted bit”.

The reason, in part, is unclear thinking about institutional trust, abetted by the unclear language we often use in discussing trust online. For example, we tend to conflate two meanings of the word “trusted”. The first meaning of “trusted”, which is the everyday meaning, implies a judgment that a party is unlikely to misbehave. The second meaning of “trusted”, more common in military security settings, is a factual statement that someone is vulnerable to misbehavior by another. In an ideal world, we would make sure that someone was trusted in the first sense before they became trusted in the second sense; that is, we would make sure that someone was unlikely to misbehave before we made ourselves vulnerable to their misbehavior. This isn’t easy to do, and we will forget entirely to do it if we confuse the two meanings of trusted.

The second linguistic problem is to use the passive-voice construction “A is trusted to do X” rather than the active form “B trusts A to do X.” The first form is problematic because it doesn’t say who is doing the trusting. Consider these two statements: (A) “CNNIC is a trusted certificate authority.” (B) “Everyone trusts CNNIC to be a certificate authority.” The first statement might sound plausible, but the second is obviously false.

If you try to explain to yourself why the existing web certification system is sound, while avoiding the two errors above (confusing two senses of “trusted”, and failing to say who is doing the trusting), you’ll see pretty quickly that the argument for the current system is tenuous at best. You’ll see, too, that we can’t fix the system by using different cryptography — what we need are new institutional arrangements.


Web Security Trust Models

[This is part of a series of posts on this topic: 1, 2, 3, 4, 5, 6, 7, 8.]

Last week, Ed described the current debate over whether Mozilla should allow an organization that is allegedly controlled by the Chinese government to be a default trusted certificate authority. The post prompted some very insightful feedback, including questions about alternative trust models. I will try to lay out the different types of models on a high level, and I encourage corrections or clarifications. It’s worth re-stating that what we’re talking about is how you as a web user know that who you are talking to is who they claim to be (if they are, then you can be confident that your other security measures like end-to-end encryption are working).

Flat and Inflexible
This is the model we use now. Your browser comes pre-loaded with a list of Certificate Authorities that it will trust to guarantee the authenticity of web sites you visit. For instance, Mozilla (represented by the little red dragon in the diagram) ships Firefox with a list of pre-approved CAs. Each browser vendor makes its own list (here is Mozilla’s policy for how to get added). The other major browsers use the same model and have themselves already allowed CNNIC to become trusted for their users. This is a flat model because each CA has just as much authority as the others, thus each effectively sits at the “root” of authority. Indeed any of the CAs can sign certificates for any entity in the world (hence the asterisk in each). They do not coordinate with each other, and can sign a certificate for an entity even if another CA has already done so. Furthermore, they can confer this god-like power on other entities without oversight or the prior knowledge of the end users or the entities being signed for.

This is also an inflexible model because there is no reasonable way to impose finer-grained control on the authority of the CAs. The standard used is called X.509. It doesn’t allow you to trust Verisign to a greater or lesser extent than the Chinese government — it is essentially all or nothing for each. You also can’t tell your browser to trust CNNIC only for sites in China (although domain name constraints do exist in the standard, they are not widely implemented). It is also inflexible because most browsers intentionally make it difficult for a user to change the certificate list. It might be possible to partially mitigate some of the CA/X.509 shortcomings by implementing more constraints, improving the user interface, adding “out of band” certificate checks (like Perspectives), or generating more paranoid certificate warnings (like Certificate Patrol).
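To make the difference between the flat model and a constrained one concrete, here is a minimal sketch in Python. It is a toy model, not how X.509 actually encodes name constraints: the trust stores, the CA name "ExampleCA", and the `accepts` function are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    subject: str  # domain name the certificate vouches for
    issuer: str   # name of the CA that signed it


# Hypothetical trust stores: CA name -> permitted domain suffix.
# None means "unconstrained", which is how the flat model treats every CA.
FLAT_STORE = {"ExampleCA": None, "CNNIC": None}
CONSTRAINED_STORE = {"ExampleCA": None, "CNNIC": ".cn"}


def accepts(cert: Cert, store: dict) -> bool:
    """Return True if the store trusts cert.issuer to vouch for cert.subject."""
    if cert.issuer not in store:
        return False  # unknown CA: never accepted
    suffix = store[cert.issuer]
    if suffix is None:
        return True   # flat model: any trusted CA may sign for any name
    return cert.subject.endswith(suffix)


forged = Cert(subject="mail.google.com", issuer="CNNIC")
print(accepts(forged, FLAT_STORE))         # True
print(accepts(forged, CONSTRAINED_STORE))  # False
```

In the flat store the certificate for mail.google.com is accepted simply because its issuer is on the list. Only the constrained store, which browsers do not generally offer, turns it away.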

Decentralized and Dependent
In the early days of the web, an alternative approach already existed. This model did away entirely with a default set of external trusted entities and gave complete control to the individual. The idea was that you would start by trusting only people you “knew” (smiley faces in the diagram) to begin to build a “web of trust.” You then extend this web by trusting those people to vouch for others that you haven’t met (kind of like a secure virtual version of Goodfellas). This makes it a fundamentally decentralized model. There is nothing limiting certain entities from gaining the trust of many people and therefore becoming de facto Certificate Authorities. This has only happened within technically proficient communities, and in the case of USENIX they eventually discontinued the service.

So, this is a system that is highly dependent on having some connection with whoever you want to communicate with. It has enjoyed some limited success via the PGP family of standards, but mostly for applications such as email or in more constrained situations like inter/intra-enterprise security. It is possible that with the boom in online social networks there is a new opportunity to renew interest in a web-of-trust style security architecture. The approach seems less practical for general web security because it requires the user to have some existing trust relationship with a site before using it securely. It is not necessarily an impossible approach (the mod_openpgp and mod_gnutls projects show some technical promise), but as a practical matter wide-scale adoption of a “web of trust” style security model for the web seems unlikely.
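The web-of-trust idea can be sketched as a search through a graph of who vouches for whom. The following Python is only an illustration under my own assumptions: the names, the `VOUCHES` graph, and the hop limit are hypothetical, and real PGP trust calculations are more involved than a simple path search.

```python
from collections import deque

# Hypothetical graph: each person lists the people they vouch for.
VOUCHES = {
    "me": ["alice", "bob"],
    "alice": ["carol"],
    "bob": [],
    "carol": ["dave"],
}


def trust_path(start, target, max_hops):
    """Breadth-first search for a chain of vouchers from start to target.

    Returns the shortest chain as a list of names, or None if no chain
    of at most max_hops links exists.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        if len(path) - 1 >= max_hops:
            continue  # chain is already as long as we are willing to trust
        for nxt in VOUCHES.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(trust_path("me", "carol", max_hops=2))     # ['me', 'alice', 'carol']
print(trust_path("me", "stranger", max_hops=5))  # None
```

The second call shows the dependence described above: with no existing chain of acquaintances leading to a party, there is no basis for authenticating it at all.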

Hierarchical and Delegated
A third approach starts with a single highly trusted root and delegates authority recursively. Any authority can only issue certificates for itself or the entities that fall “underneath” it, thus limiting the god-like power of the flat model. This also pushes signing power closer to the authenticated sites themselves. It is possible that this authority could be placed directly in their hands, rather than requiring an external authority to approve of each new certificate or domain. Note that I am describing this in a very domain-centric way. If we are willing to fully buy into the domain hierarchy way of thinking about web security, there may be a viable implementation path for this model.

Perhaps the greatest example of this delegation approach to web governance is the existing Domain Name System. Decisions at the root of DNS are governed by the international non-profit ICANN, which assigns authority to Top Level Domains (e.g., .com, .net, .cn), which then further delegate through a system of registrars. The biggest problem with tying site authentication to DNS is that DNS is deeply insecure. However, within the next year a more secure version of DNS, DNSSEC, is scheduled to be deployed at the DNS root. Any DNSSEC query can be verified by following the chain of authority back to the root, and any contents of the response can be guaranteed to be unaltered through that chain of trust. The question is whether this infrastructure can be the basis for distributing site certificates as well, which could form the basis for hierarchical site authenticity (which would also permit encryption of traffic). CNNIC happens to also be the registry for the .cn TLD, so in this case it would be restricted to creating certificates for .cn domains. This approach is advocated by Dan Kaminsky (interview, presentation) and Paul Vixie (here, here). I’ve also found posts by Eric Rescorla and Jason Roysdon informative.
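The naming rule behind this model, that an authority may only vouch for names beneath itself, can be sketched in a few lines of Python. This is not DNSSEC validation, which verifies cryptographic signatures at each step; it only illustrates the containment check, and the function names are hypothetical.

```python
def is_within(zone: str, name: str) -> bool:
    """True if name equals zone or sits underneath it in the DNS hierarchy."""
    zone = zone.lower().strip(".")
    name = name.lower().strip(".")
    if zone == "":
        return True  # the root zone covers every name
    return name == zone or name.endswith("." + zone)


def chain_is_valid(chain) -> bool:
    """Check a delegation chain listed from the root down to the site.

    Each authority may only delegate to something beneath itself,
    e.g. ["", "cn", "example.cn"] (the empty string stands for the root).
    """
    return all(is_within(parent, child)
               for parent, child in zip(chain, chain[1:]))


print(chain_is_valid(["", "cn", "example.cn"]))       # True
print(chain_is_valid(["", "cn", "mail.google.com"]))  # False
```

Under this rule the .cn registry could vouch for example.cn but not for mail.google.com, which is exactly the restriction the flat model lacks.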

If implemented via DNSSEC, this approach would thoroughly bind web site authentication to the DNS hierarchy, and the only assurance it would provide is that you are communicating with the person who registered the domain you are visiting. It would not provide any additional verification about who that person is, as Certificate Authorities theoretically could do (but practically don’t). Certificates were originally envisioned as a way to guarantee that a particular real-world entity was behind the site in question, but market pressures caused CAs to cut corners on the verification process. Most CAs now offer “Domain Validation” (DV) certificates that are issued without any human intervention and simply verify that the person requesting the certificate has control of the domain in question. These certificates are treated no differently than more rigorously verified certificates, so for all intents and purposes the DNSSEC certificate delegation model would provide at least the services of the current CA model. One exception is Extended Validation certificates, which require the CA to perform more rigorous checks and cause the browser URL bar to take on a “green glow”. It should, however, be noted that there are some security flaws with the current implementation.

[Update: I discuss the DNSSEC approach in more detail here]

Open Questions
Are there appropriate stopgap measures on the existing CA model that can limit authority of certain political entities? Are there viable user interface improvements? Are users aware enough of these issues to do anything meaningful with more information about certificates? Does the hierarchical model force us to trust ICANN, and do we? Does the DNS hierarchy appropriately allocate authority? Is domain name enough of a proxy for identity that a DNS-based system makes sense? Do we need better ways of independently validating a person’s identity and binding that to their public key? Even if an alternative model is better, how do we motivate adoption?


Google Buzzkill

The launch of Google Buzz, the new social networking service tied to GMail, was a fiasco to say the least. Its default settings exposed people’s e-mail contacts in frightening ways with serious privacy and human rights implications. Evgeny Morozov, who specializes in analyzing how authoritarian regimes use the Internet, put it bluntly last Friday in a blog post: “If I were working for the Iranian or the Chinese government, I would immediately dispatch my Internet geek squads to check on Google Buzz accounts for political activists and see if they have any connections that were previously unknown to the government.”

According to the BBC, the Buzz development team bypassed Google’s standard trial and testing procedures in order to launch the product quickly. Apparently, the company only tested it internally with Google employees and failed to test the product with a more diverse range of users, who would have been more likely to bring up the issues that were so glaringly obvious after launch. Google has apologized and moved to correct the most egregious privacy flaws, though problems – including security issues – continue to be raised. PC World has a good overview of Buzz’s evolution since launch.

Meanwhile, damage has been done not only to Google’s reputation but also to an unknown number of users who found themselves and their contacts exposed in ways they did not choose or want. Exposing vulnerable users without their knowledge or choice even for a few hours can potentially have irreversible consequences. While Google is scoring some points around the tech policy world for reacting quickly and earnestly to the strident user outcry, the Electronic Privacy Information Center (EPIC) has filed an official complaint with the FTC, and Canada’s Privacy Commissioner has expressed disappointment and asked Google to explain itself. (UPDATE: A class complaint has been filed in San Jose, claiming that Google broke the law by sharing personal data without users’ consent.)

Earlier this week I asked people in my Twitter network how they’re feeling about Buzz after the fixes Google has made. Some are now reassured but others aren’t. Joe Hall wrote:

@rmack totally lost me for good.. I just can’t believe that they won’t do it again. It will have to be very useful/different to get me back

Some are leaving GMail altogether. Judson Dunn reported:

@rmack my boyfriend deleted his long time gmail account for good :(

I was so concerned about exposing people in my GMail network during the first week after launch that I stayed off Buzz entirely until Monday afternoon. By then I felt that the worst privacy problems had been fixed, and I understood well enough how to tweak the settings that I could at least go in without doing harm to others. After playing with it a bit and poking around I posted some initial observations and invited the people in my network to respond. There are still plenty of issues – some people who claimed in Twitter that they had turned off Buzz are still there, and I think Buzz should make it easier for people to use pseudonyms or nicknames not tied to their email address if they prefer.  From Beijing, Jeremy Goldkorn of the influential media blog Danwei responded: “I like the way Buzz works now, and it seems to me that the privacy concerns have been addressed.”

I’ve noticed that some Chinese Buzz users have been using it to post and discuss material that has been censored by Chinese blog-hosting platforms and social networking sites. If Buzz becomes useful as a way to preserve and spread censored information around quickly, it seems to me that’s a plus as long as people aren’t being exposed in ways they don’t want. My friend Isaac Mao wrote:

It’s more important to Chinese to make information flowing rather than privacy concern this moment. With more hibernating animals in cave, we can’t tell too much on the risks about identity, but more on how to wake up them.

Buzz has unleashed some potentials on sharing which just follows my Sharism theory, people actually have much more stuff to share before they realize them.

But I agree with any concerns on privacy, including the risks that authority may trace publishers in China. It’s very much possible to be targeted once they were notified how profound the new tool is.

The “Great Firewall” is already at work on Buzz, at least in Beijing. While most people seem to be able to access Buzz through GMail on Chinese Internet connections, numerous people report from Beijing that at least some Google profiles – including mine and Isaac’s – are blocked, though people in Shanghai and Guangzhou say they’re not blocked. Others in China report having trouble posting comments to Buzz, though it’s unclear whether this is a technical issue with Buzz or a Chinese network blocking issue, or some strange combination of the two.

It will be interesting to see how things evolve, and whether activists in various countries find Buzz to be a useful alternative to Facebook and other platforms – or not. Whatever happens, I do think that Google fully deserves the negative press it has gotten and continues to get for the thoughtless way in which Buzz was rolled out. There are senior people at Google whose job it is to focus on free expression issues, and others who work full time on privacy issues. Either the Buzz development team completely failed to consult with these people or was allowed to ignore them. I am inclined to believe the former instead of the latter, based on my interactions with the company through the Global Network Initiative and Google’s support for Global Voices. Call me biased or sympathetic if you want, but I don’t think that the company made a conscious decision to ignore the risks it was creating for human rights activists or people with abusive spouses – or anybody else with privacy concerns. However, if we do give Google the benefit of the doubt, then the only logical conclusion is that in this case, something about the company’s management and internal communications was so broken that the company was unable to prevent a new product from unintentionally doing evil. Nick Summers at Newsweek thinks the problem is broader:

Google is so convinced of the righteousness of its mission statement that it launches products heedlessly. Take Google Books—the company was so in thrall with its plan to make all hardbound knowledge searchable that it did not anticipate a $125 million legal challenge from publishers. With Google Wave, engineers got high on their own talk that they had invented a means of communication superior to e-mail—until Wave launched and users laughed at its baffling un-usability. Last week, with Buzz, Google seemed so bewitched by the possibilities of a Google-y take on social networking that it went live without thinking through the privacy implications.

Whatever the case may be in terms of Google’s internal thinking or intentions, we have a right to be concerned. Too many of us depend on Google for too many things. As I’ve written before, I believe Google has a responsibility to netizens around the world to develop more effective mechanisms to ensure that the concerns, interests, and rights of the world’s netizens are adequately incorporated into the development process.

I’d very much like to hear your ideas for how netizens’ concerns around the world – particularly from at-risk and marginalized communities who have the most to lose when Google gets things wrong – might be channeled to Google’s development teams and product managers. Rather than wait for Google to figure this out, are there mechanisms that we as netizens might be able to build?  Are there things we can proactively do to help companies like Google avoid doing evil? Can we help them to avoid hurting us – and also help them to maximize the amount of good they can do?

(Cross-posted from RConversation)


Mozilla Debates Whether to Trust Chinese CA

[Note our follow-up posts on this topic: Web Security Trust Models, and Web Certification Fail: Bad Assumptions Lead to Bad Technology]

Sometimes geeky technical details matter only to engineers. But sometimes a seemingly arcane technical decision exposes deep social or political divisions. A classic example is being debated within the Mozilla project now, as designers decide whether the Mozilla Firefox browser should trust a Chinese certification authority by default.

Here’s the technical background: When you browse to a secure website (typically at a URL starting with “https:”), your browser takes two special security precautions: it sets up a private, encrypted “channel” to the server, and it authenticates the server’s identity. The second step, authentication, is necessary because a secure channel is useless if you don’t know who is on the other end. Without authentication, you might be talking to an impostor.

Suppose you’re connecting to https://mail.google.com, to pick up your Gmail. To authenticate itself to you, the server will (1) do some fancy math to prove to you that it knows a certain encryption key, and (2) present you with a digital certificate (or “cert”) attesting that only Google knows that encryption key. The cert is created by a Certification Authority (“CA”), which asserts that it has done the necessary due diligence to establish that the designated encryption key is known only to Google Inc.

If the CA is competent and honest, then you can rely on the cert, and your connection will be secure. But a dishonest CA can trick you into talking to an impostor site, so you need to be cautious about which CAs you trust. Your browser comes preinstalled with a list of CAs whom it will trust. In principle you can change this list, but almost nobody does. So browser vendors effectively decide which CAs their users will trust.
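The same arrangement exists outside browsers. As a rough illustration, Python’s standard `ssl` module builds a default client context that requires a valid certificate, checks the server’s name, and relies on whatever CA list the operating system or Python installation supplies. This shows Python’s defaults, not any browser’s, and the helper name `describe_default_trust` is my own.

```python
import ssl


def describe_default_trust():
    """Summarize the checks and CA list a default Python TLS client uses."""
    ctx = ssl.create_default_context()
    return {
        # The server must present a certificate that chains to a trusted CA.
        "requires_valid_cert": ctx.verify_mode == ssl.CERT_REQUIRED,
        # The certificate must also match the hostname being contacted.
        "checks_hostname": ctx.check_hostname,
        # CA certificates loaded so far; on some systems they are read
        # lazily from a directory, so this count can understate the list.
        "loaded_ca_count": len(ctx.get_ca_certs()),
    }


print(describe_default_trust())
```

As with the browser, the person running this code almost never chose the CA list; it came preinstalled, and every entry on it is trusted equally.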

With this background in mind, let’s unpack the Mozilla debate. What set off the debate was the addition of the China Internet Network Information Center (CNNIC) as a trusted CA in Firefox. CNNIC is not part of the Chinese government but many people assert that it would be willing to act in concert with the Chinese government.

To see why this is worrisome, let’s suppose, just for the sake of argument, that CNNIC were a puppet of the Chinese government. Then CNNIC’s status as a trusted CA would give it the technical power to let the Chinese government spy on its citizens’ “secure” web connections. If a Chinese citizen tried to make a secure connection to Gmail, their connection could be directed to an impostor Gmail site run by the Chinese government, and CNNIC could give the impostor a cert saying that the government impostor was the real Gmail site. The Chinese citizen would be fooled by the fake Gmail site (having no reason to suspect anything was wrong) and would happily enter his Gmail password into the impostor site, giving the Chinese government free run of the citizen’s email archive.

CNNIC’s defenders respond that any CA could do such a thing. If the problem is that CNNIC is too close to a government, what about the CAs already on the Firefox CA list that are governments? Isn’t CNNIC being singled out because it is Chinese? Doesn’t the country with the largest Internet population deserve at least one slot among the dozens of already trusted CAs? These are all good questions, even if they’re not the whole story.

Mozilla’s decision touches deep questions of fairness, trust, and institutional integrity that I won’t even pretend to address in this post. No single answer will be right for all users.

Part of the problem is that the underlying technical design is fragile. Any CA can certify to any user that any server owns any name, so the consequences of a misplaced trust decision are about as bad as they can be. It’s tempting to write this off as bonehead design, but in truth the available design options are all unattractive.


The Engine of Job Growth? Tracking SBA-backed Loans Through Recovery.gov

Last week at a Town Hall Meeting in New Hampshire, President Obama stated that “we’re going to start where most new jobs start—with small businesses,” and he encouraged Congress to transfer $30 billion from the Troubled Asset Relief Program to a new program called the Small Business Lending Fund. As this proposal was unveiled, the Administrator of the U.S. Small Business Administration (SBA) Karen Mills sat directly behind the President, reflecting the fact that the Administration’s proposal is a vote of confidence in the SBA and its existing loan programs.

The central role proposed for the SBA invites questions about existing SBA loans made with Recovery Act funds. These loans can be tracked through Recovery.gov, the official “user-friendly, public-facing website” that has evolved under the direction of the Recovery Accountability and Transparency Board, an agency created when the President signed into law the American Recovery and Reinvestment Act of 2009 (ARRA) on February 17, 2009.

Curious about how well Recovery.gov works, I analyzed a stimulus loan to a business in Red Lodge, Montana, where I live. First I accessed “Agency Reported” data through Recovery.gov, and then compared that information with what I could learn from field visits with the loan recipient and the community banker who made the loan.

What the drill-down map at Recovery.gov tells you: According to the map available at the official website, a local business called “Sheep Mountain Feed” received an $81,000 loan through the Small Business Administration’s (SBA) “Rural Lender Advantage.”

What the drill-down map at Recovery.gov doesn’t tell you: The official website does not specify how the loan proceeds were spent. Nor does the website explain if the $81,000 is the face value of the loan or the amount guaranteed by the SBA. For that matter, SBA’s role in making the loan is not clarified.

To learn more about these things, I called Sheep Mountain Feed and arranged a visit with the owner, a woman named Deb Padget who, before opening the store, had ranched 2,000 head of bison. I also met with the local banker who arranged the loan (the SBA relies on lenders to make the loans it guarantees), and an SBA employee based in Helena, Montana. And for background I reviewed the June 8, 2009 Federal Register Notice relating to SBA’s temporary 90% guarantee (thanks to Princeton’s Fed Thread project).

Sheep Mountain Feed is a retail store catering to animal farmers and pet owners that sells animal feed, electric fencing, baby chicks, and other odds and ends such as buckets and horseshoes sold at any rural animal store. When Deb decided to buy the business in April of 2009, she had managed the retail store for three years, and she wanted to make some changes. Without abandoning the “large-animal” owners who had built the feed business, she saw an opportunity to focus more on pet owners. “Everybody in Red Lodge has a dog,” she told me. “Not everybody has a horse.”

She would need to buy pet supplies to take things in this new direction, and she would also need money to buy the business and remodel the interior of the store. This is how she spent the loan proceeds that she eventually received: buying and remodeling Sheep Mountain Feed, and purchasing inventory. However, the first bank she visited rejected her within ten minutes. At the second bank she tried, she met with a local loan officer and learned quickly that he was also from a North Dakota farming family. Here she got a warmer welcome, and was told that her timing was good: In March 2009, about one month before Deb’s visit, the SBA received $730 million in funding from the ARRA to offer increased loan guarantees and the temporary elimination of loan fees.

To get this “stimulus loan” Deb would need to submit a business plan with her loan application, but she’d never before needed a business plan and didn’t even have an executive summary. She was sent to an SBA employee in Billings for free counseling, and this employee helped Deb to prepare a business plan from scratch. (At one point, in order to develop Deb’s financial projections, the SBA contact called her own dog-groomer to find out about the going rate for grooming sessions in Billings.)

The U.S. Small Business Administration (SBA) was created in 1953 as an independent agency of the federal government to help people start and grow businesses. Even without the stimulus money, SBA’s so-called 7(a) loan program guarantees up to 85% of a qualifying loan made to a local business through a local bank. The guarantee is designed to induce local banks to lend more into the community by removing most of the risk of default. And as previously mentioned, in early 2009 the SBA received Recovery money to guarantee up to 90% of 7(a) loans. This is the kind of loan that Deb received.

In addition to subsidizing SBA’s temporary 90 percent guarantee, the Recovery Act also allowed SBA to temporarily waive certain fees that it charges. Usually the agency collects fees equal to three percent of the loan’s face value to cover delinquencies. Lenders and borrowers pay these fees. In this case, the community bank that made the loan and Deb would have had to pay $2,790 just to close the deal. We know this because the breakdown of the loan to Sheep Mountain Feed at USASpending.gov shows an “original subsidy cost” of $2,790. By studying the data at USASpending, and interviewing offline sources, it also emerged that $81,000 is the amount guaranteed by the SBA (Sheep Mountain Feed got $90,000).
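The relationship between the two dollar figures is simple arithmetic, sketched below in Python. The $90,000 face value and the 85% and 90% guarantee rates come from the account above; the variable names are mine, and the comparison with the standard 85% guarantee is only an illustration of what the same loan would have carried without Recovery Act funds.

```python
# Figures taken from the loan described above.
loan_face_value = 90_000   # dollars lent to Sheep Mountain Feed
standard_rate_pct = 85     # usual 7(a) guarantee ceiling
recovery_rate_pct = 90     # temporary Recovery Act guarantee

# Integer arithmetic keeps the dollar amounts exact.
recovery_guarantee = loan_face_value * recovery_rate_pct // 100
standard_guarantee = loan_face_value * standard_rate_pct // 100

print(recovery_guarantee)                       # 81000
print(standard_guarantee)                       # 76500
print(recovery_guarantee - standard_guarantee)  # 4500
```

So the $81,000 shown on the Recovery.gov map is 90% of the $90,000 loan, and the Recovery Act’s extra five percentage points of guarantee amount to $4,500 of additional risk taken off the bank.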

The takeaway from this study is that Recovery.gov provides good data, but not always enough context (e.g. an explanation of SBA’s role) to understand the data. Yet in the absence of Recovery.gov, even learning that Sheep Mountain Feed received a government-backed loan would have been difficult, so the official website is a helpful starting point for people motivated to track stimulus money.

By disseminating information about a Montana-based loan to citizens in every state, including citizens not predisposed to support any specific local project, Recovery.gov provides the public with information about what the government is doing and invites feedback. How the government processes this feedback—and in general takes advantage of the insight of people inside and outside the Federal government—is an open question, but at least the Recovery Board is on it, and now it’s also the focus of a working group (pursuant to OMB’s December 8, 2009 Open Government Directive).

In that spirit, here are a few suggestions for making Recovery.gov more useful to people trying to track SBA-backed stimulus loans.

(1) Create web links to the SBA website where the agency explains how the standard and stimulus-enriched 7(a) loan program works (SBA itself does not make loans, but instead guarantees a portion of loans made and administered by banks);

(2) Create links to the Small Business Act (15 U.S.C. § 636, as amended), the relevant provisions of the American Recovery and Reinvestment Act of 2009 affecting the SBA (ARRA, P.L. 111-5, §§ 501-502), and the provisions of the Department of Defense Appropriations Act, 2010 that extend the stimulus-enriched SBA program through the end of February 2010;

(3) Establish links from Recovery.gov to USASpending.gov, particularly targeted links showing the source of the stimulus loan information. Recovery.gov does explain that “Agency Reported Data” comes from three sources, including USAspending.gov, but there are no links from stimulus projects to USASpending.

This project was more about Recovery.gov than the SBA, but listening to President Obama urge the creation of a Small Business Lending Fund because it “will help small banks do even more of what our economy needs – and that’s ensure that small businesses are once again the engine of job growth in America,” there was the obvious question about the $90,000 loan to Sheep Mountain Feed: Would it create or retain any jobs? I put this question to Deb. She said that the loan “created” one full-time job, her job running the business. She’s also employing a dog-groomer part-time, and another part-time employee (a student) who works on weekends. Getting these facts is easier than knowing if the full $90,000 loan to Sheep Mountain Feed should be credited to the Recovery Act. Would the business have received the loan anyway, even without SBA’s extra 5% guarantee and the temporary elimination of $2,790.00 in fees? The only sure thing is that estimating the employment impact of the Recovery Act is complicated (it was the subject of a recent OMB Guidance Memorandum). That’s something everybody can agree on.


The Traceability of an Anonymous Online Comment

Yesterday, I described a simple scenario where a plaintiff, who is having difficulty identifying an alleged online defamer, could benefit from subpoenaing data held by a third party web service provider. Some third parties—like Facebook in yesterday’s example—know exactly who I am and know whenever I visit or post on other sites. But even when no third party has the whole picture, it may still be possible to identify me indirectly, by combining data from different third parties. This is possible because loading one webpage can potentially trigger dozens of nearly simultaneous web connections to various third party service providers, whose records can then be subpoenaed and correlated.

Suppose that I post an anonymous and potentially defamatory comment on a Boing Boing article, but Boing Boing for some reason is unable to supply the plaintiff with any hints about who I am—not even my IP address. The plaintiff will only know that my comment was posted publicly at “9:42am on Fri. Feb 5.” But as I mentioned yesterday, Boing Boing—like almost every other site on the web—takes advantage of a handful of useful third party web services.

For example, one of these services—for an article that happens to feature video—is an embedded streaming media service that hosts the video that the article refers to. The plaintiff could issue a subpoena to the video service and ask for information about any user that loaded that particular embedded video via Boing Boing around “9:42am on Fri. Feb 5.” There might be one user match or a few user matches, depending on the site’s traffic at the time, but for simplicity, say there is only one match—me. Because the video service tracks each user with a unique persistent cookie, the service can and probably does keep a log of all videos that I have ever loaded from their service, whether or not I actually watched them. The subpoena could give the plaintiff a copy of this log.

In perusing my video logs, the plaintiff may see that I loaded a different video, earlier that week, embedded into an article on TechCrunch. He may notice further that TechCrunch uses Google Analytics. With two more subpoenas—one to TechCrunch and one to Google—and some simple matching up of dates and times from the different logs, the plaintiff can likely rebuild a list of all the other Analytics-enabled websites that I’ve visited, since these will likely be noted in the records tied to my Analytics cookie.
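The matching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log layout, the cookie IDs, the page names, the year 2010, and the five-minute window are all hypothetical, and a real video service's records would look different.

```python
from datetime import datetime, timedelta

# Hypothetical rows from the video service's log: (cookie_id, embedding_page, time).
VIDEO_LOG = [
    ("c-17", "boingboing.net/article-a", datetime(2010, 2, 5, 9, 41)),
    ("c-17", "techcrunch.com/story-x", datetime(2010, 2, 2, 14, 5)),
    ("c-88", "boingboing.net/article-a", datetime(2010, 2, 5, 7, 30)),
    ("c-88", "example.org/post", datetime(2010, 2, 4, 11, 0)),
]


def cookies_near(log, page, when, window=timedelta(minutes=5)):
    """Return the cookie IDs that loaded `page` within `window` of `when`."""
    return {c for c, p, t in log if p == page and abs(t - when) <= window}


def history_for(log, cookie_id):
    """Return every (page, time) row recorded for one cookie, oldest first."""
    rows = [(p, t) for c, p, t in log if c == cookie_id]
    return sorted(rows, key=lambda row: row[1])


# The comment's public timestamp is the only fact the plaintiff starts with.
comment_time = datetime(2010, 2, 5, 9, 42)
matches = cookies_near(VIDEO_LOG, "boingboing.net/article-a", comment_time)

# Each matching cookie's full history suggests where to send the next subpoena.
histories = {cookie: history_for(VIDEO_LOG, cookie) for cookie in matches}
```

With this sample data only one cookie falls inside the window, and its history includes the earlier TechCrunch page load. With busier logs the window would return several candidates, which later rounds of discovery would have to narrow.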

The bottom line: From the moment I first load that video on Boing Boing, the plaintiff gains the power to traverse multiple silos of data, held by independent third party entities, to trace my activities and link my anonymous comment to my web browsing history. Given how heavily I use the web, my browsing history will tell the plaintiff a lot about me, and it will probably be enough to uniquely identify who I am.

But this is just one example of many potential paths that a plaintiff could take to identify me. Recall from yesterday that when I visit Boing Boing, the site quietly forwards my information to the servers of at least 17 other parties. Each one of these 17 is a potential subpoena target in the first round of discovery. The information culled from this first round—most importantly, what other websites I’ve visited and at what times—could inform a second round of subpoenas, targeted to these other now-relevant websites and third parties. From there, as you might already be able to tell, the plaintiff can repeat this data linking process and expand the circle of potentially identifying information.
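The round-by-round expansion can be pictured as a breadth-first walk over a graph of record holders. The sketch below assumes a hypothetical `LEADS` map (which parties each party's records point to); the entries are invented for illustration and are not a description of any real site's logs.

```python
# Hypothetical map: the other parties that each party's records point to.
LEADS = {
    "boingboing.net": ["video-service", "facebook"],
    "video-service": ["techcrunch.com"],
    "techcrunch.com": ["google-analytics"],
    "facebook": [],
}


def discovery_rounds(start, leads, max_rounds):
    """Breadth-first expansion: list the parties newly reached in each round."""
    seen = {start}
    frontier = [start]
    rounds = []
    for _ in range(max_rounds):
        newly_found = []
        for party in frontier:
            for lead in leads.get(party, []):
                if lead not in seen:
                    seen.add(lead)
                    newly_found.append(lead)
        if not newly_found:
            break  # no new subpoena targets, so discovery stops here
        rounds.append(newly_found)
        frontier = newly_found
    return rounds


rounds = discovery_rounds("boingboing.net", LEADS, max_rounds=3)
```

Each inner list holds the subpoena targets that first become relevant in that round. The `max_rounds` cap stands in for the practical limits of cost and time that a real plaintiff would face.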

A recent privacy study from Berkeley shows how far such a strategy might reach. The Berkeley researchers found that nearly all of the top 100 sites on the web contain some sort of “web bug,” another term for the hidden web connection that allows a third party to automatically track a user on the site. Some of these sites will load dozens of web bugs on each page visit, which will litter user data far and wide on third party servers. Moreover, the study found that Google Analytics—by far the most popular website statistics service—was used by more than 70% of all sites they surveyed in March 2009. Once they add other Google-run services like DoubleClick and AdSense into the calculation, this figure rises to 88% of all sites that use some Google service—an astonishingly broad and dominant ability to follow users as they browse the web. Even smaller but still popular third party entities have significant reach, spanning thousands of sites across the web.

The traceability of any given site visitor will still depend on context: the number of third party services used by the site, the popularity of each third party service across the web, the types of identifying data that these parties collect and store, whether the speaker used any online anonymity tools, and many other site-specific factors.

Despite the variability in third party tracing capabilities, the nearly simultaneous connections to a few third party services mean that the results of tracing can be combined. By sleuthing through information held in third party dossiers, logs and databases, plaintiffs in John Doe lawsuits will have many more discovery options than they had ever previously imagined.


What Third Parties Know About John Doe

As David mentioned in his previous post, plaintiffs’ lawyers in online defamation suits will typically issue a sequence of two “John Doe” subpoenas to try to unmask the identity of anonymous online speakers. The first subpoena goes to the website or content provider where the allegedly defamatory remarks were posted, and the second subpoena is sent to the speaker’s ISP. Both entities—the content provider and the ISP—are natural targets for civil discovery. Their logs together will often contain enough information to trace the remarks back to the speaker’s real identity. But when this isn’t enough to identify the speaker, the discovery process traditionally fails.

Are plaintiffs in these cases out of luck? Not if their lawyers know where else to look.

There are numerous third party web services that may hold just enough clues to reidentify the speaker, even without the help of the content provider or the ISP. The vast majority of websites today depend on third parties to deliver valuable services that would otherwise be too expensive or time-consuming to develop in-house. Services such as online advertising, content distribution and web analytics are almost always handled by specialized servers from third party businesses. As such, a third party can embed its service into a wide variety of sites across the web, allowing it to track users across all the sites where it maintains a presence.

Take for example the popular online blog Boing Boing. Upon loading its main page while recording the HTTP session, I noticed that my browser is automatically redirected to domains owned by no fewer than 17 distinct third party entities: 10 services that engage in advertising or marketing, five that embed media or integrate social networking functionality, and two that provide web analytics. By visiting this single webpage, my digital footprints have been scattered to and collected by at least 17 other online entities that I made no deliberate attempt to contact. And each of these entities will likely have stored a cookie on my web browser, allowing it to identify me uniquely later when I browse to one of its other partner sites. I don’t mean to pick on Boing Boing specifically—taking advantage of third party services is a nearly universal practice on the web today, but it’s exactly this pervasiveness that makes it so plausible, if not probable, that all of my digital footprints together could link much of my online activities back to my actual identity.
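For readers who want to repeat this kind of count, here is a rough Python sketch of how the URLs in a recorded HTTP session might be grouped by domain. It is not the method used for the count above. The sample URLs are made up, and treating the last two labels of a hostname as "the domain" is a deliberate simplification that fails for suffixes like co.uk; a careful tally would consult the Public Suffix List and then map domains to the companies that own them.

```python
from urllib.parse import urlsplit


def base_domain(url):
    """Approximate a URL's domain by the last two labels of its hostname."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def third_party_domains(page_url, requested_urls):
    """Return the domains contacted while loading a page, minus the page's own."""
    own = base_domain(page_url)
    return {base_domain(u) for u in requested_urls} - {own, ""}


# Hypothetical capture of the requests made while loading one page.
session = [
    "http://boingboing.net/style.css",
    "http://api.facebook.com/count",
    "http://www.google-analytics.com/ga.js",
    "http://ads.example.com/banner.js",
    "http://cdn.example.com/player.js",
]
third_parties = third_party_domains("http://boingboing.net/", session)
```

Two requests to different hosts under example.com collapse into one entry, which is the point: what matters for tracing is how many distinct parties received a request, not how many requests were made.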

To make this point concrete, let’s say I post a potentially defamatory remark about someone using a pseudonym in the comments section of a Boing Boing article. It happens that for each article, Boing Boing displays the number of times that the article has been shared on Facebook. In order to fetch the current number, Boing Boing redirects my browser to api.facebook.com to make a real-time query to the Facebook API. Since I happen to be logged in to Facebook at the time of the request, my browser forwards with the query my unique Facebook cookie, which includes information that explicitly identifies me—namely, my e-mail address that doubles as my Facebook username.

In order to integrate a bit of useful social networking functionality, Boing Boing enables Facebook, a third party in this situation, to learn which articles I visit on Boing Boing and the dates and times of my visits. The same is true for Tweetmeme, which can now positively link my Twitter account—which I’m also logged in to—with my Boing Boing visits. Even without an authenticated login, the 15 other third parties present on Boing Boing could track me using any number of different methods, including browser fingerprinting, to build detailed dossiers that slowly begin to piece together who I am.

From the perspective of a plaintiff’s lawyer, even if Boing Boing is unwilling or unable to produce any useful information, these third parties might be able to uniquely identify me as the likely defamer, or at least narrow the list of possible speakers down to a handful of users. But tracing speech is not always this easy. Tomorrow, I’ll discuss more complicated discovery strategies and the extent to which they are technically feasible.


Identifying John Doe: It might be easier than you think

Imagine that you want to sue someone for what they wrote, anonymously, in a web-based online forum. To succeed, you’ll first have to figure out who they really are. How hard is that task? It’s a question that Harlan Yu, Ed Felten, and I have been kicking around for several months. We’ve come to some tentative answers that surprised us, and that may surprise you.

Until recently, I thought the picture was very grim for would-be plaintiffs, writing that it should be simple for “even a non-technical Internet user to engage in effectively untraceable speech online.” I still think it’s feasible for most users, if they make enough effort, to remain anonymous despite any level of scrutiny they are practically likely to face. But in recent months, as Harlan, Ed, and I have discussed this issue, we’ve started to see a flip side to the coin: In many situations, it may be far easier to unmask apparently anonymous online speakers than they, I, or many others in the policy community have appreciated. Today, I’ll tell a story that helps explain what I mean.

Anonymous online speech is a mixed bag: it includes some high value speech such as political dissent in repressive regimes, some dreck we happily tolerate on First Amendment grounds, and some material that violates the laws of many jurisdictions, including child pornography and defamatory speech. For purposes of this discussion, let’s focus on cases like the recent AutoAdmit controversy, in which a plaintiff wishes to bring a defamation suit against an anonymous or pseudonymous poster to a web based discussion forum. I’ll assume, as in the AutoAdmit suit, that the plaintiff has at least a facially plausible legal claim, so that if everyone’s identity were clear, it would also be clear that the plaintiff would have the legal option to bring a defamation suit. In the online context, these are usually what’s called “John Doe” suits, because the plaintiff’s lawyer does not know the name of the defendant in the suit, and must use “John Doe” as a stand in name for the defendant. After filing a John Doe suit, the plaintiff’s lawyer can use subpoenas to force third parties to reveal information that might help identify the John Doe defendant.

In situations like these, if a plaintiff’s lawyer cannot otherwise determine who the poster is, the lawyer will typically subpoena the forum web site, seeking the IP address of the anonymous poster. Many widely used web based discussion systems, including for example the popular WordPress blogging platform, routinely log the IP addresses of commenters. If the web site is able to provide an IP address for the source of the allegedly defamatory comment, the lawyer will do a reverse lookup, a WHOIS search, or both, on that IP address, hoping to discover that the IP address belongs to a residential ISP or another organization that maintains detailed information about its individual users. If the IP address does turn out to correspond to a residential ISP — rather than, say, to an open wifi hub at a coffee shop or library — then the lawyer will issue a second subpoena, asking the ISP to reveal the account details of the user who was using that IP address at the time it was used to transmit the potentially defamatory comment. This is known as a “subpoena chain” because it involves two subpoenas (one to the web site, and a second one, based on the results of the first, to the ISP).
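The reverse lookup step is easy to try with Python's standard library. The sketch below is illustrative only: the address comes from a range reserved for documentation, `reverse_lookup` needs a working network and resolver, and a WHOIS query (not shown) would be a separate step.

```python
import ipaddress
import socket


def ptr_name(ip):
    """Return the DNS name that a reverse lookup for `ip` would query."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_lookup(ip):
    """Return the hostname registered for `ip`, or None if the lookup fails.

    This performs a live DNS query, so the result depends on the network.
    """
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


# 192.0.2.10 is a documentation-only address, used here as a stand-in.
query_name = ptr_name("192.0.2.10")
```

A hostname that names a residential ISP suggests a second subpoena is worth sending. A name belonging to a coffee shop, library, or anonymizing relay suggests the chain may end there.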

Of course, in many cases, this method won’t work. The forum web site may not have logged the commenter’s IP address. Or, even if an address is available, it might not be readily traceable back to an ISP account: the anonymous commenter may have been using an anonymization tool like Tor to hide his address. Or he may have been coming online from a coffee shop or similarly public place (which typically will not have logged information about its transient users). Or, even if he reached the web forum directly from his own ISP, that ISP might be located in a foreign jurisdiction, beyond the reach of an American lawyer’s usual legal tools.

Is this a dead end for the plaintiff’s lawyer, who wants to identify John Doe? Probably not. There is a range of other parties, not yet part of our story, who might have information that could help identify John Doe. When it comes to the AutoAdmit site, one of these parties is StatCounter.com, a web traffic measurement service that AutoAdmit uses to keep track of trends in its traffic over time.

At the moment I am writing this post, anyone can verify that AutoAdmit uses StatCounter by visiting AutoAdmit.com and choosing “View Source” from the web browser menu. The first screenful of web page code that comes up includes a block of text helpfully labeled “StatCounter Code,” which in turn runs a small piece of JavaScript that places a personalized StatCounter cookie on the machine of every user who visits AutoAdmit, or else (if one is already present) detects and records exactly which cookie it is. That’s how StatCounter can tell which visitors to AutoAdmit.com are new, which ones are returning, and which pages on the site are of greatest interest to new and returning users. StatCounter is in a position to track not only each user, but also each page, and each visit by a user to a certain page, over time. This includes not only the home page, but also the particular web page for each discussion “thread” on the site. Moreover, each post (even if anonymous) is marked with the time it was posted, down to the minute. So the plaintiff’s lawyer in our story could go to StatCounter, and ask only about visits to the particular thread where the relevant message was posted. If the post went up at 6:03 p.m. on a certain date, the lawyer could ask StatCounter, “What if anything do you know about the person who visited this web page at 6:03 p.m. on this date?” Of course, if John Doe’s browser is configured to refuse cookies, he wouldn’t be trackable. But most web based discussion sites, including AutoAdmit, rely on cookies to let people log in to their pseudonymous accounts in order to post comments in the first place. In any case, the web is a much less convenient place without cookies, and as a practical matter most users do allow them.
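To make the place-or-detect cookie logic and the lawyer's question concrete, here is a toy model in Python. It is a sketch under assumptions, not a description of StatCounter's actual system: the `VisitLog` class, the cookie ID format, the thread paths, and the dates are all hypothetical.

```python
import itertools
from datetime import datetime


class VisitLog:
    """Toy analytics log: issue a cookie to new browsers, recognize returning ones."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.visits = []  # rows of (cookie_id, page, time)

    def record(self, cookie_id, page, when):
        """Log one page load; issue a fresh cookie ID if the browser sent none."""
        if cookie_id is None:
            cookie_id = f"visitor-{next(self._ids)}"
        self.visits.append((cookie_id, page, when))
        return cookie_id

    def visitors_at(self, page, when):
        """Return the cookie IDs that loaded `page` in the same minute as `when`."""
        minute = when.replace(second=0, microsecond=0)
        return {
            c for c, p, t in self.visits
            if p == page and t.replace(second=0, microsecond=0) == minute
        }


log = VisitLog()
first = log.record(None, "/thread/123", datetime(2010, 1, 1, 18, 2, 40))
log.record(first, "/thread/123", datetime(2010, 1, 1, 18, 3, 15))
other = log.record(None, "/thread/999", datetime(2010, 1, 1, 18, 3, 20))

# "Who visited this thread at 6:03 p.m. on this date?"
answer = log.visitors_at("/thread/123", datetime(2010, 1, 1, 18, 3))
```

The query matches on the minute because, as noted above, posts are timestamped only to the minute.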

In fact, the lawyer may be able to do better still: The anonymous commenter will have accessed the page at least twice — once to view the discussion as it stood before he took part, and again after clicking the button to add his own post to the mix. If StatCounter recorded both visits, as it very likely would have, then it becomes even easier to tie the anonymous commenter to his StatCounter cookie (and to whatever browsing history StatCounter has associated with that cookie).
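The two-visit observation suggests a simple filter: keep only the cookies that loaded the thread at least twice around the time of the post. The Python sketch below uses made-up visit rows and an arbitrary ten-minute window; both are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical analytics rows: (cookie_id, page, time).
VISITS = [
    ("visitor-1", "/thread/123", datetime(2010, 1, 1, 18, 1)),
    ("visitor-1", "/thread/123", datetime(2010, 1, 1, 18, 3)),
    ("visitor-2", "/thread/123", datetime(2010, 1, 1, 18, 2)),
    ("visitor-3", "/thread/123", datetime(2010, 1, 1, 12, 0)),
    ("visitor-3", "/thread/123", datetime(2010, 1, 1, 12, 5)),
]


def likely_posters(visits, page, posted_at, window=timedelta(minutes=10)):
    """Return cookie IDs with two or more loads of `page` within `window` of the post."""
    counts = Counter(
        c for c, p, t in visits
        if p == page and abs(t - posted_at) <= window
    )
    return {c for c, n in counts.items() if n >= 2}


candidates = likely_posters(VISITS, "/thread/123", datetime(2010, 1, 1, 18, 3))
```

In this sample, one reader who loaded the thread once near the post time and another who loaded it twice hours earlier are both excluded, leaving a single candidate.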

There are a huge number of things to discuss here, and we’ll tackle several in the coming days. What would a web analytics provider like StatCounter know? Likely answers include IP addresses, times, and durations for the anonymous commenter’s previous visits to AutoAdmit. What about other, similar services, used by other sites? What about “beacons” that simply and silently collect data about users, and pay webmasters for the privilege? What about behavioral advertisers, whose business model involves tracking users across multiple sites and developing knowledge of their browsing habits and interests? What about content distribution networks? How would this picture change if John Doe were taking affirmative steps, such as using Tor, to obfuscate his identity?

These are some of the questions that we’ll try to address in future posts.


CITP Seeks Visiting Faculty, Scholars or Policy Experts for 2010-2011

The Center for Information Technology Policy (CITP) at Princeton University seeks candidates for positions as visiting faculty members or researchers, or postdoctoral research associates for the 2010-2011 academic year.

About CITP

Digital technologies and public life are constantly reshaping each other—from net neutrality and broadband adoption, to copyright and file sharing, to electronic voting and beyond.

Realizing digital technology’s promise requires a constant sharing of ideas, competencies and norms among the technical, social, economic and political domains.

The Center for Information Technology Policy is Princeton University’s effort to meet this challenge. Its new home, which opened in September 2008, is a state of the art facility designed from the ground up for openness and collaboration. Located at the intellectual and physical crossroads of Princeton’s engineering and social science communities, the Center’s research, teaching and public programs are building the intellectual and human capital that our technological future demands.

To see what this mission can mean in practice, take a look at our website, at http://citp.princeton.edu.

About the Search

The Center has secured limited resources from a range of sources to support visiting faculty, scholars or policy experts for up to one-year appointments during the 2010-2011 academic year. We are interested in applications from academic faculty and researchers as well as from individuals who have practical experience in the policy arena. The rank and status of the successful applicant(s) will be determined on a case-by-case basis. We are particularly interested in hearing from faculty members at other universities and from individuals who have first-hand experience in public service in the technology policy area.

The successful applicant(s) will conduct research, engage in public programs, and may teach a seminar during their appointment subject to review and approval by the Dean of the Faculty. They’ll play an important role at a pivotal time in the development of this new center. They may be appointed to a visiting faculty or visiting fellow position, a term-limited research position, or a postdoctoral appointment, depending on qualifications.

We are happy to hear from anyone who works at the intersection of digital technology and public life. In addition to our existing strengths in computer science and sociology, we are particularly interested in identifying engineers, economists, lawyers, civil servants and policy analysts whose research interests are complementary to our existing activities.

If you are interested, please submit a CV and cover letter, stating background, intended research, and salary requirements, to https://jobs.princeton.edu.

Princeton University is an equal opportunity employer and complies with applicable EEO and affirmative action regulations. For information about applying to Princeton and voluntarily self-identifying, please see http://www.princeton.edu/dof/about_us/dof_job_openings/

Deadline: March 1, 2010.


iPad to Test Zittrain's "Future of the Internet" Thesis

Jonathan Zittrain famously argued in his book “The Future of the Internet, and How to Stop It” that we were headed for a future in which general purpose computers would be replaced by locked-down computing appliances.

Apple’s new iPad will put Zittrain’s thesis to the test. The iPad, as announced, has aspects of both an appliance and a general purpose computer. (Zittrain would say “generative”, but I’ll stick with the standard computer science term “general purpose”.) Will the appliance side kill the general-purpose side?

The iPad is an appliance in the sense that it runs applications from Apple’s App Store. The App Store is a “walled garden” containing only apps that have been approved by Apple. Apple has systematically refused to approve certain types of apps, and it has subjected apps to a vetting process that can be slow and mystifying. To the extent that Apple refuses broad categories of apps, this is an appliance approach to computing.

On the other hand, the iPad has a web browser. Modern browsers have become general-purpose platforms for delivering a broad class of applications. Pair a Bluetooth keyboard to your iPad, fire up the browser, and you have a fancy netbook — a general-purpose device that can run applications of any type.

For the iPad to become a Zittrain-type appliance, two things must happen. First, Apple must remain picky about which apps are available in the App Store. Second, Apple must limit the device’s browser so that it lacks the features that make today’s browsers viable application platforms. Will Apple be able to limit their product in this way, despite competition from other, more general-purpose tablets? I doubt it.

But even this — even an appliance-style iPad — would not be enough to prove Zittrain’s thesis. Zittrain argued not just that appliances would exist, but that they would replace general purpose computers. Amazon’s Kindle is an appliance, but it doesn’t prove Zittrain’s thesis because nobody is ditching their laptop in favor of a Kindle. Instead, the Kindle is an extra device which is used for its purpose, while the general-purpose device is used for everything else. If the iPad ends up like the Kindle — a complement to the laptop or netbook, rather than a replacement for it — this will not prove Zittrain’s thesis.

It seems unlikely, then, that the iPad, even if it succeeds, will provide strong support for Zittrain’s thesis. General-purpose computers are so useful that we’re not likely to abandon them.

UPDATE: A few minutes after posting this, I saw that Zittrain had published his own take on this question.