November 30, 2024

Government Datasets That Facilitate Innovation

[This is the first post in a series on best practices for government datasets by Harlan Yu and me.]

There’s a growing consensus that the government can increase its openness and transparency by publishing its raw data in bulk online. As several Freedom to Tinker contributors argued in Government Data and the Invisible Hand, publishing data empowers third party software developers to produce innovative new technologies that engage citizens and illuminate government’s inner workings. With the establishment of Data.gov and the federal Open Government Initiative, federal agencies are quickly embracing a culture of machine-readable data release, and many states and municipalities are now following their lead.

But how usable are these datasets for developers? The answer lies primarily in the structure and contents of the datasets themselves. While all data in digital form is technically machine-readable in some sense, the ease of use for machine-readable datasets can vary widely. In fact, machine-readability is just a baseline requirement: a developer can’t start to work with a dataset until it’s in this form. Once that minimum standard is met, the critical factor is how easy it is for developers to use the dataset in new, innovative ways.

In this series of posts, we’ll draw on our experience building applications that use government data to offer some thoughts about best practices government could follow in releasing data. By taking a few straightforward steps in preparing its datasets, government can make the data much more useful to developers.

One key factor in determining ease of use for developers is the structure of the dataset, and that is the topic of our first post. Let’s start with a trivial example:

<BOOK>A Tale of Two Cities by Charles Dickens. Chapter 1. The Period. It was the best of times, it was the worst of times [...] The end.</BOOK>

This is a “well-formed” XML version of Dicken’s “A Tale of Two Cities” in its entirety. Though more usable than a PDF copy of the book, the XML document lacks basic structure and is not particularly helpful to a developer building tools to display or analyze the book. Compare that to:

<BOOK>
  <HEADER>
    <TITLE>A Tale of Two Cities</TITLE>
    <AUTHOR>Charles Dickens</AUTHOR>
  </HEADER>
  <BODY>
    <CHAPTER NUMBER="1">
      <TITLE>The Period</TITLE>
      <PARAGRAPH NUMBER="1">
        <SENTENCE NUMBER="1">It was the best of times [...]</SENTENCE>
      </PARAGRAPH>
      [...]
    </CHAPTER>
    [...]
  </BODY>
</BOOK>

This data is far more structured, and a developer can take it and immediately do lots of new things. If the developer plans to build an interface for a new e-book reader for instance, it’s easy to extract the component parts of the book for appropriate formatting. With the less-structured version, the developer needs to guess where chapters, titles, and paragraphs begin and end. Because manual analysis is infeasible for large, complex datasets, developers who have only minimally-structured data will need to build automated processing scripts to make these guesses. Developing these scripts can be difficult and time-consuming, and data quality will suffer because the scripts will inevitably make mistakes.

Whether a dataset facilitates innovative uses by developers is not a yes or no question but a matter of degree, and it depends largely on the quality of the data’s structure and the needs of specific developers. In deciding what structure to add, agencies should consider who is in the best position to add various types of structure to the data. Sometimes, the agency is in the best position. Employees of an agency may amass specialized knowledge about the data, or the agency may already internally store the data with structural details like explicit database columns. In these cases, the agency can provide this structure with little effort, relieving developers from the potentially Herculean task of reconstructing these details. In other cases, the agency may have no significant advantage over private parties.

Agencies should get as close to this dividing line as is reasonably possible to broaden the range of creative possibilities for application developers. The goal is to minimize structural obstacles that might prevent developers from tinkering with the data. Better structure leads to more innovative tools, a more transparent government, and a greater appreciation for the work done by federal agencies.

Over our next several posts, we’ll discuss choices that agencies make when releasing datasets and the ways these choices affect developers. Among other things, we’ll explore basic data format lessons, data labeling, and correction/modification of datasets. Our goal is to turn this series into a best practices white paper for government use, and we’d appreciate any comments, suggestions, or insights from readers.

Web Certification Fail: Bad Assumptions Lead to Bad Technology

It should be abundantly clear, from two recent posts here, that the current model for certifying the identity of web sites is deeply flawed. When you connect to a web site, and your browser displays an https URL and a happy lock or key icon indicating a secure connection, the odds that you’re connecting to an impostor site, despite your browser’s best efforts, are uncomfortably high.

How did this happen? The last two posts unpacked some of the detailed problems with the current system. Today I want to explore the root cause: today’s system is based on wildly unrealistic assumptions about organizations and trust.

The theory behind the system is simple. Browser vendors will identify a set of Certificate Authorities (CAs) who are trusted to certify identities. Browsers will automatically accept any identity certificate issued by any of the trusted CAs.

The first step in making this system work is identifying some CA who is trusted by everybody in the world.

If that last sentence didn’t strike you as odd, go back and read it again. That’s right, the system assumes that there is some party who is trusted by everyone in the world — a spectacularly naive assumption.

Network engineers like to joke about the “evil bit”, a hypothetical label put on each network packet, indicating whether the packet is evil. (See RFC 3514, Steve Bellovin’s classic parody standards document codifying the evil bit. I’ve always loved that the official Internet standards series accepts parody standards.) Well, the “trusted bit” for certificate authorities is pretty much as the same as the evil bit, only applied to organizations rather than network packets. Yet somehow we ended up with a design that relies on this “trusted bit”.

The reason, in part, is unclear thinking about institutional trust, abetted by the unclear language we often use in discussing trust online. For example, we tend to conflate two meanings of the word “trusted”. The first meaning of “trusted”, which is the everyday meaning, implies a judgment that a party is unlikely to misbehave. The second meaning of “trusted”, more common in military security settings, is a factual statement that someone is vulnerable to misbehavior by another. In an ideal world, we would make sure that someone was trusted in the first sense before they became trusted in the second sense, that is, we would make sure that someone was unlikely to misbehave before we we made ourselves vulnerable to their misbehavior. This isn’t easy to do — and we will forget entirely to do it if we confuse the two meanings of trusted.

The second linguistic problem is to use the passive-voice construction “A is trusted to do X” rather than the active form “B trusts A to do X.” The first form is problematic because it doesn’t say who is doing the trusting. Consider these two statements: (A) “CNNIC is a trusted certificate authority.” (B) “Everyone trusts CNNIC to be a certificate authority.” The first statement might sound plausible, but the second is obviously false.

If you try to explain to yourself why the existing web certification system is sound, while avoiding the two errors above (confusing two senses of “trusted”, and failing to say who is doing the trusting), you’ll see pretty quickly that the argument for the current system is tenuous at best. You’ll see, too, that we can’t fix the system by using different cryptography — what we need are new institutional arrangements.

Web Security Trust Models

[This is part of a series of posts on this topic: 1, 2, 3, 4, 5, 6, 7, 8.]

Last week, Ed described the current debate over whether Mozilla should allow an organization that is allegedly controlled by the Chinese government to be a default trusted certificate authority. The post prompted some very insightful feedback, including questions about alternative trust models. I will try to lay out the different types of models on a high level, and I encourage corrections or clarifications. It’s worth re-stating that what we’re talking about is how you as a web user know that who you are talking to is who they claim to be (if they are, then you can be confident that your other security measures like end-to-end encryption are working).

Flat and Inflexible
This is the model we use now. Your browser comes pre-loaded with a list of Certificate Authorities that it will trust to guarantee the authenticity of web sites you visit. For instance, Mozilla (represented by the little red dragon in the diagram) ships Firefox with a list of pre-approved CAs. Each browser vendor makes its own list (here is Mozilla’s policy for how to get added). The other major browsers use the same model and have themselves already allowed CNNIC to become trusted for their users. This is a flat model because each CA has just as much authority as the others, thus each effectively sits at the “root” of authority. Indeed any of the CAs can sign certificates for any entity in the world (hence the asterisk in each). They do not coordinate with each other, and can sign a certificate for an entity even if another CA has already done so. Furthermore, they can confer this god-like power on other entities without oversight or the prior knowledge of the end users or the entities being signed for.

This is also an inflexible model because there is no reasonable way to impose finer-grained control on the authority of the CAs. The standard used is called X.509. It doesn’t allow you to trust Verisign to a greater or lesser extent than the Chinese government — it is essentially all or nothing for each. You also can’t tell your browser to trust CNNIC only for sites in China (although domain name constraints do exist in the standard, they are not widely implemented). It is also inflexible because most browsers intentionally make it difficult for a user to change the certificate list. It might be possible to partially mitigate some of the CA/X.509 shortcomings by implementing more constraints, improving the user interface, adding “out of band” certificate checks (like Perspectives), or generating more paranoid certificate warnings (like Certificate Patrol).

Decentralized and Dependent
In the early days of the web, an alternative approach already existed. This model did away entirely with a default set of external trusted entities and gave complete control to the individual. The idea was that you would start by trusting only people you “knew” (smiley faces in the diagram) to begin to build a “web of trust.” You then extend this web by trusting those people to vouch for others that you haven’t met (kind of like a a secure virtual version of Goodfellas). This makes it a fundamentally decentralized model. There is nothing limiting certain entities from gaining the trust of many people and therefore becoming de facto Certificate Authorities. This has only happened within technically proficient communities, and in the case of USENIX they eventually discontinued the service.

So, this is a system that is highly dependent on having some connection with whoever you want to communicate with. It has enjoyed some limited success via the PGP family of standards, but mostly for applications such as email or in more constrained situations like inter/intra-enterprise security. It is possible that with the boon in online social networks there is a new opportunity to renew interest in a web-of-trust style security architecture. The approach seems less practical for general web security because it requires the user to have some existing trust relationship with a site before using it securely. It is not necessarily an impossible approach — and the mod_openpgp and mod_gnutls projects show some technical promise — but as a practical matter wide-scale adoption of a “web of trust” style security model for the web seems unlikely.

Hierarchical and Delegated
A third approach starts with a single highly trusted root and delegates authority recursively. Any authority can only issue certificates for itself or the entities that fall “underneath” it, thus limiting the god-like power of the flat model. This also pushes signing power closer to the authenticated sites themselves. It is possible that this authority could be placed directly in their hands, rather than requiring an external authority to approve of each new certificate or domain. Note that I am describing this in a very domain-centric way. If we are willing to fully buy into the domain hierarchy way of thinking about web security, there may be a viable implementation path for this model.

Perhaps the greatest example of this delegation approach to web governance is the existing Domain Name System. Decisions at the root of DNS are governed by the international non-profit ICANN, which assigns authority to Top Level Domains (eg: .com, .net, .cn) who then further delegate through a system of registrars. The biggest problem with tying site authentication to DNS is that DNS is deeply insecure. However, within the next year a more secure version of DNS, DNSSEC, is scheduled to be deployed at the DNS root. Any DNSSEC query can be verified by following the chain of authority back to the root, and any contents of the response can be guaranteed to be unaltered through that chain of trust. The question is whether this infrastructure can be the basis for distributing site certificates as well, which could form the basis for hierarchical site authenticity (which would also permit encryption of traffic). CNNIC happens to also be the registry for the .cn TLD, so in this case it would be restricted to creating certificates for .cn domains. This approach is advocated by Dan Kaminsky (interview, presentation) and Paul Vixie (here, here). I’ve also found posts by Eric Rescorla and Jason Roysdon informative.

If implemented via DNSSEC, this approach would thoroughly bind web site authentication to the DNS hierarchy, and the only assurance it would provide is that you are communicating with the person who registered the domain you are visiting. It would not provide any additional verification about who that person is, as Certificate Authorities theoretically could do (but practically don’t). Certificates were originally envisioned as a way to guarantee that a particular real-world entity was behind the site in question, but market pressures caused CAs cut corners on the verification process. Most CAs now offer “Domain Validation” (DV) certificates that are issued without any human intervention and simply verify that the person requesting the certificate has control of the domain in question. These certificates are treated no differently than more rigorously verified certificates, so for all intents and purposes the DNSSEC certificate delegation model would provide at least the services of the current CA model. One exception is Extended Validation certificates, which require the CA to perform more rigorous checks and cause the browser URL bar to take on a “green glow”. It should hover be noted that there are some security flaws with the current implementation.

[Update: I discuss the DNSSEC approach in more detail here]

Open Questions
Are there appropriate stopgap measures on the existing CA model that can limit authority of certain political entities? Are there viable user interface improvements? Are users aware enough of these issues to do anything meaningful with more information about certificates? Does the hierarchical model force us to trust ICANN, and do we? Does the DNS hierarchy appropriately allocate authority? Is domain name enough of a proxy for identity that a DNS-based system makes sense? Do we need better ways of independently validating a person’s identity and binding that to their public key? Even if an alternative model is better, how do we motivate adoption?

Google Buzzkill

The launch of Google Buzz, the new social networking service tied to GMail, was a fiasco to say the least. Its default settings exposed people’s e-mail contacts in frightening ways with serious privacy and human rights implications. Evgeny Morozov, who specializes in analyzing how authoritarian regimes use the Internet, put it bluntly last Friday in a blog post: “If I were working for the Iranian or the Chinese government, I would immediately dispatch my Internet geek squads to check on Google Buzz accounts for political activists and see if they have any connections that were previously unknown to the government.”

According to the BBC, the Buzz development team bypassed Google’s standard trial and testing procedures in order to launch the product quickly. Apparently, the company only tested it internally with Google employees and failed to test the product with a more diverse range of users who are more likely to have brought up the issues which were so glaringly obvious after launch. Google has apologized and moved to correct the most eggregious privacy flaws, though problems – including security issues – continue to be raised. PC World has a good overview of Buzz’s evolution since launch.

Meanwhile, damage has been done not only to Google’s reputation but also to an unknown number of users who found themselves and their contacts exposed in ways they did not choose or want. Exposing vulnerable users without their knowledge or choice even for a few hours can potentially have irreversible consequences. While Google is scoring some points around the tech policy world for reacting quickly and earnestly to the strident user outcry, the Electronic Information Privacy Center (EPIC) has filed an official complaint with the FTC, and that Canada’s Privacy Commissioner has expressed disappointment and asked Google to explain itself. (UPDATE: A class complaint has been filed in San Jose, claiming that Google broke the law by sharing personal data without users’ consent.)

Earlier this week I asked people in my Twitter network how they’re feeling about Buzz after the fixes they’ve made. Some are now reassured but others aren’t. Joe Hall wrote:

@rmack totally lost me for good.. I just can’t believe that they won’t do it again. It will have to be very useful/different to get me back

Some are leaving GMail altogether. Judson Dunn reported:

@rmack my boyfriend deleted his long time gmail account for good 🙁

I was so concerned about exposing people in my GMail network during the first week after launch that I stayed off Buzz entirely until Monday afternoon. By then I felt that the worst privacy problems had been fixed, and I understood well enough how to tweak the settings that I could at least go in without doing harm to others. After playing with it a bit and poking around I posted some initial observations and invited the people in my network to respond. There are still plenty of issues – some people who claimed in Twitter that they had turned off Buzz are still there, and I think Buzz should make it easier for people to use pseudonyms or nicknames not tied to their email address if they prefer.  From Beijing, Jeremy Goldkorn of the influential media blog Danwei responded: “I like the way Buzz works now, and it seems to me that the privacy concerns have been addressed.”

I’ve noticed that some Chinese Buzz users have been using it to post and discuss material that has been censored by Chinese blog-hosting platforms and social networking sites. If Buzz becomes useful as a way to preserve and spread censored information around quickly, it seems to me that’s a plus as long as people aren’t being exposed in ways they don’t want. My friend Isaac Mao wrote:

It’s more important to Chinese to make information flowing rather than privacy concern this moment. With more hibernating animals in cave, we can’t tell too much on the risks about identity, but more on how to wake up them.

Buzz has unleashed some potentials on sharing which just follows my Sharism theory, people actually have much more stuff to share before they realize them.

But I agree with any conerns on privacy, including the risks that authority may trace publishers in China. It’s very much possible to be targeted once they were notified how profound the new tool is.

The “Great Firewall” is already at work on Buzz, at least in Beijing. While most people seem to be able to access Buzz through GMail on Chinese Internet connections, numerous people report from Beijing that at least some Google profiles – including mine and Isaac’s – are blocked, though people in Shanghai and Guangzhou say they’re not blocked. Others in China report having trouble posting comments to Buzz, though it’s unclear whether this is a technical issue with Buzz or a Chinese network blocking issue, or some strange combination of the two.

It will be interesting to see how things evolve, and whether activists in various countries find Buzz to be a useful alternative to Facebook and other platforms – or not. Whatever happens, I do think that Google fully deserves the negative press it has gotten and continues to get for the thoughtless way in which Buzz was rolled out. There are  senior people at Google whose job it is to focus on free expression issues, and others who work full time on privacy issues. Either the Buzz development team completely failed to consult with these people or were allowed to ignore them. I am inclined to believe the former instead of the latter, based on my interactions with the company through the Global Network Initiative and Google’s support for Global Voices. Call me biased or sympathetic if you want, but I don’t think that the company made a conscious dec
ision to ignore the risks it was creatin
g for human rights activists or people with abusive spouses – or anybody else with privacy concerns. However, if we do give Google the benefit of the doubt, then the only logical conclusion is that in this case, something about the company’s management and internal communications was so broken that the company was unable to prevent a new product from unintentionally doing evil. Nick Summers at Newsweek thinks the problem is broader:

Google is so convinced of the righteousness of its mission statement that it launches products heedlessly. Take Google Books—the company was so in thrall with its plan to make all hardbound knowledge searchable that it did not anticipate a $125 million legal challenge from publishers. With Google Wave, engineers got high on their own talk that they had invented a means of communication superior to e-mail—until Wave launched and users laughed at its baffling un-usability. Last week, with Buzz, Google seemed so bewitched by the possibilities of a Google-y take on social networking that it went live without thinking through the privacy implications.

Whatever the case may be in terms of Google’s internal thinking or intentions, we have a right to be concerned. Too many of us depend on Google for too many things. As I’ve written before, I believe Google has a responsibility to netizens around the world to develop more effective mechanisms to ensure that the concerns, interests, and rights of the world’s netizens are adequately incorporated into the development process.

I’d very much like to hear your ideas for how netizens’ concerns around the world – particularly from at-risk and marginalized communities who have the most to lose when Google gets things wrong – might be channeled to Google’s development teams and product managers. Rather than wait for Google to figure this out, are there mechanisms that we as netizens might be able to build?  Are there things we can proactively do to help companies like Google avoid doing evil? Can we help them to avoid hurting us – and also help them to maximize the amount of good they can do?

(Cross-posted from RConversation)

Mozilla Debates Whether to Trust Chinese CA

[Note our follow-up posts on this topic: Web Security Trust Models, and Web Certification Fail: Bad Assumptions Lead to Bad Technology]

Sometimes geeky technical details matter only to engineers. But sometimes a seemingly arcane technical decision exposes deep social or political divisions. A classic example is being debated within the Mozilla project now, as designers decide whether the Mozilla Firefox browser should trust a Chinese certification authority by default.

Here’s the technical background: When you browse to a secure website (typically at a URL starting with “https:”), your browser takes two special security precautions: it sets up a private, encrypted “channel” to the server, and it authenticates the server’s identity. The second step, authentication, is necessary because a secure channel is useless if you don’t know who is on the other end. Without authentication, you might be talking to an impostor.

Suppose you’re connecting to https://mail.google.com, to pick up your Gmail. To authenticate itself to you, the server will (1) do some fancy math to prove to you that it knows a certain encryption key, and (2) present you with a digital certificate (or “cert”) attesting that only Google knows that encryption key. The cert is created by a Certification Authority (“CA”), which asserts that it has done the necessary due diligence to establish that the designated encryption key is known only to Google Inc.

If the CA is competent and honest, then you can rely on the cert, and your connection will be secure. But a dishonest CA can trick you into talking to an impostor site, so you need to be cautious about which CAs you trust. Your browser comes preinstalled with a list of CAs whom it will trust. In principle you can change this list, but almost nobody does. So browser vendors effectively decide which CAs their users will trust.

With this background in mind, let’s unpack the Mozilla debate. What set off the debate was the addition of the China Internet Network Information Center (CNNIC) as a trusted CA in Firefox. CNNIC is not part of the Chinese government but many people assert that it would be willing to act in concert with the Chinese government.

To see why this is worrisome, let’s suppose, just for the sake of argument, that CNNIC were a puppet of the Chinese government. Then CNNIC’s status as a trusted CA would give it the technical power to let the Chinese government spy on its citizens’ “secure” web connections. If a Chinese citizen tried to make a secure connection to Gmail, their connection could be directed to an impostor Gmail site run by the Chinese government, and CNNIC could give the impostor a cert saying that the government impostor was the real Gmail site. The Chinese citizen would be fooled by the fake Gmail site (having no reason to suspect anything was wrong) and would happily enter his Gmail password into the impostor site, giving the Chinese government free run of the citizen’s email archive.

CNNIC’s defenders respond that any CA could do such a thing. If the problem is that CNNIC is too close to a government, what about the CAs already on the Firefox CA list that are governments? Isn’t CNNIC being singled out because it is Chinese? Doesn’t the country with the largest Internet population deserve at least one slot among the dozens of already trusted CAs? These are all good questions, even if they’re not the whole story.

Mozilla’s decision touches deep questions of fairness, trust, and institutional integrity that I won’t even pretend to address in this post. No single answer will be right for all users.

Part of the problem is that the underlying technical design is fragile. Any CA can certify to any user that any server owns any name, so the consequences of a misplaced trust decision are about as bad as they can be. It’s tempting to write this off as bonehead design, but in truth the available design options are all unattractive.