January 20, 2025

More Privacy, Bit by Bit

Before the Holidays, Yahoo got a flurry of good press for the announcement that it would (as the LA Times puts it) “purge user data after 90 days.” My eagle-eyed friend Julian Sanchez noticed that the “purge” was less complete than privacy advocates might have hoped. It turns out that Yahoo won’t be deleting the contents of its search logs. Rather, it will merely be zeroing out the last 8 bits of users’ IP addresses. Julian is not impressed:

dropping the last byte of an IP address just means you’ve narrowed your search space down to (at most) 256 possibilities rather than a unique machine. By that standard, this post is anonymous, because I guarantee there are more than 255 other guys out there with the name “Julian Sanchez.”

The first three bytes, in the majority of cases, are still going to be enough to give you a service provider and a rough location. Assuming every address in the range is in use, dropping the least-significant byte just obscures which of the 256 users at that particular provider is behind each query. In practice, though, the search space is going to be smaller than that, because people are creatures of habit: You’re really working with the pool of users in that range who perform searches on Yahoo. If your not-yet-anonymized logs show, say, 45 IP addreses that match those first three bytes making routine searches on Yahoo (17.6% of the search market x 256 = 45) you can probably safely assume that an “anonymized” IP with the same three leading bytes is one of those 45. If different users tend to exhibit different usage patterns in search time, clustering of queries, expertise with Boolean operators, or preferred natural language, you can narrow it down further.

I think this isn’t quite fair to Yahoo. Dropping the last eight bits of the IP address certainly doesn’t protect privacy as much as deleting log entries entirely, but it’s far from useless. To start with, there’s often not a one-to-one correspondence between IP addresses and Internet users. Often a single user has multiple IPs. For example, when I connect to the Princeton wireless network, I’m dynamically assigned an IP address that may not be the same as the IP address I used the last time I logged on. I also access the web from my iPhone and from hotels and coffee shops when I travel. Conversely, several users on a given network may be sharing a single IP address using a technology called network address translation. So even if you know the IP address of the user who performed a particular search, that may simply tell you that the user works for a particular company or connected from a particular coffee shop. Hence, tracking a particular user’s online activities is already something of a challenge, and it becomes that much harder if several dozen users’ online activities are scrambled together in Yahoo!’s logs.

Now, whether this is “enough” privacy depends a lot on what kind of privacy problem you’re worried about. It seems to me that there are three broad categories of privacy concerns:

  • Privacy violations by Yahoo or its partners: Some people are worried that Yahoo itself is tracking their online activities, building an online profile about them, and selling this information to third parties. Obviously, Yahoo’s new policy will do little to allay such concerns. Indeed, as David Kravets points out, Yahoo will have already squeezed all the personal information it can out of those logs before it scours them. If you don’t trust Yahoo or its business partners, this move isn’t going to make you feel very much safer.
  • Data breaches: A second concern involves cases where customer data falls into the wrong hands due to a security breach. In this case, it’s not clear that search engine logs are especially useful to data thieves in the first place. Data thieves are typically looking for information such as credit card and Social Security numbers that can make them a quick buck. People rarely type such information into search boxes. Some searches may be embarrassing to users, but they probably won’t be so embarrassing as to enable blackmail or extortion. So search logs are not likely to be that useful to criminals, whether or not they are “anonymized.”
  • Court-ordered information release: This is the case where the new policy could have the biggest effect. Consider, for example, a case where the police seek a suspect’s search results. The new policy will help protect privacy in three ways: first, if Yahoo! can’t cleanly filter search logs by IP address, judges may be more reluctant to order the disclosure of several dozen users’ search results just to give police information from a single suspect. Second, scrubbing the last byte of the IP address will make searching through the data much more difficult. Finally, the resulting data will be less useful in the court of law, because prosecutors will need to convince a jury that a given search was performed by the defendant rather than another user who happened to have a similar IP address. At the margin, then, Yahoo’s new policy seems likely to significantly enhance user privacy against government information requests. The same principle applies in the case of civil suits: the recording and movie industries, for example, will have a harder time using Yahoo!’s search logs as evidence that a user was engaged in illegal file-sharing.

So based on the small amount of information Yahoo has made available, it seems that the new policy is a real, if small, improvement in users’ privacy. However, it’s hard to draw any definite conclusions without more specific information about what information Yahoo! is saving. Because anonymizing data is a lot harder than people think. AOL learned this the hard way in 2006 when “anonymized” search results were released to researchers. People quickly noticed that you could figure out who various users were by looking at the contents of their searches. The data wasn’t so anonymous after all.

One reason AOL’s data wasn’t so anonymous is that AOL had “anonymized” the data set by assigning each user a unique ID. That meant people could look at all searches made by a single user and find searches that gave clues to the user’s identity. Had AOL instead stripped off the user information without replacing it, it would have been much harder to de-anonymize the data because there would be no way to match up different searches by the same user. If Yahoo’s logs include information linking each user’s various searches together, then even deleting the IP address entirely probably won’t be enough to safeguard user privacy. On the other hand, if the only user-identifying information is the IP address, then stripping off the low byte of the IP address is a real, if modest, privacy enhancement.

Taking Advantage of Citizen Contrarians

In my last post, I argued that sifting through citizens’ questions for the President is a job best done outside of government. More broadly, there’s a class of input that is good for government to receive, but that probably won’t be welcome at the staff level, where moment-to-moment success is more of a concern than long-term institutional thriving. Tough questions from citizens are in this category. So is unexpected, challenging or contrarian citizen advice or policy input. A flood of messages that tell the President “I’m in favor of what you already plan to do,” perhaps leavened with a sprinkling of “I respectfully disagree, but still like you anyway,” would make for great PR, and better yet, since such messages don’t offer action guiding advice, they don’t actually drive any change whatsoever in what anyone in government—from the West Wing to the furthest corners of the executive branch—does.

Will the new administration set things up to run this way? I don’t know. Certainly, the cookie-cutter blandness of their responses to the first round of online citizen questions is not a promising sign. There’s no question that Obama himself sees some value in real, tough questions that come from the masses. But the immediate practical advantages of a choir that echoes the preacher may be a much more attractive prospect for his staff then the scrambling, search, and actual policy rethought that might have to follow tough questions or unexpected advice.

This outcome would be a lost opportunity precisely because there are pockets of untapped expertise, uncommon wisdom, and bright ideas out there. Surfacing these insights—the inputs that weren’t already going to be incorporated into the policy process, the thoughts that weren’t talking points during the campaign, the things we didn’t already know—is precisely what the new collaborative technologies have made possible.

On the other hand, in order for this to work, we need to be able to regard (at least some of) the surprising, unexpected or quirky citizen inputs as successes for the system that attracted them, rather than failures. We can already find out what the median voter thinks, without all these fancy new systems, and in any case, his or her opinion is unlikely to add new or unexpected value to the policy process.

Obamacto.org, a potential model for external sites that gather citizen input for government, has a leaderboard of suggested priorities for the new CTO, voted up by visitors to the site. The first three suggestions are net neutrality regulation, Patriot Act repeal and DMCA repeal—unsurprising major issues. Arguably, if enough people took part in the online voting, there would be some value in knowing how the online group had prioritized these familiar requests. But with the fourth item, things get interesting: it reads “complete the job on metrication that Ronald Reagan defunded.”

On the one hand, my first reaction to this is to laugh: Regardless of whether or not moving to the metric system would be a good idea, it’s something that doesn’t have nearly the political support today that would be needed in order for it to be a plausible priority for Obama’s CTO. Put another way, there’s no chance that persuading America to do this is the best use of the new administration’s political capital.

On the other hand, maybe that’s what these sorts of online fora are for: Changing which issues are on the table, and how we think about them. The netroots turned net neutrality into a mainstream political issue, and for all I know they (or some other constellation of political forces) could one day do the same for the drive to go metric.

Readers, commenters: What do you think? Are quirky inputs like the suggestion that Obama’s CTO focus on metrication a hopeful sign for the value new deliberative technologies can add in the political process? Or, are they a sign that we haven’t figured out how these technologies should work or how to use them?

Government Shouldn't "Help" Citizens Pick Tough Questions for Obama

A couple of weeks ago, Julian Sanchez at Ars Technica, Ben Smith at Politico and others noted a disturbing pattern on the incoming Obama administration’s Change.gov website: polite but pointed user-submitted questions about the Blagojevich scandal and other potentially uncomfortable topics were being flagged as “inappropriate” by other visitors to the site.

In less than a week, more than a million votes-for-particular-questions were cast. The transition team closed submissions and posted answers to the five most popular questions. The usefulness and interest of these answers was sharply limited: They reiterated some of the key talking points and platform language of Obama’s campaign without providing any new information. The transition site is now hosting a second round of this process.

It shouldn’t surprise us that there are, among the Presdient-elect’s many supporters, some who would rather protect their man from inconvenient questions. And for all the enthusiastic talk about wide-open debate, a crowdsourced system that lets anyone flag an item as inappropriate can give these few a perverse kind of veto over the discussion.

If the site’s operators recognize this kind of deliberative narrowing as a problem, there are ways to deal with it. One could require a consensus judgement of “inappropriateness” by a cross-section of Change.gov users that is large enough, or is diverse with respect to geography, time of visit, amount of past involvement in the site, or any number of other criteria before taking a question out of circulation. Questions that have been preliminarily flagged as inappropriate could enter a secondary moderation queue where their appropriateness can be debated, leading to a considered “up or down” vote on whether a given question belongs in the mix. The Obama transition team could even crowdsource this problem itself, looking for lay input (or input from experts at places like Digg) about how to make sure that reasonable-but-pointed questions stay in, while off topic, off color, or otherwise unacceptable ones remain out.

But what are the incentives of the new administration’s online team? They might well find it convenient, as Julian writes, to “crowdsource a dodge” to inconvenient questions–if the users of Change.gov adopt an expansive view of “inappropriateness,” the Obama team will likely benefit slightly from soft, supportive questions in the near term, though it will run the risk of allowing substantive problems, or citizen concerns, to fester over the longer term. And that tradeoff could hold much more appeal for the median administration staffer than it does for the median American.

In other words, having the administration’s own tech people manage the moderation of questions directed at the President may be like having the fox guard the henhouse. I agree that even this is much more open than recent past administrations, but I think the more interesting question here is what would be ideal.

I suspect this key plank of the new administration’s plans will never be able to be fully realized within government. The President needs to answer questions that a nonzero number of his most enthusiastic supporters are willing to characterize as “inappropriate.” And for that to happen, the online moderation needs to take place outside of .gov. A collective move toward one of the .org alternatives, for this key activity of sifting questions, would be a great first step. That way, the goal of finding tough but honest questions can plausibly sit paramount.

Researchers Show How to Forge Site Certificates

Today at the Chaos Computing Congress, a group of researchers (Alex Sotirov, Marc Stevens, Jake Appelbaum, Arjen Lenstra, Benne de Weger, and David Molnar) announced that they have found a way to forge website certificates that will be accepted as valid by most browsers. This means that they can successfully impersonate any website, even for secure connections.

Let me unpack that for non-experts.

One of the cornerstones of web security is the use of secure connections. When your browser makes a secure connection to (say) Amazon and gets a page to display, the browser displays in its address bar a URL like “https://www.amazon.com”. The “https” indicates that the secure (https) protocol was used, and the browser also displays a happy blue lock or key icon to tell you the connection was secured.

The browser cooperates with Amazon’s web server to secure the connection via a two-step process. First, the two computers negotiate a shared secret key that they can use to communicate privately, using crypto tricks that I won’t describe here. Second, your browser authenticates Amazon’s web server, that is, it assures itself that the party on the other end of the connection is the genuine Amazon.com server.

Amazon has a digital certificate that it sends to your browser, as part of proving its identity. The certificate is issued by a party called a certification authority or CA. Your browser comes pre-programmed with a list of CAs its trusts; you can change the list but hardly anyone does. If your browser makes an encrypted connection to “amazon.com”, and the party on the other end of the connection owns a certificate for the name “amazon.com”, and that certificate was issued by a CA that your browser trusts, then your browser will conclude that it has a secure connection to amazon.com.

Now we can understand what the researchers accomplished: they showed how to forge a certificate corresponding to any address on the Web. For example, they can forge a certificate that allows themselves, or you, or me, or anybody, to impersonate amazon.com, or freedom-to-tinker.com, or maybe even fbi.gov. That is supposed to be impossible, for obvious reasons.

The forged certificates will say they were issued by a CA called “Equifax Secure Global eBusiness”, which is trusted by the major browsers. The forged certificates will be perfectly valid; but they will have been made by forgers, not by the Equifax CA.

To do this, the researchers exploited a cryptographic weakness in one of the digital signature methods, “MD5 with RSA”, supported by the Equifax CA. The first step in this digital signature method is to compute the hash (strictly speaking, the cryptographic hash) of the certificate contents.

The hash is a short (128-bit) code that is supposed to be a kind of unique digest of the certificate contents. To be secure, the hash method has to have several properties, one of which is that it should be infeasible to find a collision, that is, to find two values A and B which have the same hash.

It was already known how to find collisions in MD5, but the researchers improved the existing collision-finding methods, so that they can now find two values R and F that have the same hash, where R is a “real” certificate that the CA will be willing to sign, and F is a forged certificate. This is deadly, because it means that a digital signature on R will also be a valid signature on F — so the attacker can ask the CA to sign the real certificate R, then copy the resulting signature onto F — putting a valid CA signature onto a certificate that the CA would never voluntarily sign.

To demonstrate this, the researchers created a forged certificate signed by the Equifax CA. For safety, they made the forged certificate expire in the past and point to a harmless site. But it’s clear from their description that they can forge a certificate for any site they want.

Whose fault is this? Partly it’s a consequence of problems with the MD5 hash method. It’s been known for a few years that MD5 is in the process of melting down, so prudent designers have been moving away from MD5, replacing it with newer, better hash methods. Similarly, prudent CAs should not be signing certificates that use MD5-based signature methods; instead they should insist on signature methods involving stronger hashes. The Equifax CA did not follow this precaution.

The problem can be fixed, for now, by having CAs refuse to create new MD5-based signatures. But this is a sobering reminder that the certification process that underlies web site authentication — a mechanism we all rely upon daily — is far from bulletproof.

Internet voting-a-go-go

Yes, we know that there’s no such thing as a perfect voting system, but the Estonians are doing their best to get as far away from perfection as possible. According to the latest news reports, Estonia is working up a system to vote from mobile phones. This follows on their earlier web-based Internet voting. What on earth are they thinking?

Let’s review some basics. The Estonian Internet voting scheme builds on the Estonian national ID card, which is a smartcard. You get the appropriate PCMCIA adapter and you can stick it into your laptop. Then, through some kind of browser plug-in, it can authenticate you to the voting server. No card, no voter impersonation. The Estonian system “avoids” the problem of voter bribery / coercion by allowing the voter to cast as many votes as they want, but only the last one actually counts. As I understand it, a voter may also arrive, on election day, at some sort of official polling place and substitute a paper ballot for their prior electronic ballot.

The threats to this were and are obvious. What if some kind of malware/virus/worm contraption infects your web browser and/or host operating system, waits for you to connect to the election server, and then quietly substitutes its own choices for yours? You would never know that the attack occurred and thus would never think to do anything about it. High tech. Very effective. And, of course, somebody can still watch over your shoulder while you vote. At that point, they just need to keep you from voting again. They could accomplish this by simply having you vote at the last minute, under supervision, or they could “borrow” your ID card until it’s too late to vote again. Low tech. Still effective.

But wait, there’s more! The central database must necessarily have your vote recorded alongside your name in order to allow subsequent votes to invalidate earlier votes. That means they’ve almost certainly got the technical means to deanonymize your vote. Do you trust your government to have a database that says exactly for whom you voted? Even if the vote contents are somehow encrypted, the government has all the necessary key material to decrypt it. (And, an aforementioned compromised host platform could be leaking this data, regardless.)

Okay, what about voting by cellular telephone? A modern cell phone is really no different from a modern web browser. An iPhone is running more-or-less the same OS X and Safari browser that’s featured on Apple’s Mac products. Even non-smart-phones tend to have an environment that’s powerful and general-purpose. There’s every reason to believe that these platforms are every bit as vulnerable to software attacks as we see with Windows systems. Just because hackers aren’t necessarily targeting these systems doesn’t mean they couldn’t. Ultimately, that means that the vulnerabilities of the phone system are exactly the same as the web system. No better. No worse.

Of course, crypto can be done in a much more sophisticated fashion. One Internet voting system, Helios, is quite sophisticated in this fashion, doing end-to-end crypto in JavaScript in your browser. With its auditability, Helios gives you the chance to challenge the entire client/server process to prove that it maintained your vote’s integrity. There’s nothing, however, in Helios to prevent an evil browser from leaking how you voted, thus compromising your anonymity. An evil election server could possibly be prevented from compromising your anonymity, depending on how the decryption keys are managed, but all the above privacy concerns still apply.

Yes, of course, Internet and cell-phone voting have lots of appeal. Vote from anywhere! At any time! If Estonia did more sophisticated cryptography, they could at least have a hope at getting some integrity guarantees (which they appear to be lacking, at present). Estonians have absolutely no privacy guarantees and thus insufficient protection from bribery and coercion. And we haven’t even scratched the surface of denial-of-service attacks. In 2007, Estonia suffered a large, coordinated denial-or-service attack, allegedly at the hands of Russian attackers. I’m reasonably confident that they’re every bit as vulnerable to such attacks today, and cell-phone voting would be no less difficult for resourceful attackers to disrupt.

In short, if you care about voter privacy, to defeat bribery and coercion, then you want voters to vote in a traditional polling place. If you care about denial of service, then you want these polling places to be operable even if the power goes out. If you don’t care about any of that, then consider the alternative. Publish in the newspaper a list of every voter and how they voted, for all the world to see, and give those voters a week to submit any corrections they might desire. If you were absolutely trying to maximize election integrity, nothing would beat it. Of course, if you feel that publishing such data in the newspaper could cause people to be too scared to vote their true preferences, then maybe you should pay more attention to voter privacy.

(More on this from Eric Rescorla’s Educated Guesswork.)