
Breaking Vanish: A Story of Security Research in Action

Today, seven colleagues and I released a new paper, “Defeating Vanish with Low-Cost Sybil Attacks Against Large DHTs”. The paper’s authors are Scott Wolchok (Michigan), Owen Hofmann (Texas), Nadia Heninger (Princeton), me, Alex Halderman (Michigan), Christopher Rossbach (Texas), Brent Waters (Texas), and Emmett Witchel (Texas).

Our paper is the next chapter in an interesting story about the making, breaking, and possible fixing of security systems.

The story started with a system called Vanish, designed by a team at the University of Washington (Roxana Geambasu, Yoshi Kohno, Amit Levy, and Hank Levy). Vanish tries to provide “vanishing data objects” (VDOs) that can be created at any time but will only be usable within a short time window (typically eight hours) after their creation. This is an unusual kind of security guarantee: the VDO can be read by anybody who sees it in the first eight hours, but after that period expires the VDO is supposed to be unrecoverable.

Vanish uses a clever design to do this. It takes your data and encrypts it, using a fresh random encryption key. It then splits the key into shares, so that a quorum of shares (say, seven out of ten shares) is required to reconstruct the key. It takes the shares and stores them at random locations in a giant worldwide system called the Vuze DHT. The Vuze DHT throws away items after eight hours. After that the shares are gone, so the key cannot be reconstructed, so the VDO cannot be decrypted — at least in theory.
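
To make the key-splitting step concrete, here is a minimal Python sketch of the threshold (“k of n”) secret-sharing idea. This is a textbook Shamir construction, not the Vanish code itself; the 7-of-10 parameters simply echo the example above, and the encryption of the data and the scattering of shares into the Vuze DHT are left out.

```python
# A minimal sketch (not the Vanish implementation) of threshold secret sharing:
# split a fresh key into n shares so that any k of them reconstruct it.
import os
import secrets

PRIME = 2**521 - 1  # a Mersenne prime comfortably larger than a 128-bit key

def split_secret(secret: int, n: int, k: int):
    """Split `secret` into n Shamir shares; any k of them reconstruct it."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):          # Horner evaluation of the polynomial mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def recover_secret(shares):
    """Lagrange interpolation at x = 0 recovers the polynomial's constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

key = int.from_bytes(os.urandom(16), "big")   # fresh 128-bit data key
shares = split_secret(key, n=10, k=7)         # 7-of-10 quorum, as in the example above
assert recover_secret(shares[:7]) == key      # any 7 shares suffice
assert recover_secret(shares[3:]) == key
```

Any seven of the ten shares reconstruct the key, while six or fewer reveal nothing about it; that is what lets Vanish rely on the DHT quietly discarding shares over time.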

What is this Vuze DHT? It’s a worldwide peer-to-peer network, containing a million or so computers, that was set up by Vuze, a company that uses the BitTorrent protocol to distribute (licensed) video content. Vuze needs a giant data store for its own purposes, to help peers find the videos they want, and this data store happens to be open so that Vanish can use it. The million-computer extent of the Vuze data store was important, because it gave the Vanish designers a big haystack in which to hide their needles.

Vanish debuted on July 20 with a splashy New York Times article. Reading the article, Alex Halderman and I realized that some of our past thinking about how to extract information from large distributed data structures might be applied to attack Vanish. Alex’s student Scott Wolchok grabbed the project and started doing experiments to see how much information could be extracted from the Vuze DHT. If we could monitor Vuze and continuously record almost all of its contents, then we could build a Wayback Machine for Vuze that would let us decrypt VDOs that were supposedly expired, thereby defeating Vanish’s security guarantees.

Scott’s experiments progressed rapidly, and by early August we were pretty sure that we were close to demonstrating a break of Vanish. The Vanish authors were due to present their work in a few days, at the Usenix Security conference in Montreal, and we hoped to demonstrate a break by then. The question was whether Scott’s already heroic sleep-deprived experimental odyssey would reach its destination in time.

We didn’t want to ambush the Vanish authors with our break, so we took them aside at the conference and told them about our preliminary results. This led to some interesting technical discussions with the Vanish team about technical details of Vuze and Vanish, and about some alternative designs for Vuze and Vanish that might better resist attacks. We agreed to keep them up to date on any new results, so they could address the issue in their talk.

As it turned out, we didn’t establish a break before the Vanish team’s conference presentation, so they did not have to modify their presentation much, and Scott finally got to catch up on his sleep. Later, we realized that evidence to establish a break had actually been in our experimental logs before the Vanish talk, but we hadn’t been clever enough to spot it at the time. Science is hard.

Some time later, I ran into my ex-student Brent Waters, who is now on the faculty at the University of Texas. I mentioned to Brent that Scott, Alex and I had been studying attacks on Vanish and we thought we were pretty close to making an attack work. Amazingly, Brent and some Texas colleagues (Owen Hofmann, Christopher Rossbach, and Emmett Witchel) had also been studying Vanish and had independently devised attacks that were pretty similar to what Scott, Alex, and I had.

We decided that it made sense to join up with the Texas team, work together on finishing and testing the attacks, and then write a joint paper. Nadia Heninger at Princeton did some valuable modeling to help us understand our experimental results, so we added her to the team.

Today we are releasing our joint paper. It describes our attacks and demonstrates that the attacks do indeed defeat Vanish. We have a working system that can decrypt Vanishing data objects (made with the original version of Vanish) after they are supposedly unrecoverable.

Our paper also discusses what went wrong in the original Vanish design. The people who designed Vanish are smart and experienced, but they obviously made some kind of mistake in their original work that led them to believe that Vanish was secure — a belief that we now know is incorrect. Our paper talks about where we think the Vanish authors went wrong, and what security practitioners can learn from the Vanish experience so far.

Meanwhile, the Vanish authors went back to the drawing board and came up with a bunch of improvements to Vanish and Vuze that make our attacks much more expensive. They wrote their own paper about their experience with Vanish and their new modifications to it.

Where does this leave us?

For now, Vanish should be considered too risky to rely on. The standard for security is not “no currently demonstrated attacks”, it is “strong evidence that the system resists all reasonable attacks”. By updating Vanish to resist our attacks, the Vanish authors showed that their system is not a dead letter. But in my view they are still some distance from showing that Vanish is secure. Given the complexity of underlying technologies such as Vuze, I wouldn’t be surprised if more attacks turn out to be possible. The latest version of Vanish might turn out to be sound, or to be unsound, or the whole approach might turn out to be flawed. It’s too early to tell.

Vanish is an interesting approach to a real problem. Whether this approach will turn out to work is still an open question. It’s good to explore this question — and I’m glad that the Vanish authors and others are doing so. At this point, Vanish is of real scientific interest, but I wouldn’t rely on it to secure my data.

[Update (Sept. 30, 2009): I rewrote the paragraphs describing our discussions with the Vanish team at the conference. The original version may have given the wrong impression about our intentions.]

Netflix's Impending (But Still Avoidable) Multi-Million Dollar Privacy Blunder

In my last post, I had promised to say more about my article on the limits of anonymization and the power of reidentification. Although I haven’t said anything for a few weeks, others have, and I especially appreciate posts by Susannah Fox, Seth Schoen, and Nate Anderson. Not only have these people summarized my article well, they have also added a lot of insightful commentary, and I commend these three posts to you.

Today brings news relating to one of the central examples in my paper: Netflix has announced plans to commit a privacy blunder that could cost it millions of dollars in fines and civil damages.

In my article, I focus on Netflix’s 2006 decision to release millions of records containing the movie rating preferences of “anonymized” users to the public, in order to fuel a crowd-sourcing competition called the Netflix Prize. The Netflix Prize has been a huge win for Netflix’s public relations, but it has also been a win for academics, who have used the data to improve the science of guessing human behavior from past preferences.

The Netflix Prize was also a watershed event for reidentification research because Arvind Narayanan and Vitaly Shmatikov of U. Texas revealed that they could reidentify some of the “anonymized” users with ease, proving that we are more uniquely tied to our movie rating preferences than intuition would suggest. In my paper, I argue that we should worry about this privacy breach even if we don’t think movie ratings are terribly sensitive, because it can be used to enable other, more terrifying privacy breaches.

I never argue, however, that Netflix deserves punishment or sanction for having released this data. In my opinion, Netflix acted pretty responsibly. It consulted with computer scientists in a (failed) attempt to anonymize successfully. It tried perturbing the data in order to make reidentification harder. And other experts seem to have been surprised by how easy it was for Narayanan and Shmatikov to reidentify. Even with the benefit of hindsight, I find nothing to blame in how Netflix handled the privacy implications of what it did.

Although I give Netflix a pass for its past privacy breach, I am astonished to learn from the New York Times that the company plans a second act:

The new contest is going to present the contestants with demographic and behavioral data, and they will be asked to model individuals’ “taste profiles,” the company said. The data set of more than 100 million entries will include information about renters’ ages, gender, ZIP codes, genre ratings and previously chosen movies. Unlike the first challenge, the contest will have no specific accuracy target. Instead, $500,000 will be awarded to the team in the lead after six months, and $500,000 to the leader after 18 months.

Netflix should cancel this new, irresponsible contest, which it has dubbed Netflix Prize 2. Researchers have known for more than a decade that gender plus ZIP code plus birthdate uniquely identifies a significant percentage of Americans (87%, according to Latanya Sweeney’s famous study). True, Netflix plans to release age, not birthdate, but simple arithmetic shows that for many people in the country, gender plus ZIP code plus age will narrow their private movie preferences down to at most a few hundred people. Netflix needs to understand the concept of “information entropy”: even if it is not revealing information tied to a single person, it is revealing information tied to so few that we should consider this a privacy breach.
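
To spell out that simple arithmetic, here is a back-of-envelope calculation. The figures are my own rough assumptions about 2009-era numbers (population, ZIP code count, age range), not data from Netflix or from Sweeney’s study.

```python
# Back-of-envelope version of the "simple arithmetic" above.
# All figures are rough assumptions, not data from Netflix or the article.
population = 305_000_000   # approximate U.S. population, circa 2009
zip_codes = 43_000         # approximate number of ZIP codes
genders = 2
ages = 80                  # roughly how many distinct ages appear in such data

avg_per_bucket = population / (zip_codes * genders * ages)
print(f"average people per (ZIP, gender, age) group: {avg_per_bucket:.0f}")   # ~44

# Even a ZIP code ten times the average size leaves only a few hundred
# people sharing a given gender and age.
big_zip = 10 * population / zip_codes
print(f"people per (gender, age) group in a very large ZIP: {big_zip / (genders * ages):.0f}")
```

On average, a (ZIP code, gender, age) group holds only a few dozen people, and even a ZIP code ten times the average size keeps the group in the low hundreds, which is where the “at most a few hundred people” figure comes from.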

I have no doubt that researchers will be able to use the techniques of Narayanan and Shmatikov, together with databases revealing sex, zip code, and age, to tie many people directly to these supposedly anonymized new records.

Because of this, if it releases the data, Netflix might be breaking the law. The Video Privacy Protection Act (VPPA), 18 USC 2710, prohibits a “video tape service provider” (a broadly defined term) from revealing “personally identifiable information” about its customers. Aggrieved customers can sue providers under the VPPA, and courts can order “not less than $2500” in damages for each violation. If somebody brings a class action lawsuit under this statute, Netflix might face millions of dollars in damages.

The FTC might also decide to fine Netflix for violating its privacy policy as an unfair business practice.

Either a lawsuit under the VPPA or an FTC investigation would turn, in large part, on one sentence in Netflix’s privacy policy: “We may also disclose and otherwise use, on an anonymous basis, movie ratings, consumption habits, commentary, reviews and other non-personal information about customers.” If sued or investigated, Netflix will surely argue that its acts are immunized by the policy, because the data is disclosed “on an anonymous basis.” While this argument might have carried the day in 2006, before Narayanan and Shmatikov conducted their study, the argument is much weaker in 2009, now that Netflix has many reasons to know better, including, in part, my paper and the publicity surrounding it. A weak argument is made even weaker if Netflix includes the kind of data (ZIP code, age, and gender) that we have known for over a decade fails to anonymize.

The good news is Netflix has time to avoid this multi-million dollar privacy blunder. As far as I can tell, the Netflix Prize 2 has not yet been launched.

Dear Netflix executives: Don’t do this to your customers, and don’t do this to your shareholders. Cancel the Netflix Prize 2, while you still have the chance.

Anonymization FAIL! Privacy Law FAIL!

I have uploaded my latest draft article, entitled Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, to SSRN (look carefully for the download button, just above the title; it’s a little buried). According to my abstract:

Computer scientists have recently undermined our faith in the privacy-protecting power of anonymization, the name for techniques for protecting the privacy of individuals in large databases by deleting information like names and social security numbers. These scientists have demonstrated they can often “reidentify” or “deanonymize” individuals hidden in anonymized data with astonishing ease. By understanding this research, we will realize we have made a mistake, labored beneath a fundamental misunderstanding, which has assured us much less privacy than we have assumed. This mistake pervades nearly every information privacy law, regulation, and debate, yet regulators and legal scholars have paid it scant attention. We must respond to the surprising failure of anonymization, and this Article provides the tools to do so.

I have labored over this article for a long time, and I am very happy to finally share it publicly. Over the next week or so, I will write a few blog posts here, summarizing the article’s high points and perhaps expanding on what I couldn’t get to in a mere 28,000 words.

Thanks to Ed, David, and everybody else at Princeton’s CITP for helping me develop this article during my visit earlier this year.

Please let me know what you think, either in these comments or by direct email.

If You're Going to Track Me, Please Use Cookies

Web cookies have a bad name. People often complain — with good reason — about sites using cookies to track them. Today I want to say a few words in favor of tracking cookies.

[Technical background: An HTTP “cookie” is a small string of text. When your web browser gets a file from a site, the site can send along a cookie. Your browser stores the cookie. Later, if the browser gets another file from the same site, the browser will send along the cookie.]

What’s important about cookies, for our purposes, is that they allow a site to tell when it’s seeing the same browser (and therefore, probably, the same user) that it saw before. This has benign uses — it’s needed to implement the shopping cart feature of e-commerce sites (so the site knows which cart is yours) and to remember that you have logged in to a site so you don’t have to log in over and over.

The dark side of cookies involves “hidden” sites that track your activities across the web. Suppose you go to A.com, and A.com’s site includes a banner ad that is provided by the advertising service AdService.com. Later, you go to B.com, and B.com also includes a banner ad provided by AdService.com. When you’re reading A.com and your browser goes to AdService.com to get an ad, AdService.com gives you a cookie. Later, when you’re reading B.com and your browser goes back to AdService.com to get an ad, AdService.com will see the cookie it gave you earlier. This will allow AdService.com to link together your visits to A.com and B.com. Ad services that place ads on lots of sites can link together your activities across all of those sites, by using a “tracking cookie” in this way.
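
Here is a minimal Python sketch of that flow. The domain names come from the hypothetical example above, and the dictionaries standing in for the browser’s cookie jar and the ad service’s logs are purely illustrative.

```python
# A toy model of cross-site tracking with a third-party cookie.
import uuid

ad_service_log = {}    # cookie value -> pages on which AdService.com served an ad
browser_cookies = {}   # cookies this browser holds, keyed by site

def fetch_ad(referring_page):
    """What AdService.com sees each time a page embeds one of its ads."""
    cookie = browser_cookies.get("AdService.com")
    if cookie is None:                               # first visit: hand out a fresh cookie
        cookie = str(uuid.uuid4())
        browser_cookies["AdService.com"] = cookie    # the browser stores it
    ad_service_log.setdefault(cookie, []).append(referring_page)

fetch_ad("A.com/some-article")   # reading A.com, which embeds an AdService.com banner
fetch_ad("B.com/another-page")   # later, reading B.com, which does the same
print(ad_service_log)            # one cookie value now links both visits
```

Because the browser sends back whatever cookie AdService.com handed out earlier, both page visits end up filed under the same identifier in the ad service’s records.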

The obvious response is to limit or regulate the use of tracking cookies — the government could limit them, industry could self-regulate, or users could shun sites that associate themselves with tracking cookies.

But this approach could easily backfire. It turns out that there are lots of ways for a site to track users, by recognizing something distinctive about the user’s computer or by placing a unique marker on the computer and recognizing it later. These other tracking mechanisms are hard to detect — new tracking methods are discovered regularly — and unlike cookies they can be hard for users to manage. The tools for viewing, blocking, and removing cookies are far from perfect, but at least they exist. Other tracking measures leave users nearly defenseless.

My attitude, as a user, is that if a site is going to track me, I want them to do it openly, using cookies. Cookies offer me less transparency and control than I would like, but the alternatives are worse.

If I were writing a self-regulation code for the industry, I would have the code require that cookies be the only means used to track users across sites.

My Testimony on Behavioral Advertising: Post-Mortem

On Thursday I testified at a House hearing about online behavioral advertising. (I also submitted written testimony.)

The hearing started at 10:00am, gaveled to order by Congressman Rush, chair of the Subcommittee on Commerce, Trade, and Consumer Protection. He was flanked by Congressman Boucher, chair of the Subcommittee on Communications, Technology, and the Internet, and Congressmen Stearns and Radanovich, the Ranking Members (i.e., the highest-ranking Republican members) of the subcommittees.

First on the agenda we had opening statements by members of the committees. Members had either two or five minutes to speak, and the differing perspectives of the members became clear during these statements. The most colorful statement was by Congressman Barton, who supplemented his interesting on-topic statement with a brief digression about the Democrats vs. Republicans charity baseball game which was held the previous day. The Democrats won, to Congressman Barton’s chagrin.

After the opening statements, the chair recessed the hearings, so the Members could go to the House floor to vote. Members of the House must be physically present in the House chamber in order to vote, so it’s not unusual for hearings to recess when there is a floor vote. The House office buildings have buzzers, not unlike the bells that mark the ends of periods in a school, which alert everybody when a vote starts. The Members left the hearing room, and we all waited for the vote(s) to end, so our hearing could resume. The time was 10:45 AM.

What happened next was very unusual indeed. The House held vote after vote, more than fifty votes in total, as the day stretched on, hour after hour. They voted on amendments, on motions to reconsider the votes on the amendments, on other motions — at one point, as far as I could tell, they were voting on a motion to reconsider a decision to table an appeal of a procedural decision of the chair. To put it bluntly, the Republicans were staging a kind of work stoppage. They did this, I hear, to protest an unusual procedural limitation that the Democrats had placed on the handling of the appropriations bill that was currently before the House. I don’t know enough about the norms of House procedure to say which party had the better argument here — but I do know that the recess in our hearing lasted eight and a half hours.

These were not the most exciting eight and a half hours I have experienced. As the day stretched on, we did get a chance to wander around and do a little light tourism. Probably the highlight was when we saw Angelina Jolie in the hallway.

When we reconvened at 7:15 PM, the room, which had been overflowing with spectators in the morning, was mostly empty. The members of the committees, though, made a pretty good showing, which was especially impressive given that it was Thursday evening, when many Members hightail it back home to their districts. Late in the day, after a day that must have been frustrating for everybody, we sat down to business and had a good, substantive hearing. There were no major surprises — there rarely are at hearings — but everyone got a chance to express their views, and the members asked substantive questions.

Thinking back on the hearing, I did realize one thing that may have been missing. The panel of witnesses included three companies, Yahoo, Google, and Facebook, that are both ad services and content providers. There was less attention to situations where the ad service and the content provider are separate companies. In that case, the ad service does not have a direct relationship with the consumer, so the market pressure on the ad service to behave well is attenuated. (There is still some pressure, through the content provider, who wants to stay in the good graces of consumers, but an indirect link is not as effective as a direct one would be.) Yahoo, Google, and Facebook are household names, and we would naturally expect them to pay more careful attention to the desires of consumers and Congress than lower-profile ad services would.

Witnesses have the opportunity to submit further written testimony. Any suggestions on what I might discuss?