August 19, 2019

The Gizmodo Warrant: Searching Journalists in the Terabyte Age

Last Friday night, police officers in California used a warrant to search the home of Jason Chen, the Gizmodo blogger who wrote about the iPhone prototype found in a Redwood City bar. Orin Kerr has written an interesting post assessing the legality of the search. I wanted to touch on an important issue he didn’t discuss: Whether the search the police are conducting is unconstitutionally overbroad.

Orin discusses two laws that specifically shield journalists from being the target of a search, the California Reporter’s Shield Law, found jointly at California Penal Code 1524(g) and California Evidence Code 1070, and the federal Privacy Protection Act (PPA), 42 U.S.C. 2000aa. Both laws were written to limit the impact of Zurcher v. Stanford Daily, a U.S. Supreme Court case authorizing the use of a warrant to search a newspaper’s offices. The Supreme Court decided Zurcher in 1978, and Congress enacted the PPA in 1980 (and amended it in unrelated ways in 1996). I’m not sure when the California law was enacted, but I bet it’s of similar vintage. In other words, all of the rules that govern police searches of news offices were created in the age of typewriters, desks, filing cabinets, and stacks of paper.

Now, flash forward thirty years. The police who searched Jason Chen’s home seized the following: A macbook, HP server, two Dell desktop computers, iPad, ThinkPad, two MacBook Pros, IOmega NAS, three external hard drives, and three flash drives. They also seized other storage-containing devices, including two digital cameras and two smart phones. If Jason Chen’s computing habits are anything like mine, the police likely seized many terabytes of disk space, storing hundreds of thousands (millions?) of files, containing information stretching back years. And they took all of this information to investigate an alleged crime (the sale of the iPhone prototype) that could not have happened more than 37 days before the search (the iPhone was found on March 18th), which they learned about from a blog post published four days before the search.

I’m deeply concerned about overbreadth as the police begin to search through these terabytes of information. The police now possess, intermingled with the evidence of the alleged crime they are investigating, hundreds of thousands of documents belonging to a journalist/blogger that are utterly irrelevant to their investigation. Jason Chen has been blogging for Gizmodo since 2006, and he’s probably written hundreds of stories. The police likely have thousands of email messages revealing confidential sources, detailing meetings, and trading comments with editors, and thousands of other documents bearing notes from interviews, drafts of articles, and other sensitive information. Because of Chen’s beat, some of these documents probably reveal secrets of great economic and business value in the Silicon Valley. Under traditional, outmoded Fourth Amendment rules, the police can read every single document they possess, so long as they intend only to look for evidence of the crime, and under the “plain view rule,” they can use any evidence they find of other, unrelated crimes in court against Chen or anyone else.

If the California state courts share my concerns about overbreadth, they should consider embracing the very sensible rules for search warrants for computer hard drives (in any case, not just those involving journalists) adopted last year by the Ninth Circuit in United States v. Comprehensive Drug Testing. To paraphrase, in cases involving the search and seizure of computers, the Ninth Circuit requires five things: (1) the government must waive the plain view rule, meaning they must agree not to use evidence of crimes other than the one under investigation that led to the warrant; (2) the government must wall off the forensic experts who search the hard drive from the investigating the case; (3) the government must explain the “actual risks of destruction of information” they would face if they weren’t allowed to seize entire computers; (4) the government must use a search protocol to designate what information they can give to the investigating agents; and (5) the government must destroy or return non-responsive data.

These rules are especially needed when the target of a police search is a journalist (in fact, they may not go far enough). And these rules may be required under Zurcher. In justifying the search of the newspaper’s offices in Zurcher, the Supreme Court agreed that when the Fourth Amendment’s search and seizure rules collide with First Amendment values, like freedom of the press, the “Fourth Amendment must be applied with ‘scrupulous exactitude.'” The court went on to explain why ordinary search warrants for news offices (remember, back in the age of paper files) meet this heightened standard:

There is no reason to believe, for example, that magistrates cannot guard against searches of the type, scope, and intrusiveness that would actually interfere with the timely publication of a newspaper. Nor, if the requirements of specificity and reasonableness are properly applied, policed, and observed, will there be any occasion or opportunity for officers to rummage at large in newspaper files or to intrude into or to deter normal editorial and publication decisions.

When the California state courts combine this thirty-year-old statement of the law with the modern realities of terabyte storage devices, they should hold that the Fourth Amendment requires magistrate judges to play an integral and active role in the administration of the search of Jason Chen’s computers and other storage devices. At the very least, the courts should forbid the police from looking at any file timestamped before March 18, 2010, and in addition, they should force the police to comply with the Comprehensive Drug Testing rules. In the terabyte age, these rules are necessary at a minimum to prevent the police from interfering with a free press.

Netflix Cancels the Netflix Prize 2

Today, Netflix announced it is canceling its plans for a second Netflix Prize contest, one that reportedly would have involved the release of more information than the first. As I argued earlier, I feared that the new contest would have put the supposedly private movie viewing and rating habits of Netflix customers at great risk, and I applaud Netflix for making a very responsible decision. No doubt, pressure from the private lawsuit and FTC investigation helped Netflix make up its mind, and both are reportedly going away as a result of today’s action.

Netflix's Impending (But Still Avoidable) Multi-Million Dollar Privacy Blunder

In my last post, I had promised to say more about my article on the limits of anonymization and the power of reidentification. Although I haven’t said anything for a few weeks, others have, and I especially appreciate posts by Susannah Fox, Seth Schoen, and Nate Anderson. Not only have these people summarized my article well, they have also added a lot of insightful commentary, and I commend these three posts to you.

Today brings news relating to one of the central examples in my paper: Netflix has announced plans to commit a privacy blunder that could cost it millions of dollars in fines and civil damages.

In my article, I focus on Netflix’s 2006 decision to release millions of records containing the movie rating preferences of “anonymized” users to the public, in order to fuel a crowd-sourcing competition called the Netflix Prize. The Netflix Prize has been a huge win for Netflix’s public relations, but it has also been a win for academics, who have used the data to improve the science of guessing human behavior from past preferences.

The Netflix Prize was also a watershed event for reidentification research because Arvind Narayanan and Vitaly Shmatikov of U. Texas revealed that they could reidentify some of the “anonymized” users with ease, proving that we are more uniquely tied to our movie rating preferences than intuition would suggest. In my paper, I argue that we should worry about this privacy breach even if we don’t think movie ratings are terribly sensitive, because it can be used to enable other, more terrifying privacy breaches.

I never argue, however, that Netflix deserves punishment or sanction for having released this data. In my opinion, Netflix acted pretty responsibly. It consulted with computer scientists in a (failed) attempt to anonymize successfully. It tried perturbing the data in order to make reidentification harder. And other experts seem to have been surprised by how easy it was for Narayanan and Shmatikov to reidentify. Even with the benefit of hindsight, I find nothing to blame in how Netflix handled the privacy implications of what it did.

Although I give Netflix a pass for its past privacy breach, I am astonished to learn from the New York Times that the company plans a second act:

The new contest is going to present the contestants with demographic and behavioral data, and they will be asked to model individuals’ “taste profiles,” the company said. The data set of more than 100 million entries will include information about renters’ ages, gender, ZIP codes, genre ratings and previously chosen movies. Unlike the first challenge, the contest will have no specific accuracy target. Instead, $500,000 will be awarded to the team in the lead after six months, and $500,000 to the leader after 18 months.

Netflix should cancel this new, irresponsible contest, which it has dubbed Netflix Prize 2. Researchers have known for more than a decade that gender plus ZIP code plus birthdate uniquely identifies a significant percentage of Americans (87% according to Latanya Sweeney’s famous study.) True, Netflix plans to release age not birthdate, but simple arithmetic shows that for many people in the country, gender plus ZIP code plus age will narrow their private movie preferences down to at most a few hundred people. Netflix needs to understand the concept of “information entropy”: even if it is not revealing information tied to a single person, it is revealing information tied to so few that we should consider this a privacy breach.

I have no doubt that researchers will be able to use the techniques of Narayanan and Shmatikov, together with databases revealing sex, zip code, and age, to tie many people directly to these supposedly anonymized new records.

Because of this, if it releases the data, Netflix might be breaking the law. The Video Privacy Protection Act (VPPA), 18 USC 2710 prohibits a “video tape service provider” (a broadly defined term) from revealing “personally identifiable information” about its customers. Aggrieved customers can sue providers under the VPPA and courts can order “not less than $2500” in damages for each violation. If somebody brings a class action lawsuit under this statute, Netflix might face millions of dollars in damages.

Additionally, the FTC might also decide to fine Netflix for violating its privacy policy as an unfair business practice.

Either a lawsuit under the VPPA or an FTC investigation would turn, in large part, on one sentence in Netflix’s privacy policy: “We may also disclose and otherwise use, on an anonymous basis, movie ratings, consumption habits, commentary, reviews and other non-personal information about customers.” If sued or investigated, Netflix will surely argue that its acts are immunized by the policy, because the data is disclosed “on an anonymous basis.” While this argument might have carried the day in 2006, before Narayanan and Shmatikov conducted their study, the argument is much weaker in 2009, now that Netflix has many reasons to know better, including in part, my paper and the publicity surrounding it. A weak argument is made even weaker if Netflix includes the kind of data–ZIP code, age, and gender–that we have known for over a decade fails to anonymize.

The good news is Netflix has time to avoid this multi-million dollar privacy blunder. As far as I can tell, the Netflix Prize 2 has not yet been launched.

Dear Netflix executives: Don’t do this to your customers, and don’t do this to your shareholders. Cancel the Netflix Prize 2, while you still have the chance.