In my last post, I had promised to say more about my article on the limits of anonymization and the power of reidentification. Although I haven’t said anything for a few weeks, others have, and I especially appreciate posts by Susannah Fox, Seth Schoen, and Nate Anderson. Not only have these people summarized my article well, they have also added a lot of insightful commentary, and I commend these three posts to you.
Today brings news relating to one of the central examples in my paper: Netflix has announced plans to commit a privacy blunder that could cost it millions of dollars in fines and civil damages.
In my article, I focus on Netflix’s 2006 decision to release millions of records containing the movie rating preferences of “anonymized” users to the public, in order to fuel a crowd-sourcing competition called the Netflix Prize. The Netflix Prize has been a huge win for Netflix’s public relations, but it has also been a win for academics, who have used the data to improve the science of guessing human behavior from past preferences.
The Netflix Prize was also a watershed event for reidentification research because Arvind Narayanan and Vitaly Shmatikov of U. Texas revealed that they could reidentify some of the “anonymized” users with ease, proving that we are more uniquely tied to our movie rating preferences than intuition would suggest. In my paper, I argue that we should worry about this privacy breach even if we don’t think movie ratings are terribly sensitive, because it can be used to enable other, more terrifying privacy breaches.
I never argue, however, that Netflix deserves punishment or sanction for having released this data. In my opinion, Netflix acted pretty responsibly. It consulted with computer scientists in a (failed) attempt to anonymize successfully. It tried perturbing the data in order to make reidentification harder. And other experts seem to have been surprised by how easy it was for Narayanan and Shmatikov to reidentify. Even with the benefit of hindsight, I find nothing to blame in how Netflix handled the privacy implications of what it did.
Although I give Netflix a pass for its past privacy breach, I am astonished to learn from the New York Times that the company plans a second act:
The new contest is going to present the contestants with demographic and behavioral data, and they will be asked to model individuals’ “taste profiles,” the company said. The data set of more than 100 million entries will include information about renters’ ages, gender, ZIP codes, genre ratings and previously chosen movies. Unlike the first challenge, the contest will have no specific accuracy target. Instead, $500,000 will be awarded to the team in the lead after six months, and $500,000 to the leader after 18 months.
Netflix should cancel this new, irresponsible contest, which it has dubbed Netflix Prize 2. Researchers have known for more than a decade that gender plus ZIP code plus birthdate uniquely identifies a significant percentage of Americans (87%, according to Latanya Sweeney’s famous study). True, Netflix plans to release age, not birthdate, but simple arithmetic shows that for many people in the country, gender plus ZIP code plus age will narrow their private movie preferences down to at most a few hundred people. Netflix needs to understand the concept of “information entropy”: even if it is not revealing information tied to a single person, it is revealing information tied to so few that we should consider this a privacy breach.
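The "simple arithmetic" can be sketched in a few lines of Python. Every figure below is a rough assumption chosen for illustration (a U.S. population of about 300 million, about 42,000 ZIP codes, two gender values, and about 80 distinct ages), and the calculation pretends people are spread evenly across all combinations, which they are not. It is a back-of-the-envelope estimate, not an analysis of the actual Netflix data.

```python
import math

# All figures are rough assumptions for illustration, not Netflix or census data.
US_POPULATION = 300_000_000   # assumed population
ZIP_CODES = 42_000            # assumed number of ZIP codes
GENDERS = 2                   # assumed number of gender values in the data
AGE_VALUES = 80               # assumed number of distinct ages


def average_cell_size(population, *cardinalities):
    """Average number of people sharing one combination of attribute values,
    assuming (unrealistically) a uniform spread across all combinations."""
    return population / math.prod(cardinalities)


def bits(n):
    """Bits of information needed to single out one item among n."""
    return math.log2(n)


cell = average_cell_size(US_POPULATION, ZIP_CODES, GENDERS, AGE_VALUES)
bits_needed = bits(US_POPULATION)
bits_revealed = bits(ZIP_CODES) + bits(GENDERS) + bits(AGE_VALUES)

print(f"Average people per (ZIP, gender, age) cell: {cell:.1f}")
print(f"Bits needed to identify one person: {bits_needed:.1f}")
print(f"Bits revealed by ZIP + gender + age: {bits_revealed:.1f}")
print(f"Bits still missing: {bits_needed - bits_revealed:.1f}")
```

Under these assumptions the average cell holds roughly 45 people. Densely populated ZIP codes will have larger cells and sparse ones much smaller cells, so the real distribution is far from uniform.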
I have no doubt that researchers will be able to use the techniques of Narayanan and Shmatikov, together with databases revealing sex, zip code, and age, to tie many people directly to these supposedly anonymized new records.
Because of this, if it releases the data, Netflix might be breaking the law. The Video Privacy Protection Act (VPPA), 18 USC 2710, prohibits a “video tape service provider” (a broadly defined term) from revealing “personally identifiable information” about its customers. Aggrieved customers can sue providers under the VPPA and courts can order “not less than $2500” in damages for each violation. If somebody brings a class action lawsuit under this statute, Netflix might face millions of dollars in damages.
Additionally, the FTC might decide to fine Netflix for violating its privacy policy as an unfair business practice.
Either a lawsuit under the VPPA or an FTC investigation would turn, in large part, on one sentence in Netflix’s privacy policy: “We may also disclose and otherwise use, on an anonymous basis, movie ratings, consumption habits, commentary, reviews and other non-personal information about customers.” If sued or investigated, Netflix will surely argue that its acts are immunized by the policy, because the data is disclosed “on an anonymous basis.” While this argument might have carried the day in 2006, before Narayanan and Shmatikov conducted their study, the argument is much weaker in 2009, now that Netflix has many reasons to know better, including, in part, my paper and the publicity surrounding it. A weak argument is made even weaker if Netflix includes the kind of data (ZIP code, age, and gender) that we have known for over a decade fails to anonymize.
The good news is Netflix has time to avoid this multi-million dollar privacy blunder. As far as I can tell, the Netflix Prize 2 has not yet been launched.
Dear Netflix executives: Don’t do this to your customers, and don’t do this to your shareholders. Cancel the Netflix Prize 2, while you still have the chance.
and i’m not okay – as a netflix customer – having my privacy disregarded, no matter to what degree. i wholeheartedly appreciate this article and all of the comments made here aside from the ignorant ‘get a life’ ones. if there is scientific evidence that aggregating data can more often than not identify any given person at any given time, then netflix should be held legally responsible to protect the anonymity of their customers, either by offering an opt-out or reconfiguring their data. as an employee of a nonprofit, we are required to offer opt-out options for our online activities. netflix should be too. no one is above the law or our constitution.
Netflix, you don’t need to know when and where I was born, the color of my socks or what brand of aftershave I use….Just give me my movies…Please 🙂
Kind regards, Tom Fox
Here’s Netflix’s response:
Dear Mr. Douglas:
Thank you for your concern but rest assured, Netflix zealously guards your privacy. All the information we’re giving in The Netflix Prize 2 dataset is completely anonymous. It contains no personally identifiable information. It does not contain anyone’s name, address, or any means to connect a particular record with a specific Netflix member. As in Netflix Prize 1, the dataset contains some movie ratings from select anonymous members. It also includes some Queue adds and taste preferences, broad age ranges, gender and zip codes but, again, completely anonymous. But all that data is modified – our scientists call it perturbed – to make it anonymous. No one, no matter how sophisticated an engineer or analyst, will be able to link your name or any other Netflix member’s name to the data.
If you have any further questions, please don’t hesitate to contact our customer service department. Representatives can be reached at 866-716-0414 and are available 24 hours a day.
Sincerely,
Netflix
If Netflix really believed that, they would offer a secondary prize of $100K or so to anyone whose records were successfully de-anonymized.
What about Zip3 (i.e., using 123 instead of 12345)? IIRC, it’s suitable for HIPAA requirements.
Much thanks to Paul Ohm for an insightful post on the dangers to privacy inherent in the Netflix contests. There is a dramatically different approach that Netflix can take while running a contest of this type that would, I believe, greatly help to maintain privacy.
Never release the data set. All participants in the contest would submit their algorithms to Netflix and Netflix would run the programs on local machines securely controlled by the company. The results generated and released to contestants would be statistical summaries that specify the number of matches and mismatches. The precise nature of the matches and mismatches would never be released externally – just the aggregate numbers.
While running, the programs would have full access to the database to calculate k-nearest neighbors, train neural nets, perform inductive inference, or execute some other statistical methods for predictive modeling. However, only a very narrow pipeline would be used to release data.
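A minimal sketch of that "narrow pipeline," in Python, might look like the following. The record layout, the `predict` callback, and the function name `evaluate_submission` are all hypothetical, invented here for illustration. The harness runs a contestant's predictor against held-out ratings inside the company and returns only two aggregate counts.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical record layout: (user_id, movie_id, true_rating).
Record = Tuple[int, int, int]


def evaluate_submission(
    predict: Callable[[int, int], int],
    held_out: Iterable[Record],
) -> Dict[str, int]:
    """Run a contestant's predictor on held-out ratings and return only
    aggregate counts. Which records matched is never returned."""
    matches = 0
    mismatches = 0
    for user_id, movie_id, true_rating in held_out:
        try:
            guess = predict(user_id, movie_id)
        except Exception:
            guess = None  # a crashing predictor simply counts as a miss
        if guess == true_rating:
            matches += 1
        else:
            mismatches += 1
    return {"matches": matches, "mismatches": mismatches}


# Usage with toy data and a trivial predictor that always guesses 5.
toy_held_out = [(1, 10, 5), (1, 11, 3), (2, 10, 4)]
print(evaluate_submission(lambda user_id, movie_id: 5, toy_held_out))
```

This sketch only narrows the output channel. It does nothing to stop a submission that deliberately encodes data in its score across many runs, so a real deployment would also need to limit and audit submissions.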
I don’t think there is much to worry about regarding actual identities leaking out if Netflix creates a fictional set of ZIP codes and maps the actual ZIP codes to them. It will not reveal any true identity. It can do the same with age if it needs to.
The question of whether companies should use these algorithms or not is a different thing, but I am sure this competition can run without compromising data by scrambling it.
No. The frequency of a single ZIP code in the set is probably closely related to its population. Average age could help ID a ZIP. And that’s before you look at the rental history. Statistically high rental rates for videos featuring places or people of local interest would end the unanonymize-the-ZIP game.
One way to protect your privacy is to lie actively about these matters on the various forms on the web you fill out. This is something I generally do if the site insists on my disclosing something (such as a birthday) that it doesn’t need.
It’s probably even worse for Netflix, because their penetration is still geographically spotty, and ZIP codes are not uniformly distributed. In rural areas I would expect that simple cross-referencing of age and ZIP with broadband access would do nicely.
The easiest way to strengthen the anonymization, it seems, would be to cut the population into age swaths larger than a year.
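The frequency argument can be illustrated with a deliberately naive Python sketch. It ranks the fictional ZIP codes by how often they appear in the released data, ranks real ZIP codes by public population, and pairs them off in order. The function name `rank_match`, the placeholder ZIP labels, and every number are invented for illustration. A real attack would be noisier and would lean on the extra signals mentioned above, such as average age and rental history.

```python
from collections import Counter


def rank_match(pseudonymous_zips, public_population):
    """Guess which real ZIP each fictional ZIP stands for by pairing the
    most frequent fictional code with the most populous real ZIP, and so on.
    A naive heuristic, shown only to illustrate the frequency leak."""
    released_counts = Counter(pseudonymous_zips)
    fake_ranked = [code for code, _ in released_counts.most_common()]
    real_ranked = sorted(public_population, key=public_population.get, reverse=True)
    return dict(zip(fake_ranked, real_ranked))


# Toy data: fictional codes as they might appear in a release (invented).
released = ["A"] * 5 + ["B"] * 2 + ["C"] * 9
# Placeholder real ZIP labels with invented populations.
population = {"00001": 9000, "00002": 2000, "00003": 5000}

print(rank_match(released, population))
```

With this toy data the ordering alone pairs every fictional code with a real ZIP. Ties and uneven Netflix adoption would blur the match in practice, which is where the other signals come in.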
I hate the term “identity theft” and I hate that some companies try to charge you $15 a month to watch out for it.
No one can steal my “identity” (at least not with currently known science). It is intrinsically part of me. All one can do is steal my personal information.
What does this allow them to do? Trick banks and other people into giving them money, etc. They aren’t taking my money. I’m not liable. The bank is stupid for giving someone money just because they know my social security number, etc. The bank is liable. Therefore, the bank should be paying $15 or whatnot a month to protect against this theft.
Just because someone knows my personal information doesn’t mean they can enter me into legally binding contracts or withdraw money from my bank account. Of course explaining this to the bank or other institution is a giant pain in the arse.
Identity theft does exist: ending up in jail
http://www.givemebackmycredit.com/blog/2009/08/identity-theft-sends-innocent.html
So if identity theft doesn’t exist, then why do you come to that conclusion by the end of your post?
“Of course explaining this to the bank or other institution is a giant pain in the arse.”
Clearly, that’s identity theft. You have to explain to others what your identity is. Why? Because it was stolen and used by somebody else. Stolen is synonymous with theft. You get the picture?
“Just because someone knows my personal information doesn’t mean they can enter me into legally binding contracts or withdraw money from my bank account.”
Clearly you haven’t kept up in the news…here’s something for starters (http://news.cnet.com/2010-1071-958328.html).
I was probably being over the top in saying there is no such thing as identity theft.
But I don’t think you fully understand me. I’m not a fan of a credit card company charging some extra fee to watch out for identity theft, because it is already the credit card company’s duty to make sure some fraudster isn’t using your card (and as it’s always been, you aren’t responsible). So I guess what you are paying for is to prevent the inconvenience of having to deal with identity theft.
However, I still stand by that someone else (without my permission or that of the law) cannot enter me into legally binding contracts and someone else cannot withdraw money from my bank account.
Maybe a company thinks you entered into that contract, but if it wasn’t you that actually signed it (or pushed the button), you are not legally bound. If someone takes money from your bank account, they really aren’t. The bank is responsible (unless there was some type of negligence on your part) and not you. They didn’t really take money from your bank account. They took it from the profits of the bank. And that is why the bank/credit card company/etc should be paying to safeguard your information. Of course you should be careful and not be willy-nilly giving out your information, but that’s obvious.
A lot of people don’t realize this though, and so they get scared into buying rip-off ID theft protection services. Sometimes they might have their uses, but not always. There’s always the possibility that the bank/credit card company doesn’t have money to refund you, etc, but that’s probably not much of a risk when dealing with reputable companies.
Agree with “Broader Implications”, and add this:
What you do to anonymize your data today may not matter in the long run, since–if you have been using the Web for more than a few days–there is enough info sprinkled around about you to link you to any new, ‘anonymous’ profile that you create. Some examples: IP address, cookies, browser ID string, X-Up-Calling-Line ID for mobile browsers, email address, unique usernames, unique passwords. It is difficult to use the web if you attempt to obfuscate all these things. Some of this data is not valuable enough or concentrated enough to mine *today*, but this data is *not* being thrown away (there is always somebody in the disposal chain who will set it aside for a rainy day) and will someday be mined expressly for finding those individuals (for marketing or other purposes) that have attempted to go off the identity grid.
In short, if you’re concerned about privacy today, you need to make a lifestyle choice and get seriously paranoid about it. Or, you can just start acting as if everything you do electronically is going to be public someday. (footnote: if Netflix rented real porn, there would be no Netflix Prize)
“someday be mined expressly for finding those individuals (for marketing or other purposes) that have attempted to go off the identity grid.”
I can just see it now: in the future, if you go “off the grid” you start getting ads for anonymization tools, RFID detectors and zappers, Conspiracy Magazine, and Black Helicopters Weekly. 🙂
I’ve done psych work before and this was their mistake: they should have never released the preference data as individual sets. The only way to properly release anonymized data is as collections of statistics that group multiple users together. This does make it a little harder to work with the data but depending on what the scientists and prize awarders needed, Netflix could build the data groups based on those statistics.
While grouped with other users, individuals’ names could not be reidentified. This is standard practice in psychology and sociology research, so it is a shame it was not done here. Maybe movie preferences are not considered protected information, but as the article says, what will be released next?
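The grouped-release idea can be sketched roughly as follows, in Python. The record layout (one row per user, carrying an age band, a gender, and a rating), the function name `grouped_release`, and the minimum group size `k` are all assumptions made for illustration. Groups smaller than `k` are withheld entirely, and only a count and a mean are published for the rest.

```python
from collections import defaultdict
from statistics import mean


def grouped_release(records, k=5):
    """Publish only per-group statistics, suppressing small groups.

    records: iterable of (age_band, gender, rating), one row per user
             (a hypothetical layout).
    k:       minimum group size; smaller groups are withheld.
    """
    groups = defaultdict(list)
    for age_band, gender, rating in records:
        groups[(age_band, gender)].append(rating)
    return {
        key: {"users": len(ratings), "mean_rating": round(mean(ratings), 2)}
        for key, ratings in groups.items()
        if len(ratings) >= k
    }


# Toy data: the single-user group is suppressed when k=3.
toy = [("30-39", "F", 4), ("30-39", "F", 5), ("30-39", "F", 3), ("30-39", "M", 5)]
print(grouped_release(toy, k=3))
```

The right value of `k` and the right grouping attributes depend on what the researchers need. A size threshold like this lowers the risk but is not by itself a guarantee of anonymity.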
The contest(s) simply contradict the spirit of anonymous patronage. First this, then what?
I’ve cancelled my membership, though I’m sure I’ll be rendered.
Interestingly, in their list of reasons for cancellation, they don’t list “Disgusted with privacy policy” as one of the reasons for terminating membership.
If you match on several of the criteria but it isn’t you, you might have trouble denying that it is to interested parties (like spouses or employers). That is assuming you are even given the chance to defend yourself – you might end up judged by them and condemned behind your back.
The article also fails to mention a final, gaping problem to all of this creeping consolidation, and that is the ease with which every single online user will have their traits “identified” for the inevitable use of financially and socially unrepresentative purposes. If we think identity theft is bad now, wait until we stop using online payment systems altogether because everyone knows all our identifying traits needed to authenticate as us.
Social understanding is important, because any time you delve into algorithmic best-guess modeling as a means of averaging possibility, you’ve missed the boat entirely on the motivation behind past purchase behavior, which is infinitely more indicative of future preference. In other words, all this talk of predictive modeling is smoke and mirrors. The best way to market to you is to have you tracked via multiple sources using consolidating identifiers to build your actual profile, and re-market to you what you’ve already proven you’ll buy. Any talk of anonymous obfuscation of your data is strictly for the purpose of partial data owners not getting sued as they swap meet their lists around and figure out how to stitch customer databases together before their competition does it first. The customer is not respected nor their security to make future transactions protected.
In the current economy, forked companies all feed off their list rentals once their other IP grows stale because we haven’t yet found a universal identifier that won’t completely obliterate the checks and balances currently at work in an anonymous society.
This article has successfully convinced me to stop being a customer at any service requiring more than a credit card number and at most name/address to verify I have the right to use those funds.
Has anyone ever wondered why the default “hint” questions on most services include vital record data like “mother’s maiden name” “place of birth” “high school attended” etc?
I consider those “legitimizing” tags if your future business intent is indeed to sell your customer data to a mega-bank big brother database. These answers would conveniently be the perfect information needed to consolidate changing customer data from various sources, because they refer to early, non-changing characteristics.
This has always creeped me out, and now it just creeps me out to the point where when I sign up to a mailing list, I include as much false data as I can get away with.
I encourage others to include as much false data as they can as well, stopping short of actually creating false aliases.
I resent the many web sites that require me to “register” in order to buy a product. Usually they send a “WELCOME” email and I reply with a savage broadside that I didn’t choose to join, I was forced to join in order to purchase. I also suggest they consider a GUEST purchase option that does not demand registration and promises to delete at least CC data once the purchase is processed.
As for non-financial sites, e.g., subscribing to the free NY Times web site, I provide totally fictitious information: wrong birthday, wrong name, male vs female, etc. I do use the same password on all such non-financial sites, a combination of four-letter words, so I can re-enter any such site with just a dedicated email and not have to look up anything.
RCG
My biggest concern is one piece of information that we don’t know, and that it hasn’t seemed to occur to journalists to ask. Netflix customers, of which I am one, have the option of viewing movies and not rating them. Clearly, that reduces the benefit of any recommendations Netflix may make, but in theory, it would keep the customer’s preferences private, by the customer’s choice. Unless Netflix includes the entire viewing history in the data. The mere fact that I have rented a particular movie implies strongly that I had an interest in viewing it.
Long ago, I removed my birth year from my account profile because I didn’t feel they needed to know it. I would encourage other members to do the same.
Ohm’s paper on reidentification suggests that maintaining any amount of usefulness in the data set necessitates the compromising of the identities of the subjects in that data set, sooner or later. The key to making informed policy decisions is to decide when the utility gained from the research outweighs the risk to subjects’ identities; I would agree that in this case, the benefits do not outweigh the risks.
Doesn’t this mean that many who have casually tweeted, “Just got movie X from netflix” a few times will be identifiable in the data set? (given zip, age, etc)
Yes. About the first Netflix Prize, I say, “The next time your dinner party host asks you to list your six favorite obscure movies, unless you want everybody at the table to know every movie you have ever rated on Netflix, say nothing at all.”
And as you point out, who needs dinner parties any more? Just read Facebook or Twitter and find a fairly complete list of movies watched. Acquisti and Gross call this the “age of self-revelation.”
You guys really need to get a life. First of all, even if I did ask my dinner guests for their favorite old movies, I wouldn’t remember the answer for more than an hour. More importantly, if you think I care what else they watch – enough to remember what they said and start investigating them online – you guys don’t know the meaning of real work. I have a real job — the type that makes me want to relax during down time, not use Google.
No one who doesn’t wear a pocket protector gives a rat’s behind about this. I’m only here because I saw a link on a blog and I couldn’t believe people were actually wasting time on this.
Nuff said…I can’t get these 5 minutes back. Ugh.
Netflix doesn’t have its subscribers’ birthdates, just their birth years. I just went into my account and checked.
Netflix shouldn’t cancel their Prize 2 competition; they should just strengthen the anonymization. For instance, do they really need to release the exact ZIP code, as opposed to geographic location perturbed by, say, +/- 20 miles or so, or noisified in some other way?
That won’t cut it. According to Sweeney, 53% of people in America are uniquely identified by {city, birth date, sex} and 18% by {county, birthdate, sex}.
The idea of “strengthening the anonymization” is a losing battle, especially when you are talking about public releases. People are much more uniquely attached to their attributes than our intuitions would lead us to believe.
Obviously, there is a spectrum in how much is released, from “nothing” to “everything” (as the two extremes). In between, there are shades of grey. There’s got to be an amount that can be safely released that is more than “nothing” but that doesn’t uniquely identify people and doesn’t endanger their privacy.
For instance, most folks are willing to accept the level of anonymization performed by the Census Bureau. So how about asking Netflix to provide a comparable level of anonymization?