An open letter to Netflix from the authors of the de-anonymization paper (33bits.org)
89 points by randomwalker on March 15, 2010 | 65 comments


I'm a little taken aback by the tone of some of the comments here, so I thought I'd offer a few points of clarification.

* For the longest time we mostly stuck to doing the math. We certainly didn't call for the contest to be cancelled, and we had nothing to do with the lawsuit. But when some people implied that we were responsible for the mess that ensued, we were kind of pulled into it. We posted this as a way of explaining our point of view and reaching out to see if there's a possibility of collaboration.

* The sadness we expressed is genuine. The reason we brought up Netflix's response to our paper wasn't "snark" or "gloating." Rather, we were pointing out that the cancellation of this contest was rather needless, because if they had acknowledged the privacy risks back when we published the paper, they would have had more than enough time to deploy an opt-in system for this contest. I think it is really unfortunate that that didn't happen.

* Someone wanted to know exactly what I thought of the "greater good" argument. Well, I'll tell ya. I'm vehemently opposed to it and I think it's a dangerous slippery slope. I think this point of view is enshrined in the ethos of this country -- "better let ten guilty men walk free than to convict one innocent man," etc. I don't think anyone has the moral authority to decide that the privacy concerns of a few can be sacrificed.

* There is a specific reason we chose the open letter format rather than communicating with Netflix directly. Actually, two reasons. First, there are many data privacy researchers who are at least as qualified as we are for this role. We wanted to make sure the community had the opportunity to participate in whatever ensues, rather than just us.

Second, I'm sure there are many companies other than Netflix who have a similar need for privacy-preserving data mining. If Netflix doesn't take us up, perhaps one of the others will. Bottom line, since there are multiple parties on both sides, and we don't really know who they are, we felt it was better to have this dialog in public.

* Finally, it is understandable when something like this happens to want to find someone to blame. But think twice before shooting the messenger.


> Rather, we were pointing out that the cancellation of this contest was rather needless, because if they had acknowledged the privacy risks back when we published the paper, they would have had more than enough time to deploy an opt-in system for this contest.

And get 0.1% participation.


I'm not actually sure it was ethical doing this research on a set of users who had no desire to be de-anonymized. While the data was out there, and anything you came up with on other data could certainly have been used to de-anonymize the netflix data, specifically connecting the two seems like a mistake.

In fact, in using this dataset you've kind of done exactly what you set out to stop - used people's data in ways they haven't consented to.


I am gonna make a note on only a small part of this, which is that 10 guilty men is different than 'an uncountable number of guilty men'. In other words, all principles have soft margins in practice.


Please. You directly enabled the lawsuit. At least have the integrity to acknowledge your part -- your pretense "Oh wow, a lawsuit just happened! But I had nothing to do with it!" -- is pretty stupid.

I'll bet money the most likely result of this lawsuit -- and your actions are a big piece of it -- is that Netflix, et al, will release no more datasets to the public. Instead, only researchers under NDA will be allowed to work with the data, as was basically the tradition before this.

Good job.


Please. You're acting like Netflix did nothing wrong in releasing the dataset when they knew it could be pretty easily de-anonymized, thus creating a privacy risk.


I'm saddened to see someone gloating at having helped to prevent the release of a dataset that I see as beneficial. Netflix offered an unprecedented corpus for research, and now someone is proud about helping the lawyers to lock it up. I think I just have a fundamentally different sense of privacy than the author. I think this comes out most clearly in their FAQ:

  Furthermore, even if the algorithm finds the "wrong" 
  record, with high probability this record is very similar 
  to the right record, so it still tells us a lot about the 
  person we are looking for. 
So the "violation of privacy" occurs even we don't actually reveal information about the individual, even if we only provide a framework for making predictions? So if I publish a study (with backing data) that says that 38 year old males are likely to commit adultery, I've "violated the privacy" of all 38 year old males?

Could someone who shares the author's worldview try to explain it? I've tried, but I just don't see it.


> Could someone who shares the author's worldview try to explain it?

Well, I can give you my worldview. Did you see the article "70% of HR managers turn down job candidate based on online reputation?" Here's the HN discussion: http://news.ycombinator.com/item?id=1192996

So. I rent a bunch of movies with strongly opinionated themes. I dunno, maybe every Michael Moore movie plus travel documentaries about Cuba. I also rent the entire Star Wars sextilogy and The Godfather Trilogy, which I review on IMDB.com. Some enterprising hacker ties my reviews on IMDB to my anonymized rental record. How many 47 year-old males are there in my neighborhood who watched those exact movies?

The hacker publishes his "findings," of course. Now I go looking for a job and someone in HR decides that my political views are too risky, so I don't even get an interview.

The authors' view is that Netflix can provide the benefits you desire without compromising the privacy I desire. I think the debate should be around whether the authors are correct.

If the authors are correct, then the problem isn't the authors preventing the release, the problem is Netflix failing to learn from their prior mistake.

If the authors are incorrect, we should simply point that out.


  Now I go looking for a job and someone in HR decides that 
  my political views are too risky, so I don't even get an 
  interview.
It's not ideal, but basing hiring decisions on real information (even if 'private') strikes me as better than using uninformed prejudice. In the example you give, is it better or worse to have the HR person say "Homoiconic? Sounds gay to me!"? I don't see much increased risk in offering a deeper pool of inherently unverifiable information, even if this information might be used badly.

  The author's views are that Netflix can provide the 
  benefits you desire without compromising the privacy I 
  desire. I think the debate should be around whether the 
  authors are correct.
This is a good debate, but for me it's a secondary one. Even if a privacy-safe solution is theoretically possible, I fear the only likely outcome will be that Netflix buries a 'release authorization' somewhere deep within their terms of service and is dissuaded from offering similar data sets in the future.

I also think that for my needs 'anonymize and release' is the only feasible solution. While online systems might be better than nothing, the approaches I'm interested in (kNN on GPUs, massively parallel cross-validations) really require local data and full control over data layout. But perhaps there is indeed a solution that works for everyone.


I'm going to try to paraphrase your concerns that the privacy fears are overblown and we're just losing something valuable. I hope you don't mind; please tell me if I got it wrong:

>> It's not ideal, but basing hiring decisions on real information (even if 'private') strikes me as better than using uninformed prejudice. In the example you give, is it better or worse to have the HR person say "Homoiconic? Sounds gay to me!"? I don't see much increased risk in offering a deeper pool of inherently unverifiable information, even if this information might be used badly.

Paraphrased: "They have poor information sources for their prejudice now, I don't see what harm better information sources driving their prejudice could do."

>> This is a good debate, but for me it's a secondary one. Even if a privacy-safe solution is theoretically possible, I fear the only likely outcome will be that Netflix buries a 'release authorization' somewhere deep within their terms of service and is dissuaded from offering similar data sets in the future.

Paraphrased: "The debate on privacy dangers is secondary, if Netflix refuses to release useful data in the future then that's the primary danger"

>> I also think that for my needs 'anonymize and release' is the only feasible solution. While online systems might be better than nothing, the approaches I'm interested in (kNN on GPU's, massively parallel cross-validations) really require local data and full control over data layout. But perhaps there is indeed a solution that works for everyone.

Paraphrased: "I was working on a project with the data, and the FTC ruling appears to have put a stop to that."


  They have poor information sources for their prejudice
  now
Yes to that part. We should base our choices on our current situation, not an idealized view.

  I don't see what harm better information sources 
  driving their prejudice could do.
That loses too much nuance. I can see that it could do harm, but think that increased information could also reduce prejudice. In the particular case of movie reviews, I see the added risk as low, and the potential benefit as small but significant.

  The debate on privacy dangers is secondary, if Netflix
  refuses to release useful data in the future then that's 
  the primary danger
No, I mean secondary in the sense that we need to sort out what the actual privacy dangers are before we try to come up with workarounds. If the outcome is that Netflix embeds some small print requiring all users to allow the release of 'suitably anonymized' rating data, has this changed anything? We still need to determine the actual risks. Perhaps the privacy advocates are so right about the dangers that even an opt-in system should be prevented.

  I was working on a project with the data, and the FTC
  ruling appears to have put a stop to that.
People already using the data set internally will likely be unimpacted. The Netflix terms of release presumably still apply, and prevent most commercial usages. But research papers that need to cite a publicly available data set will be adversely impacted, and the legal 'cloud of fear' will likely prevent the release of future data sets.


Thanks for responding. It helps; I was so tempted to argue with your points as I had misread them.

I feel strongly about the primacy of privacy because the problems of prejudice are so severe. But the transparent world approach has a lot of depth to it, and I'm nervous and excited to see where it leads us, since, like it or not, the technical and social changes we're seeing are creating so much data.

Here are the issues I think we have with this particular data release. I understand they may be resolvable though, and while I argue against accepting privacy problems like this right now, I would definitely love to find out that we can shift to a more transparent society without reservation and without the level of abuse that I fear:

-- Information asymmetry, since most people don't typically keep the kind of statistical skills, computing power, and secondary data sets for cross joins and comparisons around. Information asymmetry is a widely recognized concern in social behavior and commercial law (e.g., insider trading and real estate contracts)

-- Ethical concerns seem to constrain us to taking people's own privacy expectations into account. Originally privacy loss didn't seem to be a problem, but the situation changed when evidence emerged to the contrary

-- History shows us that governments frequently overstep their bounds. Rather than argue recent events which are more clouded by emotion and the narrative of the day, I think most people would agree that the FBI in the 60s was quite abusive in their gathering and use of data

That said, transparent world all the way. How do we get there?


"I'm going to try to paraphrase your concerns that the privacy fears are overblown and we're just losing something valuable, I hope you don't mind, please tell me if I got it wrong:"

"I'm going to just rewrite your argument" is not a valid form of debate. Please don't do that, ever.


I'm not so sure if you're correct. I was polite, and I wanted to know.

When you're on the Internet it's hard to know where people are coming from. I can see now how I misread him; I wish I hadn't, but I did. How else do I find out politely what the terms of a debate are in a public forum? Is it truly better not to attempt true comprehension of a person's words? Is it all just throw-away, a comment on an aggregation site?

And this was a tough thing to ask politely, simply because I had a presumption which was wrong. But I tried to do it right. I think the response I got back was pretty clear: the OP knows the fundamentals of the privacy debate, he/she has an opinion on the privacy issue itself, and the OP is not too offended by my best take on a gentle nudge at the bias question. A pretty safe guess from that is that he understands questions of bias as a safety net for intellectual thought rather than as an offensive gesture.

I didn't know all of that, and so I asked. That's all it takes.

It's hard to communicate well online. It's this new thing for the human mind, without 100,000 years of conditioning. Every day I log on to Hacker News for two things: it's shockingly educational in many disciplines, and I want to learn to communicate so I can better participate in society in a positive way. What a challenging place to do it. In the process I've found out that my communication needs a lot of work. I've been lurking since '88 and lurking didn't teach me to communicate, what gives ;-).

But I'm pretty sure I took a reasonable tack here. Listening comprehension, interest in knowing what people really mean in short form communication, and trying to find out why people say things that we disagree with seem to be critical needs in this complex medium.


Was this meant to be a joke?

When you rewrote the argument as "I'm going to just rewrite your argument", you ignored the same principle that you were proposing.

Apologies, if indeed it was meant to be a joke.


I voted him up because I'm taking him at his word that his goal is to paraphrase, and that I wasn't very clear in my initial writing. If both speakers have equal opportunity to respond, saying back in your own words what you think you are hearing is a great way to reach understanding. If nothing else, it made me reread what I wrote to see if I could state things more clearly.


The important point is that just renting the movies isn't enough to reveal your identity; you also have to review them (anonymously) on IMDB or some such site. So the solution is simple: if you rent a movie that you would object to having listed in, e.g., your public Facebook profile, just don't review it. Is this so hard?


That's incorrect. His public Star Wars reviews linked him to the Michael Moore rentals he wished to remain private.


You're right. How about no public reviews on IMDB then? Or creating a separate user? I rent many movies from Netflix that people would find "objectionable", and this is one reason I don't create any public reviews.


This would require me to figure out in advance what public data links me to what private-but-anonymized data. We know about the IMDB-Netflix connection because it was pointed out after the fact. What other links are there that I can't predict? What if I can be identified by which nights I stay home to watch a movie? Generalizing, the problem with anonymized data sets is that it is very, VERY hard to say with confidence that the data cannot be connected to a client.

That kind of data should not be released without consent.


Except the Netflix data didn't contain your age or your sex or your neighborhood.


The proposed second contest would have contained some demographic info. Hence the concern.


I imagine you can predict age, sex, and race from the selection of movies.


'Gloating' is so diametrically opposed to the view expressed in the article that I have no idea how to respond.

As for that part of the FAQ, it is intended as an explanation of some of the theorems proved in the paper and is a response to some of the theoretical objections we face from the data privacy community. It is not an issue that arises in practice.


This is the part that did you in:

Instead, you brushed off our claims, calling them “absolutely without merit,” among other things.

Now even if you are all emotionless, 100% objective researchers with no interest other than the greater good, this very sentence will make it impossible for humans to read your letter without inferring a certain level of snark.

If you want to project pure motives for the letter then it would have been best to leave their reaction to your original research out of it, or, probably more appropriately, don't publish an open letter at all—contact them directly.


Ah yes, Appeal to Motive, that'll do it, now I'm not going to accept anything they said because of my interpretation of their motives.

http://en.wikipedia.org/wiki/Appeal_to_motive


>now I'm not going to accept anything they said

I don't think anyone is refusing to accept anything he says. They're just disapproving of his motives, which is fine.


Open letters are about politics, not logic. What an individual critical thinker such as yourself believes has no bearing.


Open letters are about communicating a message, they are also by definition political, and it's really helpful to our social systems when they are based on logic.

Analysis of motives is of course valuable. But it's not an argument against the matter at hand. It may not have been clear that I was being facetious.


I think the parent is referring to an element of we-told-you-so in this letter. I think around the "instead, you brushed off our claims" part of the letter. It sounds a bit confrontational, though perhaps it's too late to rewrite.


It sure makes the writers sound like asses.


Ironically (and I think this is a legitimate use of the term) I hadn't realized when writing my comment that 'randomwalker' and 'Arvind Narayanan' were one and the same. But I apologize for my misinterpretation. I've read it again, though, and still have to fight the same reaction. Perhaps I'm just prejudiced against the 'open letter' format, as I often see it used in that way.

Still, if you can overlook my misreading, I would love your response. I understand that the example from the FAQ is not central to your argument, but I do think it's important in understanding your worldview. If I provide a tool that can be used, erroneously or not, to create or enable prejudice, at what point have I crossed the line into 'violation of privacy'?


I responded to most of the points in a new top-level comment. As for whether there is a violation of privacy even if the conclusions are erroneous, no, I don't think there is. Some of my colleagues argue that there is, but I don't agree with that point of view. In other words, in order to convince you that there is a privacy problem I have to convince you of the math in the paper. Merely the fact that we came up with an algorithm that may or may not be correct is not sufficient.

That FAQ question is a big red herring. Some of the objections to our paper during peer review were along the lines of "if Netflix had essentially duplicated every record in the database, how could you be sure you found the right record? What does 'right record' even mean?" No really, it was that silly. So it was meant to be a way to justify the fact that de-anonymization can happen even if you didn't find "the right record."


Instead, you brushed off our claims, calling them “absolutely without merit,” among other things. It has taken negative publicity and an FTC investigation to stop things from getting worse. Some may make the argument that even if the privacy of some of your customers is violated, the benefit to mankind outweighs it, but the “greater good” argument is a very dangerous one. And so here we are.

In other words, "we showed them!" -- seem an awful lot like gloating to me.

It also doesn't appear that you ever seriously considered the "greater good" argument, beyond asserting that it's "dangerous." This leaves me with the suspicion that you chose to take the easy way out instead of weighing the actual costs and benefits of what you were doing.


This leaves me with the suspicion that you chose to take the easy way out instead of weighing the actual costs and benefits of what you were doing.

You're shooting the messenger here. If a dataset can be de-anonymized, it's better to know that and be able to make informed decisions.


Well, I can say this: congrats on getting one of the best industrial data sets locked up. I hope you're proud of the work you did.

As for the possibility of netflix running a contest like this in an online fashion, well, maybe, but the benefits of having access to the data are enormous, plus you've now moved to a model where only the privileged few are allowed access via NDA, or Netflix has to provide computing resources to all researchers, etc. I don't see it happening.


If Netflix had attached credit card info and social security numbers to the info would you be singing the same tune? You're basically saying that you don't like the outcome due to your perceived utility of the data. Though I don't see you talking about:

  - Do you view this as a breach of privacy?
  - What do you consider private?
  - Do you view this as a breach of privacy, but just
    don't care?
  - Do you feel that the utility of the data out-weighs
    the privacy concerns?
  - What about the people that view this as an invasion
    of privacy and have their Netflix user data in that
    set? Should they be thrown under the bus in the pursuit
    of progress because *you* feel that the data has more
    utility than the privacy concerns do?
I see a lot of people arguing that this is 'stifling innovation,' but innovation is not an end unto itself. Banning the use of human test subjects against their will in the pursuit of scientific knowledge 'stifles innovation' too, but I think you would be hard-pressed to find many people who see that as a bad thing. "Stifling innovation" in the pursuit of privacy concerns should be a noble cause. It benefits the public. This is hardly the same as the argument against intellectual property rights, and I really find it annoying that people seem to be lumping it into the same ballpark with these boilerplate "stifling innovation" comments.


I see it as a breach of privacy people might have prevented had they known about the dangers of their reviews getting linked to their accounts. Many companies have their large credit card databases stolen or hacked into through sheer incompetence. Netflix is not in the same boat with these.


So... having private info stolen == bad company, releasing private info == good company.


And if pigs were ducks would they quack? In other words, your question is moronic since Netflix didn't attach CC or SSN info.

Also, you're a retard for treating movie predictions, and the possibility of matching a person to their movie viewing history, as even remotely comparable to human test subjects. Please.

As for invasion of privacy, I'm unsure -- I'm not sure of the probability of matching, the quantity of information necessary to get a good (for various values of good) match, etc. What is clear is that the authors had a major hand in getting a nontrivial, nonacademic dataset locked up and in damaging the community around it. They further have aided the lawyers suing Netflix, and have helped poison the well for any company in the future that decides they might want to do something like this. So I say congratulations! For the author to pretend this didn't happen as a result of his actions is disingenuous.

As for your questions, well, they're just stupid. We live in a world where the fbi/police get access to your PHYSICAL LOCATION 24x7 without a warrant just by asking, where your emails and telephone calls are scanned by the nsa with plans to open this data set to the police at large, where google/yahoo/et al see turning your emails and access patterns over to the police as a revenue opportunity, etc. If you care about privacy, this is such small potatoes as to be a waste of time. BTW, anyone can still spend roughly $100 to access your phone call history. G has, in subpoenable form, your entire search history -- and don't think that clearing cookies prevents stapling that history together.


> We live in a world where the fbi/police get access to your PHYSICAL LOCATION 24x7 without a warrant just by asking, where your emails and telephone calls are scanned by the nsa with plans to open this data set to the police at large, where google/yahoo/et al see turning your emails and access patterns over to the police as a revenue opportunity, etc.

So you're saying that since government agencies have access to a lot of my private information, I shouldn't care about any of my private information remaining private? Sounds like you're creating a false dichotomy. You're presenting things as if you can only care about all of your private data or none of your private data; since the government has access to large portions of your private data and you don't have much (or any) control over that, you should therefore care about none of your private data. Isn't it possible for me to care about all of my private data, but to choose the battles that I fight?


And your hill to die on is that someone might guess a movie someone in your household rented. Okay then.


Death of a thousand cuts


Or, are they opening the door for you to profit from selling access to a utility map reduce cluster focused on their data set?


> So if I publish a study (with backing data) that says that 38 year old males are likely to commit adultery, I've "violated the privacy" of all 38 year old males?

Yes. The difference between 'John Elks of 7 Arborview is a rapist' and 'there is a 0.5 correlation between living in Pleasantville and being a rapist' or '38 year old males commit 10% more rapes than average' etc. is solely one of degree and not kind.

Suppose I have a set of datapoints like that. And let's say each datapoint applies to only half the population. How many datapoints before I have broken your privacy and linked you to the furry porn you like to rent? Well, I'm guessing you're an American male. The US population is ~300 million, and roughly half of that is male, so 150 million. The first datapoint pins you down to within 75 million. The second, down to 37 million. The third, down to 19 million, the fourth, 9 million, the fifth 5 million, the sixth 2.3 million, the seventh 1.2 million, the eighth 600,000 (starting to feel nervous yet?), the ninth 300,000, the tenth 150,000, the eleventh 73,000, the twelfth 37,000, etc. until around the 27th specifies just 1 - you.

Now, tell me: Where in this slippery slope did it suddenly flip from not being a privacy violation of some degree, to being a privacy violation?

Was it at the 5th bit of information? Are you damaged at the 12th bit of information? Or did it take until the 26th or 27th bit of information before it magically flips from being good science to bad privacy violation?

Is it fine just so long as it might also be your neighbor down the street, even though most people would shun you based on far less than a 50-50 chance of things like being a child rapist? (An employer on the fence might regard a 10% chance of you being objectionable as being too much; that only requires, what, 18 bits of information?)

Predictions embody a great deal of information. That's how Bayesian statistics and statistics in general work, after all.
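
For the curious, here is a rough sketch of that halving argument in Python (my own illustration; it assumes each attribute independently splits the remaining pool exactly in half, which real attributes of course don't):

  from math import log2

  population = 150_000_000        # rough count of American males
  remaining = float(population)
  attributes = 0
  while remaining >= 2:
      remaining /= 2              # each 50/50 datapoint halves the candidate pool
      attributes += 1
      print(f"after datapoint {attributes}: ~{remaining:,.0f} candidates left")

  # log2 of the population is how many such datapoints it takes to single out one person
  print(f"log2({population:,}) = {log2(population):.1f} bits")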


The difference between poking someone and shooting them is also "one of degree rather than kind", but we choose to make a categorical distinction between the two cases nonetheless.


I'm very curious about what bits of information you think exist that so precisely bisect the population.


The Electronic Frontier Foundation does it to browsers pretty easily.

http://panopticlick.eff.org/

And according to another article by them, all you need is zip code, gender and birthday to identify someone with a high degree of certainty.

http://www.eff.org/deeplinks/2009/09/what-information-person...


Of course, zip code and birthday are a pretty huge amount of information. With the simplifying assumptions of roughly uniform distributions, knowing a birthday confers ~8.5 bits and knowing your zip code is worth ~15.3 bits. That's about 23.8 in total.

It's not surprising that those two pieces of information can pretty easily narrow down an identity.

  log2(3*10^8) = 28.2
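
A quick sanity check of those numbers (a sketch; the ~41,000 ZIP codes figure is approximate):

  from math import log2

  bits_birthday = log2(366)     # ~8.5 bits, assuming birthdays are roughly uniform
  bits_zip = log2(41_000)       # ~15.3 bits, assuming ~41,000 US ZIP codes
  bits_needed = log2(3e8)       # ~28.2 bits to single out one of ~300 million people

  print(bits_birthday + bits_zip)                  # ~23.8
  print(bits_needed - (bits_birthday + bits_zip))  # ~4.3 bits short, i.e. ~20 candidates
  # Adding gender (1 more bit) and coarse age narrows it to a handful of people.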


Birthday encodes a lot more than one bit of information. The above example was a series of factors that each reduced the identified population by half.


Alright; would you prefer me to redo it with pieces of information each worth 7 bits?

'The first piece cuts it down to 2.3 million people...'


I'm saddened to see someone gloating at having helped to prevent the release of a dataset that I see as beneficial.

I'd encourage people that usually skim just the comments to read the post, which was not gloating at locking up the dataset. The thrust of the post is about what a shame it is that the contest was canceled and how to make sure future contests can work.

[Note one: I have nothing to do with either side. Note two: I guess there is a gloating interpretation, with the paraphrase "you ignored us, but the FTC said we were right, nah nah nah" — but this isn't a useful or constructive way to continue the conversation.]


"you ignored us, but the FTC said we were right, nah nah nah"

That's how I read their open letter. And my interpretation of "you should have worked with us" is "you should have hired us as consultants."


Netflix has a lot of subscribers, but it doesn't have that many, and in particular, in your neighborhood, does it have that many who are 38 year old males who like action movies and Michael Moore? It's a question partly of scale. It's big, but not so big that we can't get dangerous results with demographics and statistical analysis.

It comes back to the contemporary problem of statistics: we (programmers included, as evidenced by many discussions of the sort) have a hard time getting it. We are only at the cusp of understanding what statistics can do with large data sets.

Now, we can either work with researchers and organizations to deal with the very real concerns, or we can simply refuse to believe that giraffes exist because we never saw an animal with such a long neck before. It's the basic problem of progress.


I looked at the methods used in the paper and it's clear that my definition of "privacy" varies greatly from the author's definition.

Essentially they are saying that if you know what rating someone gave 8 movies and the date that they gave those ratings, you can find a sample of their rating list (or something very similar to it) with 99% accuracy. So freaking what?

"Evidence" that flimsy wouldn't stand up in a local bar argument, much less a court of law. The records are still completely and totally anonymous. No names, no addresses, no way to identify anyone...nothing but a strong statistical correlation to a set of ratings in a database.

It sounds like they have a problem with the power of predictive modeling and not with the handling of anonymous data. Essentially what they are saying is "if we know a little about you, we can find out things that we didn't know with a very high degree of accuracy, but no certainty." Duh. That's what the whole Netflix Prize was about...using known data to make strong predictions about unknowns.

They had some interesting methods (especially in their similarity calculations) but this has nothing to do with privacy.
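
To make that concrete, here is a toy sketch of the kind of matching involved (my own simplification for illustration, not the paper's actual algorithm; the record format and weights are made up):

  from math import log

  # record: {movie_id: (rating, day)}; popularity: {movie_id: number of raters, >= 1}
  def similarity(aux, record, popularity, rating_tol=1, date_tol=14):
      """Score how well a few known (movie, rating, day) observations match
      one anonymized record; rare movies count for much more than blockbusters."""
      score = 0.0
      for movie, rating, day in aux:
          if movie in record:
              r, d = record[movie]
              if abs(r - rating) <= rating_tol and abs(d - day) <= date_tol:
                  score += 1.0 / log(1 + popularity[movie])
      return score

  # The attacker then takes the highest-scoring record, e.g.:
  # best = max(dataset, key=lambda rec: similarity(aux_info, rec, popularity))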


It seems to me that any data at all will necessarily reduce the entropy of the probability distribution of members' preferences, likes, dislikes, habits, and so on (i.e., their privacy). The authors seem to brush off the "greater good" argument here, but I don't understand how any large scale data release can happen without at least some reference to such an argument. Given that, the authors seem to be making a fairly strong claim: that no large scale "anonymised" data release should ever happen. Is that helping anyone in the context of movie viewing? And is it hurting anyone other than academic researchers, given that companies share more sensitive data anyway?


The authors mentioned two alternatives to the current form of large-scale release. First, opt-in. Second, contestants submit programs that run on the anonymized data but the contestants do not have access to the data itself. Could either of these approaches contribute to the greater good without compromising privacy?
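
For the second alternative, the rough shape (just a sketch of the idea as I understand it, not anything Netflix has actually proposed) would be that the data holder runs submitted code and only ever returns an aggregate score:

  def evaluate(submitted_train, private_ratings, private_heldout):
      """Run a contestant's training code inside the data holder's walls and
      return only an aggregate error, never the raw records."""
      predict = submitted_train(private_ratings)   # contestant's code, our data
      errors = [(predict(user, movie) - true_rating) ** 2
                for (user, movie), true_rating in private_heldout.items()]
      return (sum(errors) / len(errors)) ** 0.5    # RMSE, the Netflix Prize metric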


I do not have any data regarding opt-in, but my impression was that as a rule, no one opts in, and no one opts out of basically anything (excepting the notorious cases, e.g., Real Player). If it worked, and somehow gave an at least somewhat unbiased sampling of the data, opt-in would obviously be best.

I am completely unsatisfied with the submit-and-run model. Feature engineering seems to rely on knowing your data really intimately, and that does not seem possible in a submit-and-run model.


What happened with Real Player?


The Real Player installer, over the course of maybe five to ten years, was so pushy about installing extra, unwanted software and sending private data that it garnered a reputation that caused people to be really careful when installing it, if they installed it at all. It is not really a perfect example of people actually opting out, because I think one of the many criticisms was that it was often either not possible or extremely difficult to determine how to opt out of its features (sending titles of files being played, annoying message center and ad popups, packaged additional software). That said, check out the Wikipedia page for further details.


IIRC, in some countries census data is sometimes released with small, intentional errors to prevent the ability to locate specific individuals. Make a 36 year old sometimes a 37 year old or a 35 year old. Make a 180 cm person sometimes 182 cm or 178 cm. Small enough errors not to make the aggregate data invalid, but enough to make it hard to identify individuals from the data.

Perhaps this is a partial solution for the Netflix dataset.
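
In code, that kind of perturbation is just small random jitter on selected fields (a minimal sketch with hypothetical field names, not a proven anonymization scheme):

  import random

  JITTER = {"age": 1, "height_cm": 2}   # hypothetical fields and maximum error sizes

  def perturb(record, jitter=JITTER):
      """Copy a record, adding a small random error to selected numeric fields,
      so individuals are harder to pin down while aggregates stay roughly valid."""
      noisy = dict(record)
      for field, amount in jitter.items():
          if field in noisy:
              noisy[field] += random.randint(-amount, amount)
      return noisy

  print(perturb({"age": 36, "height_cm": 180}))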


This is the approach that Netflix took with the initial data. The paper referred to shows that this is insufficient, and does little to ease privacy concerns. The general problem is that if you 'fuzz' up the data enough to make identification impossible, it's no longer useful as a dataset.


The Census obfuscations have apparently screwed up a variety of research findings:

http://freakonomics.blogs.nytimes.com/2010/02/02/can-you-tru...


fyi: randomwalker is Arvind Narayanan.


P.S. - BTW, when do you expect to allow your linux-using, paying customers to watch your instant streaming movies online?!

(off topic rant, I know. goodbye karma!)



