Calculating 316 Million Movie Correlations in 2 Minutes (Down From 2.5 Hours) (dmnewbie.blogspot.com)
61 points by physcab on Jan 6, 2010 | 8 comments



Hi HN, I submitted this article because it has sat in my bookmarks for a while and I've referenced it quite a number of times over the past few months. I thought it might be helpful to some of you who work on similar problems.

I like it because when most people think of "giant datasets" they think they need some special tool to process them. In fact, there are such tools and I use them every day (namely Hadoop), but this article is a reminder that with the proper forethought and consideration for the data structures in question, it is quite possible to wrangle a large dataset on a single machine.

Also, people always wonder how to better optimize their code. I think this is one of the few examples I've seen where the author went through a series of steps to obtain the optimization they had in mind and documented their strategy well. It serves a practical purpose too.

If you want to try your hand at this problem, you can obtain the dataset here: http://archive.ics.uci.edu/ml/datasets/Netflix+Prize

and follow the forums here: http://www.netflixprize.com/community/


I hate to ask this, but I'm going to do it anyway.

Can someone explain to the rest of us (maybe just me) what the heck this article is about?

This article starts right out of the gate assuming the reader is already well informed about the context.


k-nearest neighbors (kNN) is a simple approach to prediction in machine learning. The objective is to predict the value of a function at a point for which we don't have an observation in the training set. In the Netflix Prize, this means predicting the rating a user U would give to some unrated movie M. The kNN approach is:

1. Identify the k users most similar to U. This set is called the "neighborhood".

2. Have these k neighbors vote on the rating that U should assign to M.
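
For concreteness, here's a minimal sketch of that two-step scheme in Python. The toy ratings data and all the names are my own illustration, not anything from the article:

    # Toy data: ratings[user][movie] = rating on a 1-5 scale.
    ratings = {
        "alice": {"A": 5, "B": 3},
        "bob":   {"A": 4, "B": 3, "M": 4},
        "carol": {"A": 1, "B": 5, "M": 2},
    }

    def predict(u, m, k, similarity):
        # Step 1: form the neighborhood -- the k users most similar
        # to u, among those who have actually rated m.
        neighbors = sorted(
            (v for v in ratings if v != u and m in ratings[v]),
            key=lambda v: similarity(u, v),
            reverse=True,
        )[:k]
        # Step 2: neighbors vote; each vote is weighted by similarity.
        num = sum(similarity(u, v) * ratings[v][m] for v in neighbors)
        den = sum(similarity(u, v) for v in neighbors)
        return num / den if den else None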

The premise behind the above scheme is that similar users will assign approximately the same rating to a particular movie. Actually implementing the kNN scheme requires a notion of "similarity" for step 1 and of "voting" for step 2. The linked article uses Pearson correlation as the similarity function (aka the "distance metric") and some kind of weighted average as the voting function (as mentioned in http://dmnewbie.blogspot.com/2007/09/greater-collaborative-f... )
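
As a sketch, a Pearson similarity over the co-rated movies might look like this, reusing the toy ratings dict from above (again illustrative, not the article's code):

    import math

    def pearson(u, v):
        # Correlate only over the movies both u and v have rated.
        common = set(ratings[u]) & set(ratings[v])
        if len(common) < 2:
            return 0.0
        xs = [ratings[u][m] for m in common]
        ys = [ratings[v][m] for m in common]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy) if sx and sy else 0.0

With those two pieces, predicting alice's rating for M is just predict("alice", "M", k=1, similarity=pearson), which returns bob's 4.0.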

I don't think this would work very well on the Netflix dataset because the training set is super sparse. I deliberately glossed over this above, but users in U's neighborhood who haven't rated the movie M are useless when voting on the value that U should assign to M! So you have to make a call about how you form the neighborhood: do you find the k nearest users and only average over the <= k of them who actually rated M? Or do you find the k nearest users who actually assigned a rating to M (ignoring U's neighbors who haven't rated M)? Either way, with a dataset as sparse as the Netflix data, you're going to have a hard time forming useful neighborhoods: either your neighborhoods will contain very little information to go on, or the "k most similar" users will actually not be very similar to U at all, leading to inaccurate predictions.
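
In code, the two neighborhood choices differ only in where you filter. A hypothetical variant of the predict sketch above shows the first option (the earlier sketch implements the second):

    def predict_topk_then_filter(u, m, k, similarity):
        # Take the k nearest users overall first, then keep only
        # those (possibly far fewer than k) who actually rated m.
        nearest = sorted(
            (v for v in ratings if v != u),
            key=lambda v: similarity(u, v),
            reverse=True,
        )[:k]
        voters = [v for v in nearest if m in ratings[v]]
        num = sum(similarity(u, v) * ratings[v][m] for v in voters)
        den = sum(similarity(u, v) for v in voters)
        return num / den if den else None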

More info: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm

Chapter two of the excellent and free book "The Elements of Statistical Learning" has a better exposition of this idea. http://www-stat.stanford.edu/~tibs/ElemStatLearn/download.ht...


(Not really on topic, but if I could nominate this as an example of the ideal HN comment, I would. It'd be nice to have a gallery of things like this attached to the guidelines. Thanks jey!)


Seconded. Excellent work, Jey.


This is my explanation as an armchair observer of the science of prediction. For a nice high-level overview of the Netflix Prize (the problem, the competitors, the evolving science of CF), see http://www.nytimes.com/2008/11/23/magazine/23Netflix-t.html?...

CF = Collaborative Filtering (e.g. "Customers who liked X also liked Y")

KNN = k-Nearest Neighbors, a very popular CF algorithm (the winners of the Netflix Prize used KNN as part of their program)

Pearson correlation = a measure of similarity between movies, and a necessary input to the KNN model: it determines which movies might be considered neighbors (i.e. similar) to each other. There were 17,770 movies in the Netflix Prize dataset.


I think the basic gist of the article is that originally he was comparing movie X to every other movie, one at a time, and for each such comparison he had to fetch all the users who had rated both movies in order to calculate the Pearson correlation. But later he realized it's much more efficient to cycle through the users who have rated movie X, check what other movies they have rated, and calculate the Pearson correlations from that data.
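
A sketch of that user-centric pass in Python (the data layout and all the names are my guesses for illustration, not the author's actual code):

    from collections import defaultdict

    def correlations_for(x, rated_by, user_ratings):
        # rated_by[m]     -> list of (user, rating) pairs for movie m
        # user_ratings[u] -> {movie: rating} for user u
        # One pass over x's raters accumulates, for every co-rated
        # movie m, the running sums a Pearson correlation needs.
        acc = defaultdict(lambda: [0, 0.0, 0.0, 0.0, 0.0, 0.0])
        for u, rx in rated_by[x]:
            for m, ry in user_ratings[u].items():
                if m == x:
                    continue
                s = acc[m]
                s[0] += 1          # n, number of common raters
                s[1] += rx         # sum of ratings for x
                s[2] += ry         # sum of ratings for m
                s[3] += rx * rx
                s[4] += ry * ry
                s[5] += rx * ry
        # Pearson for each co-rated movie, straight from the sums.
        corrs = {}
        for m, (n, sx, sy, sxx, syy, sxy) in acc.items():
            num = n * sxy - sx * sy
            den = ((n * sxx - sx * sx) * (n * syy - sy * sy)) ** 0.5
            if den > 0:
                corrs[m] = num / den
        return corrs

The win, as described above, is that movie pairs with no raters in common are never touched at all, whereas the movie-vs-movie loop paid a fetch for every pair.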


I can't touch jey's answer when it comes to the algorithm, but I think this link is helpful to establish the context: http://www.netflixprize.com/



