Tuesday, October 23, 2007

Wisdom of crowds and evolution

Last week, I read one of the best articles I've ever seen posted on Slashdot. While the subtext discusses how likely religious fundamentalists are to "believe in" things like Wikipedia, the examples (Wikipedia, prediction markets, and recommendation systems) nicely illustrate the whole "Wisdom of Crowds" concept if you're confused about it.

For me, the most interesting example he gave was the recommendation system for Netflix. Netflix has been offering a prize to anyone who can improve the accuracy of their existing collaborative filtering system by 10 percent. So for the contest, Netflix released 500K users' ratings, covering 18K movies. Just the ratings each user gave to a given film title, that's it. G'luck!

The first thing people want is more data. Genres, actors, directors, box-office money...all this stuff is important, right? That's definitely how I'd solve the problem on a math test - quantify all known variables and look for patterns.

Of course, this is completely wrong. WRONG WRONG WRONG. You don't need any of that crap. That data is fool's gold, man!

The solution lies in putting people in certain "neighborhoods" - a genre like "comedy" or even "slapstick comedy" is too vague. A neighborhood is similar to a genre, but with less exacting borders, since most everything is really a cross of thousands of different styles. I love some slapstick like The Blues Brothers and Police Squad!, but I'm not a huge fan of Monty Python (which, according to IMDb, is their highest-rated slapstick comedy). So the neighborhood I am closest to is made up of whatever else the people who enjoyed The Blues Brothers and Police Squad! enjoyed, with the subset of Monty Python-lovers probably being furthest from me.
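To make the neighborhood idea a bit more concrete, here's a minimal sketch in Python. It is my own toy illustration, not anything from the article or from Netflix: the users, the star ratings, the `MIDPOINT` constant, and the helper names (`similarity`, `nearest`, `recommend`) are all made up. It just measures how closely two people agree on the titles they both rated, using nothing but ratings.

```python
from math import sqrt

# Hypothetical 1-5 star ratings, invented for illustration only.
ratings = {
    "me": {"The Blues Brothers": 5, "Police Squad!": 5,
           "Monty Python and the Holy Grail": 2},
    "alice": {"The Blues Brothers": 5, "Police Squad!": 4, "Airplane!": 5},
    "bob": {"Monty Python and the Holy Grail": 5, "The Blues Brothers": 2,
            "Brazil": 5},
}

MIDPOINT = 3  # assumed neutral rating on a 1-5 scale


def similarity(a, b):
    """Cosine similarity of midpoint-centered ratings over shared titles."""
    shared = set(a) & set(b)
    dot = sum((a[t] - MIDPOINT) * (b[t] - MIDPOINT) for t in shared)
    norm_a = sqrt(sum((a[t] - MIDPOINT) ** 2 for t in shared))
    norm_b = sqrt(sum((b[t] - MIDPOINT) ** 2 for t in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no overlap, or only neutral ratings in common
    return dot / (norm_a * norm_b)


def nearest(user):
    """Return the other user whose tastes agree most with `user`."""
    others = [u for u in ratings if u != user]
    return max(others, key=lambda u: similarity(ratings[user], ratings[u]))


def recommend(user):
    """Titles the nearest neighbor rated 4+ that `user` has not rated."""
    neighbor = ratings[nearest(user)]
    return sorted(t for t, r in neighbor.items()
                  if r >= 4 and t not in ratings[user])


print(nearest("me"))    # alice
print(recommend("me"))  # ['Airplane!']
```

Notice that nothing here knows what a "comedy" is. The Monty Python fan ends up far away from me purely because we disagree on the films we both rated.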

Since the author puts it so much better than I can, I'll quote here:
If a movie is near a user -- in the same neighborhood, so to speak -- it can be predicted that that user will probably like that movie, even if the user did not specifically rate it. Movies that are universally liked tended to move toward the center of the model ("Shawshank Redemption" being closest to center), disliked movies moved toward the outside. In practice, I found that giving 12 or so dimensions, rather than just 3, worked a lot better, allowing a much richer categorization, and allowing each neighborhood to be adjacent to a great many other neighborhoods. There are several other layers of complexity in order to get the best results, but the gist of the approach is just as simple as described.

The point, of course, is that this system is very evolution-like, in that lots of messy data, with very little apparent "intelligence," processed by a simple iterative algorithm, can find sophisticated equilibria with a great deal of precision. Looking directly at the raw data, such as at an individual user's set of ratings, would indicate a lot more slop than is apparent in the final model. The system doesn't "know" that a movie is a science fiction movie, any more than natural selection "knows" why a particular mutation in the DNA increases the chance of an animal surviving to adulthood. Nonetheless, it works, against all intuition.
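For the curious, here's a rough Python sketch of the "simple iterative algorithm on messy ratings" flavor of thing. To be clear, this is not the author's model (he doesn't give the details, and mentions extra layers of complexity). I'm swapping in a plain latent-factor model (matrix factorization trained by stochastic gradient descent), which is a related, well-known technique. The toy ratings, the function names, and the parameter defaults (`dims`, `epochs`, `lr`, `reg`) are all my own assumptions.

```python
import random

# Hypothetical (user, movie, stars) triples; not Netflix Prize data.
RATINGS = [
    ("ann", "Airplane!", 5), ("ann", "The Blues Brothers", 5), ("ann", "Brazil", 2),
    ("ben", "Airplane!", 4), ("ben", "The Blues Brothers", 5), ("ben", "Alien", 1),
    ("cat", "Brazil", 5), ("cat", "Alien", 4), ("cat", "Airplane!", 2),
]


def train(ratings, dims=3, epochs=200, lr=0.05, reg=0.02, seed=0):
    """Fit rating ~ mean + dot(user_vector, movie_vector) by SGD."""
    rng = random.Random(seed)
    mean = sum(r for _, _, r in ratings) / len(ratings)
    # Every user and movie starts at a small random point in `dims` dimensions.
    users = {u: [rng.uniform(-0.1, 0.1) for _ in range(dims)]
             for u, _, _ in ratings}
    movies = {m: [rng.uniform(-0.1, 0.1) for _ in range(dims)]
              for _, m, _ in ratings}
    for _ in range(epochs):
        for u, m, r in ratings:
            p, q = users[u], movies[m]
            err = r - (mean + sum(a * b for a, b in zip(p, q)))
            for k in range(dims):
                pk, qk = p[k], q[k]
                # Nudge both vectors to shrink the error; `reg` limits overfitting.
                p[k] += lr * (err * qk - reg * pk)
                q[k] += lr * (err * pk - reg * qk)
    return mean, users, movies


def predict(model, user, movie):
    """Predicted star rating for a (user, movie) pair seen in training."""
    mean, users, movies = model
    return mean + sum(a * b for a, b in zip(users[user], movies[movie]))


def rmse(model, ratings):
    """Root-mean-square error of the model on the given ratings."""
    total = sum((r - predict(model, u, m)) ** 2 for u, m, r in ratings)
    return (total / len(ratings)) ** 0.5


model = train(RATINGS)
print(round(rmse(model, RATINGS), 3))               # error on the toy ratings
print(round(predict(model, "ben", "Brazil"), 2))    # a movie ben never rated
```

The code never sees a genre, an actor, or a director. Passing `dims=12` gives the richer space the author says worked better for him, though on nine toy ratings that would be serious overkill.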


Anyway, go on and read the end of it...it's not too long, and I think it's quite interesting.
