January 7, 2008

Netflix Contest has Produced Prizes for the Analytics Community

In June 2007, we posted about the "Netflix Prize," a contest promoted by the analytics-savvy movie-rental house Netflix.

The goal: improve the accuracy of the existing Cinematch movie recommendation system.

The prize: $1 million

Fifteen months along, and no model has come forward meeting the victory threshold of a 10% improvement in matching accuracy. Fortunately, for everyone who doesn't work at Netflix, the contest has still produced something of value.

The discussions and attempts that have grown out of this contest have given those of us interested in analytics new perspectives and questions to ponder as we seek to quantify and predict preference and behavior.

This article discusses some of the most interesting insights thus far:

"Open Questions" (text mining) has emerged as a theme to "fine-tune" the specificity of predictive models. Allowing individuals an opportunity to express, instead of forcing them to conform entirely to a pre-defined format, is really emerging as a more nuanced and "high-touch" approach. As I have posted previously, there is software emerging that is making great strides towards allowing text mining to be a pragmatic tool. Discriminate choice models of "ultimate" giving destination preference (athletics, fine arts, brick and mortar) for example, could be greatly enhanced by appropriately applied text mining.

Another model suggested that information about tastes related to genre, language, actors, directors, etc., was surprisingly weak compared to the star rating of the movie itself. Perhaps this suggests that second-tier "affiliation" data (I love Tom Hanks, or in the fundraising field, I was a Sociology major) may be more ambiguous than standard industry assumptions allow. At minimum, this finding suggests that more consideration should be given to the importance of the top preference metric (for movies, it's the star rating; for fundraising, it's giving to the institution).
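A tiny, made-up example helps show why an attribute like genre can carry so much less signal than the title's own rating. The numbers below are invented; the point is only that two movies sharing a genre can sit at opposite ends of the rating scale, so the genre average predicts poorly while each title's own average predicts well.

# Toy comparison: predictive power of a movie's own average rating
# versus a "second tier" attribute (genre). All ratings are invented.
import numpy as np

# (genre, observed star ratings) keyed by movie
movies = {
    "A": ("comedy", [5, 5, 4, 5]),
    "B": ("comedy", [2, 1, 2, 2]),
    "C": ("drama",  [4, 3, 4, 4]),
    "D": ("drama",  [3, 4, 3, 3]),
}

def rmse(predict):
    errors = [(r - predict(m, g)) ** 2
              for m, (g, ratings) in movies.items() for r in ratings]
    return np.sqrt(np.mean(errors))

movie_mean = {m: np.mean(r) for m, (g, r) in movies.items()}
genre_ratings = {}
for m, (g, r) in movies.items():
    genre_ratings.setdefault(g, []).extend(r)
genre_mean = {g: np.mean(r) for g, r in genre_ratings.items()}

print("RMSE, movie's own average:", rmse(lambda m, g: movie_mean[m]))
print("RMSE, genre average:      ", rmse(lambda m, g: genre_mean[g]))

Here the genre average misses by more than a full star on average, while each title's own average misses by less than half a star.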

The $1,000,000 Netflix Prize competition has produced interesting results, even if no winner, 15 months in. Some of those results are a bit surprising; others we should have expected but didn't anticipate. So while participants haven't yet bettered the accuracy of Netflix's Cinematch recommendation algorithm by 10%, the threshold to win the $1 million prize, we can still take away lessons about predictive-analytics fundamentals.

Read More


September 23, 2009

Netflix prize awarded, a new challenge is made

Josh and I have both followed the Netflix challenge, an open-source-style competition to beat the company's movie-matching algorithm, with a good deal of interest.

I hope predictive analytics can see this kind of collaborative effort in other disciplines as well, allowing all of us to benefit from the insights and successes.

Note that Netflix has announced a new challenge: predicting movie selection based purely on bio-demographic and geographic data. This should be very interesting.

A $1 Million Research Bargain for Netflix, and Maybe a Model for Others

Even the near-miss losers in the Netflix million-dollar-prize competition seemed to have few regrets.

Netflix, the movie rental company, announced on Monday that a seven-man team was the winner of its closely watched three-year contest to improve its Web site’s movie recommendation system. That was expected, but the surprise was in the nail-biter finish.

Read more


January 6, 2009

The "Naploeon Dynamite Problem"

In pondering my return to active posting on this blog, I came back to this article from late November concerning the Netflix challenge. Josh wrote a bit about this competition some months back. Basically, Netflix has created an "open source competition" to see if someone can improve upon the accuracy of its movie-matching algorithm. When you select one title, Netflix suggests others, and the company wants to increase the likelihood that you will enjoy those recommendations based on your pre-existing selections and tastes.

The competition has become an intense "hobby" for many interested in data mining and analytics (Josh downloaded the data set to work on it as well), and the sharing of results has surfaced an issue contestants are calling the "Napoleon Dynamite Problem." Basically, Napoleon Dynamite is a movie that almost everyone who reviews it either loves or hates, and while those ratings carry a strong signal, there is little discernible pattern in who will love the movie and who will hate it. One of the strongest predictors in the data set displays an almost random distribution. In other words, this powerful predictor appears to be an outlier.

How should a contestant proceed? As a very popular movie that elicits strong responses (love or hate, not just like or dislike), Napoleon Dynamite is a significant point in the Netflix data landscape. However, the lack of pattern among those with similar ratings has rendered contestants' models fuzzy, or worse.
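A quick sketch shows why a polarizing title is so painful for a recommender. The two rating vectors below are invented; the point is that for a love-it-or-hate-it movie, even the best constant guess misses everyone badly.

# Minimal sketch: a consensus title versus a polarizing one.
# Both rating vectors are invented for illustration.
import numpy as np

consensus = np.array([4, 4, 5, 4, 4, 5, 4, 4])    # most viewers roughly agree
polarizing = np.array([1, 5, 5, 1, 1, 5, 1, 5])   # love it or hate it

for name, ratings in [("consensus", consensus), ("polarizing", polarizing)]:
    guess = ratings.mean()                         # best constant prediction
    rmse = np.sqrt(((ratings - guess) ** 2).mean())
    print(f"{name}: mean={guess:.2f}, RMSE of mean prediction={rmse:.2f}")

The mean prediction misses the polarizing title's viewers by about two full stars, versus less than half a star for the consensus title; until a model can say which camp a given subscriber falls into, that error is baked in.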

This brought me back to issues I encounter almost daily in my own analytics work: how to deal with outliers. Whether it is building a predictive model or creating simple algorithmic projections of future giving, there always seems to be a dialogue between my clients and me regarding what should be included or excluded.

Consider Example 1:

Total Giving
FY04 $14,000,000
FY05 $16,000,000
FY06 $15,500,000
FY07 $15,800,000
FY08 $26,500,000


This demonstrates a common issue in fundraising: how do you account for large gifts in projections (the dramatic increase in FY08)? If this was a realized planned gift, or possibly even a major gift, some would argue to exclude it so it does not erroneously inflate future projections. But the gift was made, right? Is FY08 giving sustainable? How accurate can projections of future giving be if you exclude historical, realized giving?
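To see how much that one year matters, here is a minimal sketch using the figures above. The "projection" is just a least-squares trend line, one of many ways to do this, and is not meant as a recommended method.

# Minimal sketch: how the FY08 spike swings a simple trend projection.
# Figures are taken from the example above (in $ millions).
import numpy as np

years = np.array([2004, 2005, 2006, 2007, 2008])
giving = np.array([14.0, 16.0, 15.5, 15.8, 26.5])

def project(yrs, vals, target_year):
    slope, intercept = np.polyfit(yrs, vals, 1)    # simple linear trend
    return slope * target_year + intercept

print(f"FY09 projection including FY08: ${project(years, giving, 2009):.1f}M")
print(f"FY09 projection excluding FY08: ${project(years[:-1], giving[:-1], 2009):.1f}M")

Even with this toy trend line, including FY08 pulls the FY09 projection up by roughly $8 million, which is exactly the judgment call described above.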

For Example 2, let's consider building a predictive model, where you may run into issues with deceased records, especially at relatively "younger" institutions. You can build a model on living records (they are the only constituents who can still give major gifts!), but what if half or more of the major gifts at an institution came from records flagged as deceased? Is it necessary to lose roughly 50% of your sample? Is your model inaccurately skewed for not considering donors, many of whom have a data-rich profile, who made major gifts while they were alive but have since passed? Does including deceased records produce "generational" predictive phenomena with only minor relevance to today's living donor pool?
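As a rough way to frame that trade-off, the sketch below compares the size and major-gift coverage of a modeling sample with and without deceased records. The DataFrame and field names are hypothetical.

# Minimal sketch: how much sample and how many major gifts are lost
# when deceased records are excluded. Data and fields are hypothetical.
import pandas as pd

constituents = pd.DataFrame({
    "id":         [1, 2, 3, 4, 5, 6],
    "deceased":   [False, False, True, True, False, True],
    "major_gift": [True, False, True, True, False, False],
})

living_only = constituents[~constituents["deceased"]]

for label, frame in [("all records", constituents), ("living only", living_only)]:
    print(f"{label}: {len(frame)} records, {frame['major_gift'].sum()} major gifts")

In this made-up case the living-only sample keeps just one of three major gifts, which is the kind of loss that should at least be measured before the exclusion decision is made.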

It is difficult to produce "rules" for outlier issues like these; decisions on how to approach them are often specific to an institution or to project goals. Consider, though, the "Napoleon Dynamites" in your own work, continue to experiment with ideas, and challenge your own work by creating new ways to use the data at your fingertips to answer your own questions.

If You Liked This, You’re Sure to Love That
By CLIVE THOMPSON
Published: November 21, 2008


THE “NAPOLEON DYNAMITE” problem is driving Len Bertoni crazy. Bertoni is a 51-year-old “semiretired” computer scientist who lives an hour outside Pittsburgh. In the spring of 2007, his sister-in-law e-mailed him an intriguing bit of news: Netflix, the Web-based DVD-rental company, was holding a contest to try to improve Cinematch, its “recommendation engine.” The prize: $1 million.

Read More


June 4, 2007

Netflix Prize Still Awaits a Movie Seer

Since its inception, Netflix has employed analytics to drive growth and increase its competitive advantage. Last fall it launched a contest seeking the brains and skills of analytics gurus outside the company. The goal: improve the accuracy of the existing Cinematch movie recommendation system. The prize: $1 million.

The following article from the New York Times provides a summary of the contest results to date. Details about the contest are available at Netflix Prize.

Sometimes a good idea becomes a great one after it is set loose.

Last October, Netflix, the online movie rental service, announced that it would award $1 million to the first person or team who can devise a system that is 10 percent more accurate than the company’s current system for recommending movies that customers would like.

About 18,000 teams from more than 150 countries — using ideas from machine learning, neural networks, collaborative filtering and data mining — have submitted more than 12,000 sets of guesses. And the improvement level to Netflix’s rating system is now at 7.42 percent.

Read more


February 4, 2011

The "Netflix prize" model...only this time more serious

Josh and I have both posted about the Netflix Prize...drawn both to the idea of creating very accurate preference/choice models with a very large menu of outcomes and to the crowd-sourced approach to solving the problem (and there was in fact a winner).

Well, I am very excited to share that this approach has been applied to a more "serious" problem: building a model that will predict upcoming hospitalizations. The end result is far loftier than a movie pairing with "Young Frankenstein." This project hopes to identify individuals at the greatest risk of imminent adverse events before they happen, creating an "early detection" system to save lives and reduce overall costs.

A $3 million prize is also a great incentive...keep an eye on this contest.

Netflix Prize-Style Competition Predicts Hospitalizations
What if you could predict if a given patient were at a higher risk for hospitalization in the coming year? You could potentially save money, and lives, by pulling out all the stops to prevent that hospital visit, if possible. And that's why the Heritage Provider Network (HPN) has put up $3 million for a Netflix Prize-style competition that will pit coders against each other to devise the most effective predictive algorithm for incipient hospitalizations. HPN will be announcing a launch date for the prize this week.

Read more


April 26, 2007

Competing on Analytics

Here is a book I am eagerly anticipating. Thomas Davenport wrote an article of the same name for Harvard Business Review in January of 2006. I found the article enormously helpful for equipping researchers to make the case for building internal analytics programs. I will circle back to write a review in upcoming months.

The New York Police Department does it. The Harrah's casinos in Las Vegas do it. And businesses like Netflix are built entirely on the basis of it. It, in this case, is using the sophisticated analysis of data -- or "analytics" -- to drive decisions. As a concept, analytics is neither new nor complicated. Any dieter standing on the bathroom scale can attest that numbers are a more reliable source of information than intuition or a spouse's kind opinion. You might feel fit, but the numbers don't lie.

Read More
