Friday, November 30, 2007

Insights from the Netflix contest

The neat recent Wall Street Journal article Netflix Aims to Refine Art of Picking Films (Nov 20, 2007) was sent to me by Moshe Cohen, one of my dedicated ex-data-mining-course students. In the article, a spokesman from Netflix demystifies some of the winning techniques in the Netflix $1 million contest. OK, not really demystifying, but it does reveal two interesting insights:

1) Some teams joined forces by combining their predictions to obtain improved ones (without disclosing their actual algorithms to each other). Today, for instance, the third-best team on the Netflix Leaderboard is "When Gravity and Dinosaurs Unite", the result of two teams combining their predictions (Gravity from Hungary and Dinosaur Planet from the US). This is an example of the "portfolio approach": combining predictions from a variety of methods (and sometimes a variety of datasets) can lead to higher performance, just like stock portfolios (see the sketch at the end of this post).

2) AT&T, who is currently in the lead, takes an approach that includes 107 different techniques (blended in different ways). You can get a glimpse of these methods in their publicly available document written by Robert Bell, Yehuda Koren, and Chris Volinsky (kudos for the "open-source"!). They use regression models, k-nearest-neighbor methods, collaborative filtering, "portfolios" of the different methods, etc. Again, this shows that "looking" at data from multiple views is usually very beneficial. Like painkillers, a variety is useful because sometimes one works but other times another works better.

Please note that this does NOT suggest that a portfolio approach with painkillers is recommended!
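
For concreteness, here is a minimal sketch in Python of both ideas above, using simulated ratings rather than actual Netflix data: an equal-weight "portfolio" of two teams' predictions, and a blend of several methods whose weights are fit on a holdout set by least squares. The error levels, weights, and the least-squares blending are illustrative assumptions, not the teams' actual submissions or the Bell/Koren/Volinsky method.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
truth = rng.integers(1, 6, size=n).astype(float)   # simulated "true" ratings on a 1-5 scale

def rmse(pred, actual):
    return np.sqrt(np.mean((pred - actual) ** 2))

# Insight 1: equal-weight "portfolio" of two teams' predictions.
team_a = truth + rng.normal(0, 1.0, size=n)          # each team's predictions are noisy
team_b = truth + rng.normal(0, 1.0, size=n)          # but roughly unbiased
portfolio = 0.5 * team_a + 0.5 * team_b
print(rmse(team_a, truth), rmse(team_b, truth), rmse(portfolio, truth))

# Insight 2: blend several methods, with weights fit on a holdout set.
# Columns stand in for different techniques (regression, k-NN, collaborative filtering, ...).
methods = np.column_stack([truth + rng.normal(0, s, size=n) for s in (0.8, 1.0, 1.2)])
holdout, test = slice(0, n // 2), slice(n // 2, None)
weights, *_ = np.linalg.lstsq(methods[holdout], truth[holdout], rcond=None)
blend = methods[test] @ weights
print([round(rmse(methods[test][:, j], truth[test]), 3) for j in range(3)],
      round(rmse(blend, truth[test]), 3))
```

Because the individual errors are largely independent, the combined predictions typically have lower RMSE than any single prediction set, which is the whole point of the portfolio analogy.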

Thursday, November 08, 2007

Good and bad of classification/regression trees

Classification and Regression Trees are great for both explanatory and predictive modeling. Although data-driven, they provide transparency about the resulting classifier and are far from being a black box. For this reason trees are often used in applications that require transparency, such as insurance or credit approvals.
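
As a small illustration of this transparency, here is a sketch using scikit-learn and a built-in toy dataset (purely illustrative, not insurance or credit data): the fitted tree can be printed as explicit if/then rules.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small tree; depth is capped only to keep the printed rules short.
data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Every prediction can be traced through readable splitting rules:
# this is what makes trees acceptable where transparency is required.
print(export_text(clf, feature_names=list(data.feature_names)))
```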

Trees are also used during the exploratory phase for the purpose of variable selection: variables that show up in the top layers of the tree are good candidates for "key players".
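
A rough sketch of this idea, again with scikit-learn and an illustrative dataset and depth cutoff: collect the variables used for splits in the top two levels of a fitted tree.

```python
from collections import deque

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(data.data, data.target)

def top_level_features(fitted_tree, feature_names, max_level=2):
    """Collect the variables used for splits in the top max_level levels of the tree."""
    t = fitted_tree.tree_
    used, queue = [], deque([(0, 0)])                # (node id, level); the root is node 0
    while queue:
        node, level = queue.popleft()
        if level >= max_level or t.children_left[node] == -1:
            continue                                 # past the cutoff, or a leaf node
        used.append(feature_names[t.feature[node]])
        queue.append((t.children_left[node], level + 1))
        queue.append((t.children_right[node], level + 1))
    return used

print(top_level_features(tree, list(data.feature_names)))
```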

Trees do not make any distributional assumptions and are also quite robust to outliers. They can nicely capture local pockets of behavior that would require complicated interaction terms in regression-type models. Although this sounds like the perfect tool, there is no free lunch. First, a tree usually requires lots of data: the tree is built on a training set; then, in CART trees, a validation set is used to prune the tree in order to avoid over-fitting; finally, a test set is needed to evaluate the tree's actual performance on new data. Second, a tree can be computationally expensive to build, depending on the number of variables: building a tree requires evaluating a huge number of candidate splits on all possible variables and their values (especially if they are numeric). The good news is that once the tree is built, scoring new data is cheap (unlike k-nearest-neighbor methods, which are very costly at the scoring stage).
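
The train/validation/test workflow might look roughly like this in scikit-learn; the dataset, the split sizes, and the use of cost-complexity pruning as a stand-in for CART-style pruning are all illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Three-way split: train (60%), validation (20%), test (20%).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Grow the full tree on the training set; candidate pruning levels come from it.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Use the validation set to choose how much to prune, and touch the test set
# only once, for the final performance estimate.
best_alpha = max(alphas, key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
                 .fit(X_train, y_train).score(X_valid, y_valid))
final = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))
```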

As in any prediction task, the greatest danger is that of over-fitting. In trees this is avoided either by stopping tree growth (e.g., in CHAID-type trees that are popular in marketing), or by growing the entire tree and then pruning it. In the latter case, when comparing the full and pruned trees there will usually be a huge difference in tree size. However, there can be cases where the two trees have similar out-of-sample performance: this happens when the data contain very little noise, so that over-fitting is not substantial. You can find such an example in our book Data Mining for Business Intelligence ("Acceptance of Personal Loan", chapter 7, pp. 120-129).
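
Here is a rough sketch of that comparison, contrasting the size and holdout accuracy of a full tree and a pruned one; the dataset and the single pruning parameter are illustrative assumptions (this is not the loan example from the book).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

# The pruned tree is usually far smaller; when the data contain little noise,
# its holdout accuracy is often very close to that of the full tree.
for name, model in [("full", full), ("pruned", pruned)]:
    print(name, "nodes:", model.tree_.node_count,
          "holdout accuracy:", round(model.score(X_test, y_test), 3))
```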