Thursday, November 08, 2007

Good and bad of classification/regression trees

Classification and Regression Trees are great for both explanatory and predictive modeling. Although data driven, they provide transparency about the resulting classifier and are far from being a black box. For this reason trees are often used in applications that require transparency, such as insurance or credit approvals.
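To make the transparency point concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier and export_text on the classic iris data (an illustrative choice of toolkit and dataset, not something this post prescribes): the fitted tree prints as an explicit set of if/then rules that an underwriter or auditor could inspect split by split.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Every prediction traces to one explicit if/then path of splits.
print(export_text(tree, feature_names=list(data.feature_names)))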

Trees are also used during the exploratory phase for the purpose of variable selection: variables that show up in the top layers of the tree are good candidates for "key players".
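As a sketch of that exploratory use (again assuming scikit-learn, with its bundled breast-cancer sample data standing in for a real application), the snippet below fits a shallow tree and reads off which variable claims the root split and which lead the importance ranking.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# The variable at the root split is the strongest single "key player";
# impurity-based importances rank the rest.
print("root split on:", data.feature_names[tree.tree_.feature[0]])
for i in np.argsort(tree.feature_importances_)[::-1][:3]:
    print(data.feature_names[i], round(float(tree.feature_importances_[i]), 3))
```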

Trees do not make any distributional assumptions and are also quite robust to outliers. They can nicely capture local pockets of behavior that would require complicated interaction terms in regression-type models. Although this sounds like the perfect tool, there is no free lunch. First, a tree usually requires lots of data: the tree is built on a training set; in CART-style trees a validation set is then used to prune the tree to avoid over-fitting; finally, a test set is needed to evaluate the tree's actual performance on new data. Second, a tree can be computationally expensive to build, especially as the number of variables grows: building a tree requires evaluating a huge number of candidate splits across all variables and their values (especially numerical ones). The good news is that once the tree is built, scoring new data is cheap (unlike k-nearest-neighbor methods, which are expensive precisely at the scoring stage).
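Here is a sketch of the three-way data split described above, assuming scikit-learn and its breast-cancer data purely for illustration; the 60/20/20 proportions are an arbitrary choice, not a rule from this post. Note how cheap scoring is once the tree exists: each new record costs only one root-to-leaf walk.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# 60% training (grow the tree), 20% validation (choose the pruning
# level later), 20% test (honest estimate of performance on new data).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Building evaluated many candidate splits; scoring is just one
# root-to-leaf walk per record, so it is cheap.
print("test accuracy of the unpruned tree:", round(full.score(X_test, y_test), 3))
```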

As in any prediction task, the greatest danger is that of over-fitting. In trees this is avoided either by stopping tree growth early (e.g., in CHAID-type trees, which are popular in marketing), or by growing the entire tree and then pruning it back. In the latter case, the pruned tree is usually much smaller than the full tree. However, there are cases where the two trees have similar out-of-sample performance: this happens when the data contain very little noise, so over-fitting is not substantial. You can find such an example in our book Data Mining for Business Intelligence ("Acceptance of Personal Loan", chapter 7, pp. 120-129).
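The full-versus-pruned comparison can be sketched as follows. A caveat on the assumptions: scikit-learn prunes via cost-complexity (its ccp_alpha parameter) rather than the validation-set pruning used in the book's CART examples, so here the validation set only chooses the pruning strength; the point is comparing tree sizes against out-of-sample accuracy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Grow the full tree, then pick the pruning strength (ccp_alpha)
# that performs best on the validation set.
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
pruned = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in alphas),
    key=lambda t: t.score(X_valid, y_valid),
)

for name, t in [("full", full), ("pruned", pruned)]:
    print(name, "leaves:", t.get_n_leaves(),
          "test accuracy:", round(t.score(X_test, y_test), 3))
```

On low-noise data, expect the two test accuracies to come out close even though the leaf counts differ sharply, which is exactly the situation the loan-acceptance example illustrates.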
