Friday, April 20, 2007

Classification Trees: CART vs. CHAID

When it comes to classification trees, there are three major algorithms used in practice. CART ("Classification and Regression Trees"), C4.5, and CHAID.

All three algorithms create classification rules by constructing a tree-like structure of the data. However, they are different in a few important ways.

The main difference is in the tree construction process. In order to avoid over-fitting the data, all methods try to limit the size of the resulting tree. CHAID (and variants of CHAID) achieve this by using a statistical stopping rule that discontinuous tree growth. In contrast, both CART and C4.5 first grow the full tree and then prune it back. The tree pruning is done by examining the performance of the tree on a holdout dataset, and comparing it to the performance on the training set. The tree is pruned until the performance is similar on both datasets (thereby indicating that there is no over-fitting of the training set). This highlights another difference between the methods: CHAID and C4.5 use a single dataset to arrive at the final tree, whereas CART uses a training set to build the tree and a holdout set to prune it.

A difference between CART and the other two is that the CART splitting rule allows only binary splits (e.g., "if Income<$50K then X, else Y"), whereas C4.5 and CHAID allow multiple splits. In the latter, trees sometimes look more like bushes. CHAID has been especially popular in marketing research, in the context of market segmentation. In other areas, CART and C4.5 tend to be more popular. One important difference that came to my mind is in the goal that CHAID is most useful for, compared to the goal of CART. To clarify my point, let me first explain the CHAID mechanism in a bit more detail. At each split, the algorithm looks for the predictor variable that if split, most "explains" the category response variable. In order to decide whether to create a particular split based on this variable, the CHAID algorithm tests a hypothesis regarding dependence between the splitted variable and the categorical response(using the chi-squared test for independence). Using a pre-specified significance level, if the test shows that the splitted variable and the response are independent, the algorithm stops the tree growth. Otherwise the split is created, and the next best split is searched. In contrast, the CART algorithm decides on a split based on the amount of homogeneity within class that is achieved by the split. And later on, the split is reconsidered based on considerations of over-fitting. Now I get to my point: It appears to me that CHAID is most useful for analysis, whereas CART is more suitable for prediction. In other words, CHAID should be used when the goal is to describe or understand the relationship between a response variable and a set of explanatory variables, whereas CART is better suited for creating a model that has high prediction accuracy of new cases.

In the book Statistics Methods and Applications by Hill and Lewicki, the authors mention another related difference, related to CART's binary splits vs. CHAIDs multiple-category splits: "CHAID often yields many terminal nodes connected to a single branch, which can be conveniently summarized in a simple two-way table with multiple categories for each variable of dimension of the table. This type of display matches well the requirements for research on market segmentation... CART will always yield binary trees, which sometimes can not be summarized as efficiently for interpretation and/or presentation". In other words, if the goal is explanatory, CHAID is better suited for the task.

There are additional differences between the algorithms, which I will not mention here. Some can be found in the excellent Statistics Methods and Applications by Hill and Lewicki.
Post a Comment