Friday, April 20, 2007

Classification Trees: CART vs. CHAID

When it comes to classification trees, there are three major algorithms used in practice: CART ("Classification and Regression Trees"), C4.5, and CHAID.

All three algorithms create classification rules by constructing a tree-like structure of the data. However, they are different in a few important ways.

The main difference is in the tree construction process. To avoid over-fitting the data, all methods try to limit the size of the resulting tree. CHAID (and variants of CHAID) achieve this by using a statistical stopping rule that discontinues tree growth. In contrast, both CART and C4.5 first grow the full tree and then prune it back. In CART, the pruning is done by examining the performance of the tree on a holdout dataset and comparing it to the performance on the training set; the tree is pruned until the performance is similar on both datasets (thereby indicating that there is no over-fitting of the training set). This highlights another difference between the methods: CHAID and C4.5 use a single dataset to arrive at the final tree, whereas CART uses a training set to build the tree and a holdout set to prune it.
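The grow-then-prune idea can be sketched in a few lines. The snippet below is a simplified, reduced-error-style pruning pass over a toy tree of nested dicts; this is a hypothetical representation for illustration, not CART's actual cost-complexity procedure. A subtree is collapsed to its majority label whenever the holdout records reaching it are classified at least as well by that single label.

```python
def predict(tree, x):
    """Route a record down the tree; a leaf is just a class label."""
    while isinstance(tree, dict):
        branch = "left" if x[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[branch]
    return tree

def prune(tree, holdout):
    """Bottom-up pass: collapse a subtree into its majority holdout label
    whenever that does not hurt accuracy on the holdout records that
    reach the subtree."""
    if not isinstance(tree, dict):
        return tree
    f, t = tree["feature"], tree["threshold"]
    tree["left"] = prune(tree["left"], [(x, y) for x, y in holdout if x[f] < t])
    tree["right"] = prune(tree["right"], [(x, y) for x, y in holdout if x[f] >= t])
    if not holdout:
        return tree
    subtree_acc = sum(predict(tree, x) == y for x, y in holdout) / len(holdout)
    labels = [y for _, y in holdout]
    majority = max(set(labels), key=labels.count)
    if labels.count(majority) / len(labels) >= subtree_acc:
        return majority          # the leaf does at least as well: prune
    return tree

# Toy over-grown tree: the right-hand subtree splits on noise.
tree = {"feature": 0, "threshold": 5, "left": "A",
        "right": {"feature": 1, "threshold": 3, "left": "B", "right": "A"}}
holdout = [((2, 0), "A"), ((3, 1), "A"), ((7, 1), "B"), ((8, 5), "B")]
pruned = prune(tree, holdout)    # right subtree collapses to the leaf "B"
```

On this toy holdout the noisy right-hand subtree misclassifies half of the records that reach it, so it is replaced by a single leaf, while the root split survives.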

A difference between CART and the other two is that the CART splitting rule allows only binary splits (e.g., "if Income<$50K then X, else Y"), whereas C4.5 and CHAID allow multiple splits. In the latter, trees sometimes look more like bushes. CHAID has been especially popular in marketing research, in the context of market segmentation. In other areas, CART and C4.5 tend to be more popular.

One important difference that came to my mind is in the goal that CHAID is most useful for, compared to the goal of CART. To clarify my point, let me first explain the CHAID mechanism in a bit more detail. At each split, the algorithm looks for the predictor variable that, if split, most "explains" the categorical response variable. To decide whether to create a particular split based on this variable, the CHAID algorithm tests the hypothesis of dependence between the split variable and the categorical response (using the chi-squared test for independence). Using a pre-specified significance level, if the test shows that the split variable and the response are independent, the algorithm stops the tree growth; otherwise the split is created, and the next best split is searched for. In contrast, the CART algorithm decides on a split based on the amount of within-class homogeneity achieved by the split, and the split is later reconsidered based on considerations of over-fitting.

Now I get to my point: it appears to me that CHAID is most useful for analysis, whereas CART is more suitable for prediction. In other words, CHAID should be used when the goal is to describe or understand the relationship between a response variable and a set of explanatory variables, whereas CART is better suited for creating a model that has high prediction accuracy for new cases.
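As a rough sketch of CHAID's stopping decision for a single candidate split, the snippet below runs a Pearson chi-squared test of independence between split membership and the response. The table, the $50K cutoff, and the significance level are made-up illustration values, and real CHAID also applies a Bonferroni adjustment for the many candidate splits, which this sketch omits.

```python
from math import erfc, sqrt

def chi2_statistic(table):
    """Pearson chi-squared statistic for a contingency table
    (rows = split groups, columns = response categories)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def chi2_sf_df1(x):
    """P(X > x) for a chi-squared variable with 1 degree of freedom."""
    return erfc(sqrt(x / 2))

# Candidate split "Income < $50K" vs "Income >= $50K", against a
# binary buy/no-buy response (counts are made up):
table = [[30, 10],   # Income < 50K: 30 buyers, 10 non-buyers
         [12, 28]]   # Income >= 50K: 12 buyers, 28 non-buyers
stat = chi2_statistic(table)
p_value = chi2_sf_df1(stat)   # df = (2-1) * (2-1) = 1 for a 2x2 table
alpha = 0.05
split_here = p_value < alpha  # grow the tree only if dependence is detected
```

Here the tiny p-value says the split variable and the response are dependent, so the split would be made; with a weaker table the test would fail to reject independence and that branch would stop growing.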

In the book Statistics Methods and Applications by Hill and Lewicki, the authors mention a related difference, stemming from CART's binary splits vs. CHAID's multiple-category splits: "CHAID often yields many terminal nodes connected to a single branch, which can be conveniently summarized in a simple two-way table with multiple categories for each variable of dimension of the table. This type of display matches well the requirements for research on market segmentation... CART will always yield binary trees, which sometimes can not be summarized as efficiently for interpretation and/or presentation". In other words, if the goal is explanatory, CHAID is better suited for the task.

There are additional differences between the algorithms, which I will not mention here. Some can be found in the excellent Statistics Methods and Applications by Hill and Lewicki.


ilyal said...

A comment on the difference between the CART and CHAID splitting criteria: At least for the case of binary splitting, if you do a little algebra you may discover that the increase in homogeneity after splitting the data into two child nodes, as measured by the Gini index, is basically proportional to the chi-squared statistic for independence on that split. In light of this, I think the methodological distinction between the former as more useful for the task of "prediction" and the latter for "explanation" appears somewhat superficial.
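For a binary response and a binary split, this claim is easy to check numerically. The sketch below (with made-up counts) compares the Gini decrease and the uncorrected chi-squared statistic for two different splits of the same node; the ratio between the two depends only on the node's class mix, so within a node the two criteria rank candidate binary splits identically.

```python
def gini_decrease(a, b, c, d):
    """Gini impurity reduction for a binary split [a, b | c, d]
    (left child: a of class 0, b of class 1; right child: c, d)."""
    n, nl, nr = a + b + c + d, a + b, c + d
    parent = 2 * (a + c) * (b + d) / n**2            # Gini of the parent node
    children = (nl / n) * (2 * a * b / nl**2) + (nr / n) * (2 * c * d / nr**2)
    return parent - children

def chi2(a, b, c, d):
    """Pearson chi-squared for the same 2x2 table, no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Two different splits of the same node (class totals 50 / 50 both times):
ratios = [gini_decrease(a, b, c, d) / chi2(a, b, c, d)
          for a, b, c, d in [(40, 10, 10, 40), (35, 5, 15, 45)]]
# The ratio is the same for both splits: 2 * p0 * p1 / n, where p0 and p1
# are the parent node's class proportions (here 0.5 each, n = 100).
```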

Unknown said...

I need some help. CART is more useful for prediction while CHAID is more useful for analysis. Is there any possibility that the results of the two tools will meet halfway?
Is it possible for the two to have results for both analysis and prediction? Please elaborate.

Galit Shmueli said...

Fellah - there are two issues here. The general issue is whether there is a single solution that provides both the best predictive power and the strongest explanatory power. I believe the answer is no. You need to decide whether the main goal is predictive, with explanatory power as a second-level goal, or the other way around. This is true for any predictive model vs. explanatory model.

With respect to CART and CHAID, I suppose that in some cases you will get similar trees (if the signal is very strong). You can use the commonalities to assess robustness (single trees are known to lack robustness). I am not aware of research on the differences between CART and CHAID along these lines (only along the lines of predictive power).