
Saturday, September 01, 2012

Trees in pivot table terminology

Recently, I've been requested by non-data-mining colleagues to explain how Classification and Regression Trees work. While a detailed explanation with examples exists in my co-authored textbook Data Mining for Business Intelligence, I found that the following explanation worked well with people who are familiar with Excel's Pivot Tables:

[Image: Classification tree for predicting vulnerability to famine]
Suppose the goal is to generate predictions for some variable, numerical or categorical, given a set of predictors. The idea behind trees is to create groups of records with similar profiles in terms of their predictors, and then average the outcome variable of interest to generate a prediction.

Here's an interesting example from the paper Identifying Indicators of Vulnerability to Famine and Chronic Food Insecurity by Yohannes and Webb, showing predictors of vulnerability to famine based on a survey of households. The image shows all the predictors identified by the tree, each appearing below a circle. Each predictor is a binary variable, and you go right or left depending on its value. It is easiest to start reading from the top, with a household in mind.

Our goal is to generate groups of households with similar profiles, where profiles are the combination of answers to different survey questions. 
Using the language of pivot tables, our predictions will be in the Values field, and we can use the Row (or Column) Labels to break the records down by the predictors. What does the tree do? Here's a "pivot table" description:

  1. Drag the outcome of interest into the Values area
  2. Find the first predictor that best splits the profiles and drag it into the Row Label field*.
  3. Given the first predictor, find the next predictor to further split the profiles, and drag into the Row Label field** .
  4. Given the first two splits, find the next predictor to further split the profiles (could also be one of the earlier variables) and drag into the Row Label field***
  5. Continue this process until a stopping criterion (designed to avoid over-fitting) is reached
You might imagine the final result as a really crowded Pivot Table, with multiple predictors in the Row Label fields. This is indeed quite close, except for a few slight differences:

* Each time a predictor is dragged into the Row or Column Labels fields, it is converted into a binary variable, creating only two classes. For example, 
  • Gender would not change (Female/Male)
  • Country could be turned into "India/Other". 
  • noncereal yield was discretized into "Above/below 4.7".

** After a predictor is dragged, the next predictor is actually dragged only into one of the two splits of the first predictor. In our example, after dragging noncereal yield (Above/Below 4.7), the predictor oxen owned (Above/Below 1.5) only applies to noncereal yield Below 4.7.

*** We also note that a tree can "drag" a predictor more than once into the Row Labels fields. For example, TLU/capita appears twice in the tree, so theoretically in the pivot table we'd drag TLU/capita after oxen owned and again after crop diversity.
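
To see the analogy in code, here is a minimal sketch using pandas. The column names and data are hypothetical stand-ins for the survey variables; only the split values 4.7 and 1.5 come from the tree described above.

```python
import pandas as pd

# Hypothetical household data; names and values are illustrative only
df = pd.DataFrame({
    "noncereal_yield":      [3.1, 5.2, 4.9, 2.0, 6.3, 1.8],
    "oxen_owned":           [1,   2,   0,   3,   1,   0],
    "famine_vulnerability": [1,   0,   0,   1,   0,   1],
})

# Each predictor is first binarized on the split value the tree chose
df["noncereal_yield_bin"] = df["noncereal_yield"].gt(4.7).map({True: "Above 4.7", False: "Below 4.7"})
df["oxen_owned_bin"] = df["oxen_owned"].gt(1.5).map({True: "Above 1.5", False: "Below 1.5"})

# "Drag" the outcome into Values and the binarized predictors into Row Labels:
# each cell is the average outcome for one profile of households
print(pd.pivot_table(df,
                     values="famine_vulnerability",
                     index=["noncereal_yield_bin", "oxen_owned_bin"],
                     aggfunc="mean"))
```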

So where is the "intelligence" of a tree over an elaborate pivot table? First, it automatically determines which predictor is the best one to use at each stage. Second, it automatically determines the value on which to split. Third, it knows when to stop, to avoid over-fitting the data. In a pivot table, the user would have to determine which predictors to include, their order, and the critical values to split on. And finally, this complex process going on behind the scenes is presented in an easily interpretable tree chart.
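
To make the first two points concrete, here is a minimal, purely illustrative sketch (not the exact CART algorithm) of how a tree picks the best predictor and split value: it scans candidate cutoffs for every predictor and keeps the split whose two resulting groups are most homogeneous.

```python
import numpy as np

def gini(y):
    """Gini impurity of a 0/1 outcome vector."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 2 * p * (1 - p)

def best_split(X, y, predictor_names):
    """Return the (predictor, cutoff) pair minimizing the weighted impurity of the two groups."""
    best = (None, None, np.inf)
    for j, name in enumerate(predictor_names):
        for cutoff in np.unique(X[:, j]):
            left, right = y[X[:, j] <= cutoff], y[X[:, j] > cutoff]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (name, cutoff, score)
    return best

# Hypothetical data: two predictors, binary outcome
X = np.array([[3.1, 1], [5.2, 2], [4.9, 0], [2.0, 3], [6.3, 1], [1.8, 0]])
y = np.array([1, 0, 0, 1, 0, 1])
print(best_split(X, y, ["noncereal_yield", "oxen_owned"]))
```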

Thursday, November 08, 2007

Good and bad of classification/regression trees

Classification and Regression Trees are great for both explanatory and predictive modeling. Although data driven, they provide transparency about the resulting classifier and are far from being a black box. For this reason trees are often used in applications that require transparency, such as insurance or credit approvals.

Trees are also used during the exploratory phase for the purpose of variable selection: variables that show up at the top layers of the tree are good candidates as "key players".
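
As a quick illustration (a sketch using scikit-learn and one of its built-in datasets, rather than any particular application), one can grow a shallow tree and inspect which variables occupy the top splits:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# Built-in example data standing in for an actual application
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Variables appearing in the top layers are candidate "key players"
print(export_text(tree, feature_names=list(X.columns)))
print(sorted(zip(tree.feature_importances_, X.columns), reverse=True)[:5])
```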

Trees do not make any distributional assumptions and are also quite robust to outliers. They can nicely capture local pockets of behavior that would require complicated interaction terms in regression-type models. Although this sounds like the perfect tool, there is no free lunch. First, a tree usually requires lots of data: the tree is built on a training set; then, in CART trees, a validation set is used to prune the tree to avoid over-fitting; finally, a test dataset is needed for evaluating the actual performance of the tree on new data. Second, a tree can be quite computationally expensive to create, especially as the number of variables grows: building a tree requires evaluating a huge number of candidate splits on all possible variables and their values (especially if they are numeric). The good news is that once the tree is built, scoring new data is cheap (unlike k-nearest-neighbor algorithms, which are also very costly when scoring new data).
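
Here is a sketch of that three-dataset workflow in scikit-learn; cost-complexity pruning is used as a stand-in for CART's validation-based pruning, and the dataset is just a built-in example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Candidate pruning levels derived from the full tree grown on the training set
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Pick the pruning level that does best on the validation set
trees = [DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train) for a in alphas]
best = max(trees, key=lambda t: t.score(X_valid, y_valid))

# Evaluate once on the test set; scoring new data is cheap once the tree is built
print("test accuracy:", best.score(X_test, y_test))
```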

As in any prediction task, the greatest danger is that of over-fitting. In trees this is avoided by either stopping tree growth (e.g., in CHAID-type trees that are popular in marketing), or by growing the entire tree and then pruning it. In the latter case, when comparing the full and pruned tree there will usually be a huge difference in the tree sizes. However, there could be cases where the two trees have similar out-of-sample performance: this happens when the data contain very little noise. In that case over-fitting is not substantial. You can find such an example in our book Data Mining for Business Intelligence ("Acceptance of Personal Loan", chap 7 pp. 120-129).
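
One way to see that last point is to compare a full and a pruned tree on data with very little noise. This is a sketch on synthetic, essentially noise-free data, with ccp_alpha again standing in for CART-style pruning; it prints each tree's size and out-of-sample accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with no label noise (flip_y=0.0)
X, y = make_classification(n_samples=2000, n_informative=4, flip_y=0.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

full = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=1).fit(X_train, y_train)

for name, t in [("full", full), ("pruned", pruned)]:
    print(name, "nodes:", t.tree_.node_count, "test accuracy:", round(t.score(X_test, y_test), 3))
```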

Friday, April 20, 2007

Classification Trees: CART vs. CHAID

When it comes to classification trees, there are three major algorithms used in practice: CART ("Classification and Regression Trees"), C4.5, and CHAID.

All three algorithms create classification rules by constructing a tree-like structure of the data. However, they are different in a few important ways.

The main difference is in the tree construction process. In order to avoid over-fitting the data, all methods try to limit the size of the resulting tree. CHAID (and variants of CHAID) achieve this by using a statistical stopping rule that discontinues tree growth. In contrast, both CART and C4.5 first grow the full tree and then prune it back. The tree pruning is done by examining the performance of the tree on a holdout dataset, and comparing it to the performance on the training set. The tree is pruned until the performance is similar on both datasets (thereby indicating that there is no over-fitting of the training set). This highlights another difference between the methods: CHAID and C4.5 use a single dataset to arrive at the final tree, whereas CART uses a training set to build the tree and a holdout set to prune it.

A difference between CART and the other two is that the CART splitting rule allows only binary splits (e.g., "if Income<$50K then X, else Y"), whereas C4.5 and CHAID allow multiple splits. In the latter, trees sometimes look more like bushes. CHAID has been especially popular in marketing research, in the context of market segmentation. In other areas, CART and C4.5 tend to be more popular.

One important difference that came to my mind is in the goal that CHAID is most useful for, compared to the goal of CART. To clarify my point, let me first explain the CHAID mechanism in a bit more detail. At each split, the algorithm looks for the predictor variable that, if split, most "explains" the categorical response variable. In order to decide whether to create a particular split based on this variable, the CHAID algorithm tests for dependence between the split variable and the categorical response (using the chi-squared test for independence). Using a pre-specified significance level, if the test shows that the split variable and the response are independent, the algorithm stops the tree growth. Otherwise the split is created, and the search continues for the next best split. In contrast, the CART algorithm decides on a split based on the amount of class homogeneity within the resulting nodes, and later on the split is reconsidered based on considerations of over-fitting.

Now I get to my point: it appears to me that CHAID is most useful for analysis, whereas CART is more suitable for prediction. In other words, CHAID should be used when the goal is to describe or understand the relationship between a response variable and a set of explanatory variables, whereas CART is better suited for creating a model that has high prediction accuracy for new cases.
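
To make the CHAID-style stopping decision above concrete, here is a minimal sketch (ignoring CHAID's category merging and Bonferroni adjustments) that uses scipy's chi-squared test of independence to pick a split or stop; the data and column names are hypothetical:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def choose_chaid_split(df, predictors, response, alpha=0.05):
    """Pick the most significant categorical predictor, or return None to stop growing."""
    p_values = {}
    for col in predictors:
        table = pd.crosstab(df[col], df[response])   # contingency table of predictor vs. response
        _, p, _, _ = chi2_contingency(table)         # chi-squared test of independence
        p_values[col] = p
    best = min(p_values, key=p_values.get)
    if p_values[best] > alpha:
        return None          # no significant dependence -> stop growing this branch
    return best              # otherwise split on the most significant predictor

# Hypothetical categorical data
df = pd.DataFrame({
    "region":   ["A", "A", "B", "B", "A", "B", "A", "B"],
    "gender":   ["F", "M", "F", "M", "F", "M", "F", "M"],
    "response": ["yes", "yes", "no", "no", "yes", "no", "yes", "no"],
})
print(choose_chaid_split(df, ["region", "gender"], "response"))
```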

In the book Statistics: Methods and Applications by Hill and Lewicki, the authors mention another difference, related to CART's binary splits vs. CHAID's multiple-category splits: "CHAID often yields many terminal nodes connected to a single branch, which can be conveniently summarized in a simple two-way table with multiple categories for each variable of dimension of the table. This type of display matches well the requirements for research on market segmentation... CART will always yield binary trees, which sometimes can not be summarized as efficiently for interpretation and/or presentation". In other words, if the goal is explanatory, CHAID is better suited for the task.

There are additional differences between the algorithms, which I will not mention here. Some can be found in the excellent Statistics: Methods and Applications by Hill and Lewicki.