Recently, non-data-mining colleagues have asked me to explain how Classification and Regression Trees work. While a detailed explanation with examples exists in my co-authored textbook Data Mining for Business Intelligence, I found that the following explanation worked well with people who are familiar with Excel's Pivot Tables:
[Figure: Classification tree for predicting vulnerability to famine]
Suppose the goal is to generate predictions for some outcome variable, numerical or categorical, given a set of predictors. The idea behind trees is to create groups of records with similar profiles in terms of their predictors, and then summarize the outcome variable within each group to generate a prediction: the average for a numerical outcome, or the class proportions (or most common class) for a categorical one.
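To make the "group similar profiles, then average" idea concrete, here is a minimal sketch in plain Python. The records, the predictor names (`owns_oxen`, `low_yield`) and the outcome name (`vulnerable`) are made up for illustration and do not come from the paper.

```python
from collections import defaultdict

# Hypothetical survey records: two binary predictors and a 0/1 outcome.
records = [
    {"owns_oxen": True,  "low_yield": False, "vulnerable": 0},
    {"owns_oxen": True,  "low_yield": False, "vulnerable": 0},
    {"owns_oxen": True,  "low_yield": True,  "vulnerable": 1},
    {"owns_oxen": True,  "low_yield": True,  "vulnerable": 0},
    {"owns_oxen": False, "low_yield": True,  "vulnerable": 1},
    {"owns_oxen": False, "low_yield": True,  "vulnerable": 1},
]

def group_means(rows, predictors, outcome):
    """Average the outcome within each distinct predictor profile."""
    groups = defaultdict(list)
    for row in rows:
        profile = tuple(row[p] for p in predictors)
        groups[profile].append(row[outcome])
    return {profile: sum(vals) / len(vals) for profile, vals in groups.items()}

# Each key is a profile (owns_oxen, low_yield); each value is the share
# of vulnerable households with that profile, used as the prediction.
predictions = group_means(records, ["owns_oxen", "low_yield"], "vulnerable")
```

With a 0/1 outcome the group average is simply the proportion of records in the group that belong to class 1, which is why the same averaging step serves both numerical and categorical outcomes.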
Here's an interesting example from the paper Identifying Indicators of Vulnerability to Famine and Chronic Food Insecurity by Yohannes and Webb, showing predictors of vulnerability to famine based on a survey of households. The image shows all the predictors that were identified by the tree, which appear below each circle. Each predictor is a binary variable, and you go right or left depending on its value. It is easiest to start reading from the top, with a household in mind.
Our goal is to generate groups of households with similar profiles, where profiles are the combination of answers to different survey questions.
Using the language of pivot tables, our predictions will be in the Values field, and we can use the Row (or Column) Labels to break down the predictors. What does the tree do? Here's a "pivot table" description:
- Drag the outcome of interest into the Values area.
- Find the first predictor that best splits the profiles and drag it into the Row Label field.*
- Given the first predictor, find the next predictor to further split the profiles, and drag it into the Row Label field.**
- Given the first two splits, find the next predictor to further split the profiles (it could also be one of the earlier variables) and drag it into the Row Label field.***
- Continue this process until some stopping criterion is reached, to avoid over-fitting.
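The steps above can be sketched as a small greedy procedure. This is an illustrative sketch, not the algorithm used in the paper: it assumes numeric predictors, a 0/1 outcome, Gini impurity as the measure of how well a split separates the profiles, and two simple stopping rules (`max_depth` and `min_rows`). The data and all names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 means a pure group)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, predictors, outcome):
    """Return (weighted impurity, predictor, threshold) of the best split, or None."""
    best = None
    for pred in predictors:
        values = sorted({r[pred] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            below = [r[outcome] for r in rows if r[pred] < cut]
            above = [r[outcome] for r in rows if r[pred] >= cut]
            score = (len(below) * gini(below) + len(above) * gini(above)) / len(rows)
            if best is None or score < best[0]:
                best = (score, pred, cut)
    return best

def grow(rows, predictors, outcome, depth=0, max_depth=3, min_rows=2):
    """Recursively split the rows; each leaf predicts the group average."""
    labels = [r[outcome] for r in rows]
    leaf = {"prediction": sum(labels) / len(labels), "n": len(rows)}
    # Stopping rules (assumed): depth limit or too few rows to split.
    if depth >= max_depth or len(rows) < min_rows:
        return leaf
    split = best_split(rows, predictors, outcome)
    # Also stop when no split reduces the impurity.
    if split is None or gini(labels) - split[0] <= 1e-12:
        return leaf
    _, pred, cut = split
    below = [r for r in rows if r[pred] < cut]
    above = [r for r in rows if r[pred] >= cut]
    return {
        "predictor": pred,
        "threshold": cut,
        "below": grow(below, predictors, outcome, depth + 1, max_depth, min_rows),
        "above": grow(above, predictors, outcome, depth + 1, max_depth, min_rows),
    }

# Hypothetical households (not the survey data from the paper).
rows = [
    {"noncereal_yield": 6.0, "oxen_owned": 0, "vulnerable": 0},
    {"noncereal_yield": 5.5, "oxen_owned": 2, "vulnerable": 0},
    {"noncereal_yield": 5.0, "oxen_owned": 1, "vulnerable": 0},
    {"noncereal_yield": 4.0, "oxen_owned": 2, "vulnerable": 0},
    {"noncereal_yield": 3.5, "oxen_owned": 0, "vulnerable": 1},
    {"noncereal_yield": 3.0, "oxen_owned": 3, "vulnerable": 0},
    {"noncereal_yield": 2.5, "oxen_owned": 1, "vulnerable": 1},
    {"noncereal_yield": 2.0, "oxen_owned": 0, "vulnerable": 1},
]
tree = grow(rows, ["noncereal_yield", "oxen_owned"], "vulnerable")
```

On this toy data the sketch first splits on `noncereal_yield` and then splits only the low-yield branch on `oxen_owned`, which mirrors the "drag one predictor, then the next" sequence. Real tree software typically adds pruning and more refined stopping rules.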
You might imagine the final result as a really crowded Pivot Table, with multiple predictors in the Row Label fields. This is indeed quite close, except for two slight differences:
* Each time a predictor is dragged into the Row or Column Labels fields, it is converted into a binary variable, creating only two classes. For example,
- Gender would not change (Female/Male).
- Country could be turned into "India/Other".
- Noncereal yield was discretized into "Above/Below 4.7".
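As a rough illustration of this binary conversion, the sketch below collapses a categorical predictor into "one category versus Other" and a numerical predictor into "Above/Below" a threshold. The function names are hypothetical, and the choice to place a value exactly equal to the threshold in "Below" is an assumption, since the text does not say which side it falls on.

```python
def binarize_category(value, target):
    """Collapse a categorical predictor into two classes: target vs. Other."""
    return target if value == target else "Other"

def binarize_numeric(value, threshold):
    """Collapse a numerical predictor into Above/Below a threshold.

    Assumption: a value equal to the threshold is labeled "Below".
    """
    return "Above" if value > threshold else "Below"

countries = ["India", "Kenya", "India", "Ethiopia"]
country_classes = [binarize_category(c, "India") for c in countries]

yields = [5.2, 3.0, 4.7]
yield_classes = [binarize_numeric(y, 4.7) for y in yields]
```

A predictor that is already binary, such as Gender, passes through unchanged because it has only two classes to begin with.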
** After a predictor is dragged, the next predictor is actually dragged only into one of the two splits of the first predictor. In our example, after dragging noncereal yield (Above/Below 4.7), the predictor oxen owned (Above/Below 1.5) only applies to noncereal yield Below 4.7.
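The asymmetry is easy to see when the first two splits are written as nested conditions. In this sketch the thresholds 4.7 and 1.5 come from the example above, while the households, the field names, and the handling of values exactly on a threshold are assumptions made for illustration.

```python
from collections import Counter

def leaf_label(household):
    """Row label for a household after the first two splits."""
    if household["noncereal_yield"] >= 4.7:
        # This branch is not subdivided by oxen owned.
        return ("noncereal yield above 4.7",)
    if household["oxen_owned"] >= 1.5:
        return ("noncereal yield below 4.7", "oxen owned above 1.5")
    return ("noncereal yield below 4.7", "oxen owned below 1.5")

# Hypothetical households (not the survey data from the paper).
households = [
    {"noncereal_yield": 6.1, "oxen_owned": 0},
    {"noncereal_yield": 5.0, "oxen_owned": 3},
    {"noncereal_yield": 3.2, "oxen_owned": 2},
    {"noncereal_yield": 2.8, "oxen_owned": 0},
    {"noncereal_yield": 4.1, "oxen_owned": 1},
]

# Number of households that land in each group.
leaf_counts = Counter(leaf_label(h) for h in households)
```

An ordinary pivot table would cross both predictors and produce four groups; here there are only three, because households with high noncereal yield are never broken down by oxen owned.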
*** We also note that a tree can "drag" a predictor more than once into the Row Labels fields. For example, TLU/capita appears twice in the tree, so theoretically in the pivot table we'd drag TLU/capita after oxen owned and again after crop diversity.
So where is the "intelligence" of a tree over an elaborate pivot table? First, it automatically determines which predictor is the best one to use at each stage. Second, it automatically determines the value on which to split. Third, it knows when to stop, to avoid over-fitting the data. In a pivot table, the user would have to determine which predictors to include, their order, and the critical values to split on. And finally, the result of this complex behind-the-scenes process is easy to interpret as a tree chart.