Classification tree for predicting vulnerability to famine
Here's an interesting example from the paper "Identifying Indicators of Vulnerability to Famine and Chronic Food Insecurity" by Yohannes and Webb, showing predictors of vulnerability to famine based on a survey of households. The image shows all the predictors that the tree identified; each one appears below a circle. Each predictor is a binary variable, and you go right or left depending on its value. It is easiest to start reading from the top, with a single household in mind.
Our goal is to generate groups of households with similar profiles, where a profile is the combination of answers to different survey questions. One way to picture how a tree does this is to imagine building it as a pivot table:
- Drag the outcome of interest into the Values area
- Find the predictor that best splits the profiles and drag it into the Row Labels field*.
- Given the first predictor, find the next predictor to further split the profiles, and drag it into the Row Labels field**.
- Given the first two splits, find the next predictor to further split the profiles (it could also be one of the earlier variables) and drag it into the Row Labels field***.
- Continue this process until some stopping criterion, chosen to avoid over-fitting, is reached.
You might imagine the final result as a really crowded pivot table, with multiple predictors in the Row Labels field. This is indeed quite close, except for three differences:
* Each time a predictor is dragged into the Row or Column Labels fields, it is converted into a binary variable, creating only two classes. For example,
- Gender would not change (Female/Male)
- Country could be turned into "India/Other".
- Noncereal yield was discretized into "Above/Below 4.7".
** After a predictor is dragged, the next predictor is actually dragged only into one of the two splits of the first predictor. In our example, after dragging noncereal yield (Above/Below 4.7), the predictor oxen owned (Above/Below 1.5) only applies to noncereal yield Below 4.7.
*** We also note that a tree can "drag" a predictor more than once into the Row Labels fields. For example, TLU/capita appears twice in the tree, so theoretically in the pivot table we'd drag TLU/capita after oxen owned and again after crop diversity.
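These differences can be made concrete with a small sketch that stores a tree as nested dictionaries: every split is binary, and the second predictor lives inside only one branch of the first. Only the two splits named above (noncereal yield at 4.7, then oxen owned at 1.5 on the "below" branch) follow the example. The leaf labels, the field names, and the helper `route` are hypothetical placeholders standing in for the rest of the tree, not results from the paper.

```python
# A tree as nested dicts: each node tests one predictor against one cut-off.
# Only the two splits named in the text come from the example; the leaf
# labels ("group A", ...) are hypothetical stand-ins for the remaining subtrees.
tree = {
    "predictor": "noncereal_yield",
    "threshold": 4.7,
    "below": {
        # This split applies only to households with noncereal yield below 4.7.
        "predictor": "oxen_owned",
        "threshold": 1.5,
        "below": "group A",
        "above": "group B",
    },
    "above": "group C",
}


def route(household, node):
    """Follow binary splits from the top until a leaf (a string) is reached."""
    while isinstance(node, dict):
        side = "below" if household[node["predictor"]] < node["threshold"] else "above"
        node = node[side]
    return node


# Oxen owned is never consulted for a household with noncereal yield above 4.7.
print(route({"noncereal_yield": 6.0, "oxen_owned": 0}, tree))  # group C
print(route({"noncereal_yield": 3.0, "oxen_owned": 2}, tree))  # group B
```

A predictor that appears twice in the tree would simply show up as the `"predictor"` of two different nested nodes, which is something a single pivot table layout cannot express.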
So where is the "intelligence" of a tree over an elaborate pivot table? First, it automatically determines which predictor is the best one to use at each stage. Second, it automatically determines the value on which to split. Third, it knows when to stop, to avoid over-fitting the data. In a pivot table, the user would have to decide which predictors to include, their order, and the critical values to split on. And finally, the result of this complex behind-the-scenes process is easy to interpret as a tree chart.
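The three automatic decisions can be sketched in a few lines of plain Python. This is a minimal illustration that assumes a CART-style search using Gini impurity and two simple stopping rules (`max_depth` and `min_size`); the paper's actual algorithm and software may differ, and the toy households below are made up for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 when all labels in the group agree."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Try every predictor and every midpoint cut-off; keep the purest split."""
    best = None  # (weighted impurity, predictor, threshold)
    for name in rows[0]:
        values = sorted({r[name] for r in rows})
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[name] < cut]
            right = [y for r, y in zip(rows, labels) if r[name] >= cut]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, name, cut)
    return best


def grow(rows, labels, depth=0, max_depth=3, min_size=2):
    """Split recursively; stop when a group is pure, too small, or too deep."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(labels) < min_size or gini(labels) == 0.0:
        return majority
    found = best_split(rows, labels)
    if found is None:  # no predictor has two distinct values left
        return majority
    _, name, cut = found
    below = [(r, y) for r, y in zip(rows, labels) if r[name] < cut]
    above = [(r, y) for r, y in zip(rows, labels) if r[name] >= cut]
    return {
        "predictor": name,
        "threshold": cut,
        "below": grow(*zip(*below), depth + 1, max_depth, min_size),
        "above": grow(*zip(*above), depth + 1, max_depth, min_size),
    }


# Made-up toy households (not data from the paper).
rows = [
    {"oxen_owned": 0, "crop_diversity": 2},
    {"oxen_owned": 1, "crop_diversity": 5},
    {"oxen_owned": 2, "crop_diversity": 3},
    {"oxen_owned": 3, "crop_diversity": 4},
]
labels = ["vulnerable", "vulnerable", "not vulnerable", "not vulnerable"]
print(grow(rows, labels))
```

On this toy data the search picks oxen owned with a cut-off of 1.5, because that split separates the two outcomes perfectly, and then stops since both resulting groups are pure. Real implementations usually add further safeguards against over-fitting, such as pruning, which this sketch leaves out.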