Thursday, March 06, 2014

The use of dummy variables in predictive algorithms

Anyone who has taken a statistics course covering linear regression has heard some version of the rule for pre-processing a categorical predictor with more than two categories into binary dummy (indicator) variables:
"If a variable has k levels, you can create only k-1 indicators. You have to choose one of the k categories as a "baseline" and leave out its indicator." (from Business Statistics by Sharpe, De Veaux & Velleman)
Technically, one can easily create k dummy variables for k categories in any software. The reason for not including all k dummies as predictors in a linear regression is to avoid perfect multicollinearity: the k dummies sum to one in every record, exactly duplicating the intercept column, so an exact linear relationship exists among the predictors. Perfect multicollinearity causes computational and interpretation challenges (see slide #6). This k-dummies issue is also called the Dummy Variable Trap.
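To see the trap numerically, here is a minimal sketch (toy data, numpy only; all names are made up) showing that an intercept plus all k dummies yields a rank-deficient design matrix, while dropping one dummy restores full column rank:

```python
import numpy as np

# A toy categorical predictor with k = 3 levels and n = 6 records.
levels = np.array([0, 1, 2, 0, 1, 2])

# Full set of k dummies: one column per level (one-hot encoding).
k_dummies = np.eye(3)[levels]

# Intercept plus all k dummies: the dummy columns sum to 1 in every
# row, exactly duplicating the intercept column.
X_trap = np.column_stack([np.ones(6), k_dummies])
print(np.linalg.matrix_rank(X_trap))  # 3, not 4 -> X'X is singular

# Dropping one dummy (the "baseline") restores full column rank.
X_ok = np.column_stack([np.ones(6), k_dummies[:, 1:]])
print(np.linalg.matrix_rank(X_ok))    # 3 = number of columns
```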

While these guidelines are required for linear regression, which other predictive models require them? The k-1 dummy rule applies to models where all the predictors are considered together, as a linear combination. Therefore, in addition to linear regression models, the rule would apply to logistic regression models, discriminant analysis, and in some cases to neural networks.
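In practice, switching between the two encodings is often a one-argument change. A minimal sketch, assuming a reasonably recent pandas (whose get_dummies function has a drop_first parameter that produces the k-1 encoding):

```python
import pandas as pd

df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Jan", "Feb"]})

# k-1 encoding for regression-type models: drop_first=True drops the
# first level (alphabetically, "Feb" here) to serve as the baseline.
X_regression = pd.get_dummies(df["month"], drop_first=True)

# Full k encoding, usable by methods that evaluate predictors
# one at a time, such as trees.
X_tree = pd.get_dummies(df["month"])

print(X_regression.columns.tolist())  # ['Jan', 'Mar']
print(X_tree.columns.tolist())        # ['Feb', 'Jan', 'Mar']
```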

What happens if we use k-1 dummies in other predictive models? 
The choice of which dummy to drop does not affect the results of a regression model, but it can affect other methods. Consider a classification or regression tree. A tree evaluates predictors one at a time, so omitting one of the k dummies can produce an inferior predictive model. For example, suppose we have 12 monthly dummies and in reality only January differs from the other months (the outcome for January differs from that of all other months). Now we run a tree, omitting the January dummy and keeping the other 11 monthly dummies as inputs. The only way the tree can discover the January effect is by splitting on each of the 11 remaining dummies in turn, isolating January as the records where all 11 equal zero. This chain of 11 splits is far less efficient than a single split on the January dummy, as the sketch below illustrates.
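Here is a minimal sketch of that scenario, assuming scikit-learn and simulated data in which only January's outcome differs. With all 12 dummies the fitted tree is one split deep; with the January dummy dropped, it needs a chain of 11 splits to isolate January:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
months = rng.integers(0, 12, size=1200)  # 0 = January, ..., 11 = December
y = np.where(months == 0, 5.0, 0.0)      # only January's outcome differs

dummies = np.eye(12)[months]             # one dummy column per month

# All 12 dummies: a single split on the January dummy captures the effect.
full_tree = DecisionTreeRegressor().fit(dummies, y)
print(full_tree.get_depth())             # 1

# January dummy omitted: January is identifiable only as "all 11
# remaining dummies equal 0", so the tree must split on every one
# of them along a single path to isolate it.
reduced_tree = DecisionTreeRegressor().fit(dummies[:, 1:], y)
print(reduced_tree.get_depth())          # 11
```

With noisy rather than deterministic outcomes the contrast is the same in spirit: the reduced tree needs many more splits, and under a depth limit or pruning it may miss the January effect entirely.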

This post was inspired by a discussion in the recent Predictive Analytics 1 online course. The topic deserves more than a short post, yet I haven't seen it discussed thoroughly anywhere.
