Wednesday, August 19, 2015

Categorical predictors: how many dummies to use in regression vs. k-nearest neighbors

Recently I've had discussions with several instructors of data mining courses about a fact that is often left out of many books, but is quite important: different treatment of dummy variables in different data mining methods.

Statistics courses that cover linear or logistic regression teach us to be careful when including a categorical predictor variable in our model. Suppose that we have a categorical variable with m categories (e.g., m countries). First, we must factor it into m binary variables called dummy variables, D1, D2,..., Dm (e.g., D1=1 if Country=Japan and 0 otherwise; D2=1 if Country=USA and 0 otherwise, etc.) Then, we include m-1 of the dummy variables in the regression model. The major point is to exclude one of the m dummy variables to avoid redundancy. The excluded dummy's category is called the "reference category". Mathematically, it does not matter which dummy you exclude, although the resulting coefficients will be interpreted relative to the reference category, so if interpretation is important it's useful to choose the reference category as the one we most want to compare against.

In linear and logistic regression models, including all m variables will lead to perfect multicollinearity, which will typically cause failure of the estimation algorithm. Smarter software will identify the problem and drop one of the dummies for you. That is why every statistics book or course on regression will emphasize the need to drop one of the dummy variables.

Now comes the surprising part: when using categorical predictors in machine learning algorithms such as k-nearest neighbors (kNN) or classification and regression trees, we keep all m dummy variables. The reason is that in such algorithms we do not create linear combinations of all predictors. A tree, for instance, will choose a subset of the predictors. If we leave out one dummy, then if that category differs from the other categories in terms of the output of interest, the tree will not be able to detect it! Similarly, dropping a dummy in kNN would not incorporate the effect of belonging to that category into the distance used.

The only case where dummy variable inclusion is treated equally across methods is for a two-category predictor, such as Gender. In that case a single dummy variable will suffice in regression, kNN, CART, or any other data mining method.

No comments: