From http://blog.excelmasterseries.com

Suppose we have a categorical variable with *m* categories (e.g., *m* countries). First, we must factor it into *m* binary variables called dummy variables, *D1, D2, ..., Dm* (e.g., *D1* = 1 if Country = Japan and 0 otherwise; *D2* = 1 if Country = USA and 0 otherwise, etc.). Then, we include *m−1* of the dummy variables in the regression model. The major point is to **exclude one of the *m* dummy variables** to avoid redundancy. The excluded dummy's category is called the "reference category". Mathematically, it does not matter which dummy you exclude, although the resulting coefficients will be interpreted relative to the reference category, so if interpretation is important it is useful to choose as the reference category the one we most want to compare against.

In linear and logistic regression models, including all *m* dummy variables will lead to perfect multicollinearity, which will typically cause the estimation algorithm to fail. Smarter software will identify the problem and drop one of the dummies for you. That is why every statistics book or course on regression emphasizes the need to drop one of the dummy variables.
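As a quick sketch of the regression-style coding described above (assuming pandas is available; the "Country" column and its categories are made up for illustration), `drop_first=True` produces the *m−1* dummies, with the dropped category acting as the reference:

```python
import pandas as pd

# Hypothetical data: a categorical predictor with m = 3 categories.
df = pd.DataFrame({"Country": ["Japan", "USA", "France", "USA", "Japan"]})

# For regression: create m - 1 = 2 dummies. The first category
# (alphabetically) is dropped and becomes the reference category.
reg_dummies = pd.get_dummies(df["Country"], prefix="D", drop_first=True)

print(reg_dummies.shape[1])  # 2 dummy columns for 3 categories
```

A row of all zeros in `reg_dummies` then identifies the reference category implicitly, which is exactly why the third dummy is redundant in a model with an intercept.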

Now comes the surprising part: when using categorical predictors **in machine learning algorithms** such as k-nearest neighbors (kNN) or classification and regression trees (CART), **we keep all *m* dummy variables**. The reason is that in such algorithms we do not create linear combinations of all predictors. A tree, for instance, will choose a subset of the predictors. If we leave out one dummy, then if that category differs from the other categories in terms of the output of interest, the tree will not be able to detect it! Similarly, dropping a dummy in kNN would not incorporate the effect of belonging to that category into the distance used.

The only case where dummy variable inclusion is treated equally across methods is for a two-category predictor, such as Gender. In that case a single dummy variable will suffice in regression, kNN, CART, or any other data mining method.
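The kNN point above can be made concrete with a small sketch (assuming plain Euclidean distance, as kNN commonly uses; the three-country example is made up for illustration). With all *m* dummies, every pair of distinct categories is equally far apart; dropping one dummy makes the reference category artificially closer to the others:

```python
import math

# Full dummy coding: all m = 3 dummies (Japan, USA, France).
full = {"Japan": (1, 0, 0), "USA": (0, 1, 0), "France": (0, 0, 1)}

# Regression-style coding: m - 1 dummies, France dropped as the reference.
dropped = {"Japan": (1, 0), "USA": (0, 1), "France": (0, 0)}

def dist(a, b):
    """Euclidean distance between two dummy-coded points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# All m dummies: every pair of distinct categories is sqrt(2) apart.
print(dist(full["Japan"], full["USA"]))     # sqrt(2)
print(dist(full["Japan"], full["France"]))  # sqrt(2)

# One dummy dropped: France now sits closer to the other categories,
# distorting the distances kNN relies on.
print(dist(dropped["Japan"], dropped["USA"]))     # sqrt(2)
print(dist(dropped["Japan"], dropped["France"]))  # 1.0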