Tuesday, May 13, 2008

Why zipcodes take over trees

A few weeks ago I went up to West Point to present a talk at their 2008 Statistics Workshop. Another speaker was Professor Wei-Yin Loh, from the University of Wisconsin. He gave a very interesting talk that touched upon one aspect of classification and regression trees: selection bias. Because splits in trees are constructed by trying out all possible variables at all possible values, a variable with lots and lots of categories (e.g., Zipcode) offers far more candidate splits than the other variables, so it will likely get selected! Professor Loh developed his own tree software, GUIDE, that overcomes this issue. The principle is to first choose which predictor to split on (based on chi-square tests of independence), and only after a predictor is chosen is the search for the right split done.
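To see the bias in action, here is a small simulation sketch in plain Python. It is my own illustration of the idea, not GUIDE itself (GUIDE's actual procedure involves more than this). I assume a binary outcome that is pure noise and two equally useless predictors: a hypothetical two-level `flag` and a hypothetical 20-level `zipcode`. The helper names (`best_gini_gain`, `chi2_pvalue`, `simulate`) and the use of Gini impurity are my choices for the example. An exhaustive search picks the variable with the largest impurity decrease, while the chi-square approach picks the variable with the smallest p-value from a test of independence.

```python
import math
import random
from collections import Counter


def gini(pos, n):
    """Gini impurity of a node holding n cases, pos of them with y = 1."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2 * p * (1 - p)


def best_gini_gain(x, y):
    """Largest Gini decrease over binary splits of a categorical predictor x."""
    n, pos = len(y), sum(y)
    count, ones = Counter(x), Counter()
    for xi, yi in zip(x, y):
        ones[xi] += yi
    # For a 0/1 outcome, sorting categories by their share of 1s and scanning
    # the cut points along that order finds the best subset split.
    cats = sorted(count, key=lambda c: ones[c] / count[c])
    best, left_n, left_pos = 0.0, 0, 0
    for c in cats[:-1]:
        left_n += count[c]
        left_pos += ones[c]
        right_n, right_pos = n - left_n, pos - left_pos
        children = (left_n * gini(left_pos, left_n)
                    + right_n * gini(right_pos, right_n)) / n
        best = max(best, gini(pos, n) - children)
    return best


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (closed-form series)."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        term = total = 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    term, total = 1.0, 0.0
    for k in range(1, (df - 1) // 2 + 1):
        if k > 1:
            term *= x / (2 * k - 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2 * x / math.pi) * math.exp(-half) * total)


def chi2_pvalue(x, y):
    """P-value of the Pearson chi-square test of independence between x and y."""
    n = len(y)
    x_tot, y_tot, cell = Counter(x), Counter(y), Counter(zip(x, y))
    df = (len(x_tot) - 1) * (len(y_tot) - 1)
    if df == 0:
        return 1.0
    stat = 0.0
    for a in x_tot:
        for b in y_tot:
            expected = x_tot[a] * y_tot[b] / n
            stat += (cell[(a, b)] - expected) ** 2 / expected
    return chi2_sf(stat, df)


def simulate(trials=300, n=400, n_zip=20, seed=0):
    """Share of trials in which zipcode beats flag under each selection rule."""
    rng = random.Random(seed)
    gini_wins = chi2_wins = 0
    for _ in range(trials):
        y = [rng.randint(0, 1) for _ in range(n)]             # pure-noise outcome
        flag = [rng.randint(0, 1) for _ in range(n)]          # 2 categories
        zipcode = [rng.randrange(n_zip) for _ in range(n)]    # many categories
        gini_wins += best_gini_gain(zipcode, y) > best_gini_gain(flag, y)
        chi2_wins += chi2_pvalue(zipcode, y) < chi2_pvalue(flag, y)
    return gini_wins / trials, chi2_wins / trials


exhaustive_share, chi2_share = simulate()
print(f"zipcode chosen by exhaustive search: {exhaustive_share:.2f}")
print(f"zipcode chosen by chi-square test:   {chi2_share:.2f}")
```

Since neither predictor carries any information, a fair rule should pick each about half the time. In this setup the exhaustive search should choose `zipcode` in nearly every trial, while the chi-square rule should land close to an even split, because the p-value already accounts for the number of categories through its degrees of freedom.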