Tuesday, July 24, 2012

Linear regression for binary outcome: even better news

I recently attended the 8th World Congress in Probability and Statistics, where I heard an interesting talk by Andy Tsao. His talk "Naivity can be good: a theoretical study of naive regression" (Abstract #0586) was about Naive Regression: applying linear regression to a categorical outcome while treating the outcome as numerical. He asserted that predictions from Naive Regression will be quite good. My last post was about the "goodness" of a linear regression applied to a binary outcome in terms of the estimated coefficients, which is what explanatory modeling is about. What Dr. Tsao alerted me to is that the predictions (or more correctly, classifications) will be good too. In other words, it's useful for predictive modeling! In his words:
"This naivity is not blessed from current statistical or machine learning theory. However, surprisingly, it delivers good or satisfactory performances in many applications."
Note that to derive a classification from naive regression, you treat the prediction as the class probability (although it might be negative or >1), and apply a cutoff value as in any other classification method.
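
To make the recipe concrete, here is a minimal sketch in Python (NumPy only). The simulated data, the function names and the 0.5 cutoff are my own assumptions for illustration; they are not taken from Dr. Tsao's talk.

```python
import numpy as np

def fit_naive_regression(X, y):
    """Ordinary least squares of a 0/1 outcome on X (an intercept is added)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_score(beta, X):
    """Fitted values, read as class-1 'probabilities' (may fall outside [0, 1])."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ beta

def classify(scores, cutoff=0.5):
    """Apply a cutoff to the scores, as in any other classification method."""
    return (scores >= cutoff).astype(int)

# Simulated two-class data (an assumption, for illustration only)
rng = np.random.default_rng(0)
n = 500
X = np.vstack([rng.normal(0.0, 1.0, (n, 2)), rng.normal(1.5, 1.0, (n, 2))])
y = np.repeat([0, 1], n)

beta = fit_naive_regression(X, y)
scores = predict_score(beta, X)
labels = classify(scores)
accuracy = (labels == y).mean()
print("training accuracy:", accuracy)
print("score range:", scores.min(), scores.max())
```

The cutoff need not be 0.5; as with any classifier, it can be moved to trade off the two types of misclassification.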

Dr. Tsao pointed me to the good old book The Elements of Statistical Learning, which has a section called Linear Regression of an Indicator Matrix. There are two interesting takeaways from Dr. Tsao's talk:
  1. Naive Regression and Linear Discriminant Analysis will have the same ROC curve, meaning that the ranking of predictions will be identical.
  2. If the two groups are of equal size (n1=n2), then Naive Regression and Linear Discriminant Analysis are equivalent and therefore produce the same classifications.
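
Both takeaways can be checked numerically. The sketch below is my own illustration rather than anything shown in the talk: it simulates two equal-sized groups, fits Naive Regression by least squares, and computes a textbook two-class LDA score by hand (pooled covariance, equal priors). The data, variable names and cutoffs (0.5 for the regression, 0 for the LDA score) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300                                  # per group, so n1 == n2
X0 = rng.normal(0.0, 1.0, (n, 3))        # class 0
X1 = rng.normal(1.0, 1.0, (n, 3))        # class 1
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

# Naive Regression: least squares on the 0/1 outcome, cutoff 0.5
A = np.column_stack([np.ones(2 * n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
reg_scores = A @ beta
reg_labels = (reg_scores >= 0.5).astype(int)

# Two-class LDA: direction = pooled covariance^{-1} (mu1 - mu0), equal priors
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (2 * n - 2)
w = np.linalg.solve(S, mu1 - mu0)
lda_scores = (X - (mu0 + mu1) / 2) @ w
lda_labels = (lda_scores >= 0).astype(int)

# Same ranking: the two scores should be a positive linear function of each other
score_corr = np.corrcoef(reg_scores, lda_scores)[0, 1]
# Same classifications when the groups are of equal size
agreement = (reg_labels == lda_labels).mean()
print("correlation of scores:", score_corr)
print("share of identical classifications:", agreement)
```

With unequal group sizes the scores should still rank the records identically, but the two default cutoffs no longer coincide, so the classifications can differ.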


Will Dwinnell said...

It is not clear to me why this application of linear models is considered by you "naïve". Actually, use of linear probability models is well-studied and has a history going back to at least the 1960s. See, for instance, "Linear Probability, Logit and Probit Models" by Aldrich and Nelson, copyright 1984.

Galit Shmueli said...

Hi Will,
Thanks for this comment and for the cool reference (many parts available on Amazon Look-Inside!).

Indeed, the Linear Probability Model (LPM) is old, but here are two points that are relatively unknown, at least in my circles:

1. What about LPM predictions? Literature on LPM talks about the properties of the beta coefficients but not about the predictions (Aldrich and Nelson touch upon this briefly in Section 1.2.3). This was the motivation for this particular post. Data miners love the term "naive", so when it comes to prediction, they call LPM "naive".

2. What about LPM with large samples? Researchers, and especially statisticians, tend to discourage the LPM altogether. But what happens in the realm of very large samples? Aldrich and Nelson say "estimates of the sampling variances will not be correct, and any hypothesis tests or confidence intervals based on these sampling variances will be invalid, even for very large samples" (p. 13-14). They propose a 2-step weighted-least-squares solution. However, I think a much simpler solution is the bootstrap, and with a large sample you can even do a "rich man's bootstrap" by using lots of subsamples instead of resampling.
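
A rough sketch of what I mean by the subsample idea, in Python. Everything here is an assumption for illustration: the simulated data, the choice of 100 disjoint subsamples, and in particular the last step, which rescales the subsample-level standard error to the full sample by assuming the usual square-root-of-n rate.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients (intercept first) for the LPM."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# Simulated large sample with a binary outcome (for illustration only)
rng = np.random.default_rng(2)
n, n_sub = 100_000, 100
x = rng.normal(size=(n, 1))
p = 1.0 / (1.0 + np.exp(-(0.5 + x[:, 0])))
y = (rng.random(n) < p).astype(float)

# Split the shuffled sample into disjoint subsamples and fit the LPM on each
perm = rng.permutation(n)
chunks = np.array_split(perm, n_sub)
betas = np.array([ols(x[idx], y[idx]) for idx in chunks])

# Spread of the coefficients across subsamples: an empirical standard error
# at the subsample size (n / n_sub records each)
se_sub = betas.std(axis=0, ddof=1)
# Assumption: rescale to the full sample size using the sqrt(n) rate
se_full = se_sub / np.sqrt(n_sub)
print("coefficients (mean over subsamples):", betas.mean(axis=0))
print("SE at subsample size:", se_sub)
print("rescaled SE for full sample:", se_full)
```

No variance formula is used anywhere, so the heteroscedasticity that invalidates the usual LPM standard errors does not enter the calculation.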