"This naivity is not blessed from current statistical or machine learning theory. However, surprisingly, it delivers good or satisfactory performances in many applications."Note that to derive a classification from naive regression, you treat the prediction as the class probability (although it might be negative or >1), and apply a cutoff value as in any other classification method.
Dr. Tsao pointed me to the good old The Elements of Statistical Learning, which has a section called Linear Regression of an Indicator Matrix. There are two interesting takeaway from Dr. Tsao's talk:
- Naive Regression and Linear Discriminant Analysis will have the same ROC curve, meaning that the ranking of predictions will be identical.
- If the two groups are of equal size (n1 = n2), then Naive Regression and Linear Discriminant Analysis are equivalent and therefore produce the same classifications.
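The two takeaways above can be checked numerically. The following is a minimal sketch on simulated data, not Dr. Tsao's own code: the group means, sample sizes, seed, and all variable names are assumptions chosen for illustration. It fits a naive regression on a 0/1 outcome, computes the LDA direction by hand from the pooled covariance, and compares the two sets of scores.

```python
import numpy as np

# Simulated two-group data (illustrative assumption, equal group sizes).
rng = np.random.default_rng(0)
n_per_class = 100
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_per_class, 2))
X1 = rng.normal(loc=[1.0, 0.5], scale=1.0, size=(n_per_class, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

# Naive regression: ordinary least squares on the 0/1 outcome.
A = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
ols_score = A @ beta  # "probabilities" that may fall outside [0, 1]

# LDA direction: pooled within-group covariance and the mean difference.
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (len(y) - 2)
w = np.linalg.solve(S, mu1 - mu0)
lda_score = X @ w

# Takeaway 1: the scores are a positive linear transform of each other,
# so they rank observations identically and share one ROC curve.
score_corr = np.corrcoef(ols_score, lda_score)[0, 1]

# Takeaway 2: with n1 = n2, a 0.5 cutoff on the regression score should
# match the LDA rule with equal priors (boundary at the midpoint of the means).
ols_class = (ols_score > 0.5).astype(int)
midpoint = (mu0 + mu1) / 2
lda_class = ((X - midpoint) @ w > 0).astype(int)
same_classes = bool(np.array_equal(ols_class, lda_class))

print(f"correlation between scores: {score_corr:.6f}")
print(f"identical classifications:  {same_classes}")
```

With unequal group sizes the ranking should still agree, but the 0.5 cutoff and the LDA boundary generally no longer coincide, so the classifications can differ.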
It is not clear to me why you consider this application of linear models "naïve". Actually, the use of linear probability models is well-studied and has a history going back to at least the 1960s. See, for instance, "Linear Probability, Logit and Probit Models" by Aldrich and Nelson, copyright 1984.
Thanks for this comment and for the cool reference (many parts available on Amazon Look-Inside!).
Indeed, the Linear Probability Model (LPM) is old, but here are the two relatively unknown points, at least in my circles:
1. What about LPM predictions? Literature on LPM talks about the properties of the beta coefficients but not about the predictions (Aldrich and Nelson touch upon this briefly in Section 1.2.3). This was the motivation for this particular post. Data miners love the term "naive", so when it comes to prediction, they call LPM "naive".
2. What about LPM with large samples? Researchers, and especially statisticians, tend to discourage the LPM altogether. But what happens in the realm of very large samples? Aldrich and Nelson say "estimates of the sampling variances will not be correct, and any hypothesis tests or confidence intervals based on these sampling variances will be invalid, even for very large samples" (pp. 13-14). They propose a 2-step weighted-least-squares solution. However, I think a much simpler solution is the bootstrap, and with a large sample you can even do a "rich man's bootstrap", using lots of subsamples instead of resampling.
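To make the second point concrete, here is a rough sketch comparing an ordinary bootstrap standard error for an LPM slope with a subsample-based one. It rests on several assumptions that are not in the discussion above: the data are simulated, "rich man's bootstrap" is read as splitting the sample into disjoint chunks, and the helper `lpm_slope` and all sizes are made up for illustration.

```python
import numpy as np

# Simulated large sample with a binary outcome (illustrative assumption).
rng = np.random.default_rng(1)
n = 20_000
x = rng.uniform(-1.0, 1.0, size=n)
p = 0.3 + 0.1 * x  # true probabilities, all inside (0, 1)
y = (rng.uniform(size=n) < p).astype(float)

def lpm_slope(x, y):
    """Hypothetical helper: OLS slope of a 0/1 outcome on one predictor."""
    A = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1]

slope_full = lpm_slope(x, y)

# Ordinary bootstrap: resample n rows with replacement, refit, repeat.
n_boot = 200
boot = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)
    boot[b] = lpm_slope(x[idx], y[idx])
se_boot = boot.std(ddof=1)

# Subsample version (one reading of "rich man's bootstrap"): split the
# data into disjoint chunks, refit on each, and scale the spread of the
# chunk estimates to the full sample size.
n_chunks = 50
perm = rng.permutation(n)
chunks = np.array_split(perm, n_chunks)
chunk_slopes = np.array([lpm_slope(x[c], y[c]) for c in chunks])
se_sub = chunk_slopes.std(ddof=1) / np.sqrt(n_chunks)

print(f"slope: {slope_full:.4f}")
print(f"bootstrap SE: {se_boot:.5f}   subsample SE: {se_sub:.5f}")
```

Neither standard error relies on the usual OLS variance formula, which is the part Aldrich and Nelson warn about. The division by the square root of the number of chunks is a simple rescaling that assumes the chunk estimates are roughly independent and that the standard error shrinks at the usual root-n rate.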