Tuesday, March 10, 2009

What R-squared is (and is not)

R-squared (aka the "coefficient of determination", or R2 for short) is a popular measure used in linear regression to assess the strength of the linear relationship between the inputs and the output. In a model with a single input, R2 is simply the squared correlation coefficient between the input and the output.
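As a quick illustration of that last point, here is a small Python sketch (my own illustration, not from the original post) showing that for a single-input regression, the R2 of the fitted line matches the squared correlation coefficient between X and Y:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)           # a linear relationship plus noise

# Fit Y = b0 + b1*X by ordinary least squares
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

ss_res = np.sum((y - y_hat) ** 2)            # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)         # total sum of squares
r_squared = 1 - ss_res / ss_tot

corr = np.corrcoef(x, y)[0, 1]               # Pearson correlation between X and Y
print(round(r_squared, 4), round(corr ** 2, 4))   # the two values agree
```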

If you examine a few textbooks in statistics or econometrics, you will find several definitions of R2. The most common definition is "the percent of variation in the output (Y) explained by the inputs (X's)". Another definition is "a measure of predictive power" (check out Wikipedia!). And finally, R2 is often called a goodness-of-fit measure. Try a quick Google search of "R-squared" and "goodness of fit". I even discovered a Journal of Econometrics article entitled An R-squared measure of goodness of fit for some common nonlinear regression models.

The first definition is correct, even if it might sound overly complicated to a non-statistical ear.

As for R2 being a measure of predictive power, this is an unfortunately popular misconception. There are several problems with R2 that make it a poor measure of predictive accuracy:
  1. R2 always increases as you add inputs, whether or not they contain useful information. This technical inflation in R2 is usually overcome by using an alternative metric, adjusted R2, which penalizes R2 for the number of inputs included.

  2. R2 is computed from the same sample of data that was used for fitting the linear regression model. Hence, it is "biased" towards those data and is therefore likely to be over-optimistic in measuring the predictive power of the model on new data. This is part of a larger issue related to performance evaluation metrics: the best way to assess the predictive power of a model is to test it on new data. For more on this, check out my recent working paper "To Explain or To Predict?" Both issues are illustrated in the sketch below.
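To make these two points concrete, here is a small Python sketch (my own illustration, using scikit-learn for convenience; it is not taken from the post or the working paper): adding a pure-noise input inflates the in-sample R2, while adjusted R2 and R2 computed on held-out data are not fooled.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def adj_r2(r2, n, p):
    # Adjusted R2 penalizes R2 for the number of inputs p (n = sample size)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)                      # informative input
noise_col = rng.normal(size=n)               # pure-noise input, unrelated to Y
y = 3 * x1 + rng.normal(size=n)

X_small = x1.reshape(-1, 1)                  # model with one input
X_big = np.column_stack([x1, noise_col])     # same model plus a useless input

for X in (X_small, X_big):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    r2_in = model.score(X_tr, y_tr)          # in-sample R2: never decreases when an input is added
    r2_out = model.score(X_te, y_te)         # out-of-sample R2: typically does not improve
    print(X.shape[1], "inputs:",
          round(r2_in, 3),
          round(adj_r2(r2_in, len(y_tr), X.shape[1]), 3),
          round(r2_out, 3))
```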
Finally, the popular labeling of R2 as a goodness-of-fit measure is, in fact, incorrect. This was pointed out to me by Prof. Jeff Simonoff from NYU. R2 does not measure how well the data fit the linear model but rather how strong the linear relationship is. Jeff calls it a strength-of-fit measure.

Here's a cool example (thanks Jeff!): if you simulate two columns of uncorrelated normal variables, call them X and Y, and then fit a regression of Y on X, you will get a very low R2 (practically zero). This indicates that there is no linear relationship between X and Y. However, the true model here is a regression of Y on X with a slope of zero, and the fitted line indeed has a slope of essentially zero. In that sense, the data fit the zero-slope model very well, yet R2 tells us nothing of this good fit.
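Here is a quick Python reconstruction of that simulation (my own sketch, assuming standard normal draws and ordinary least squares):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=1000)
y = rng.normal(size=1000)                    # independent of x by construction

b1, b0 = np.polyfit(x, y, 1)                 # fitted slope is near zero
y_hat = b0 + b1 * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Slope ~0 and R2 ~0: no linear *strength*, yet the data are described
# very well by the zero-slope model.
print(round(b1, 3), round(r2, 4))
```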
