## Wednesday, June 11, 2008

### Summaries or graphs?

Herb Edelstein from Two Crows consulting introduced me to this neat example showing how graphs are much more revealing than summary statistics. This is an age-old example by Anscombe (1973). I will show a slightly updated version of Anscombe's example, by Basset et al. (1986):
We have four datasets, each containing 11 pairs of X and Y measurements. All four datasets share the same X variable and differ only in their Y values.

Here are the summary statistics for each of the four Y variables (A, B, C, D):

| | A | B | C | D |
|---|---|---|---|---|
| Average | 20.95 | 20.95 | 20.95 | 20.95 |
| Std | 1.495794 | 1.495794 | 1.495794 | 1.495794 |
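
To see how little a mean and a standard deviation pin down, here is a minimal Python sketch. The numbers are made up for illustration and are not the Basset et al. data; `summarize` is a hypothetical helper name, and I am assuming "Std" means the sample standard deviation (n - 1 in the denominator).

```python
from statistics import mean, stdev

# Hypothetical values, NOT the Basset et al. data:
# the same five numbers listed in two different orders.
y_first = [1.0, 2.0, 3.0, 4.0, 5.0]
y_second = [3.0, 5.0, 1.0, 4.0, 2.0]

def summarize(values):
    """Return (mean, sample standard deviation) of a sequence of numbers."""
    return mean(values), stdev(values)

# Both series produce the same pair of summary numbers.
print(summarize(y_first))
print(summarize(y_second))
```

Paired against the same X values, these two series would trace completely different patterns, yet the summary table cannot tell them apart.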

That's right - the means and standard deviations are all identical. Now let's go one step further and fit the four simple linear regression models Y = a + bX + noise. Remember, X is the same in all four datasets. Here is the output for the first dataset:

| Regression Statistics | |
|---|---|
| Multiple R | 0.620844098 |
| R Square | 0.385447394 |
| Adjusted R Square | 0.317163771 |
| Standard Error | 1.236033081 |
| Observations | 11 |

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 18.43 | 1.12422813 | 16.39347 | 5.2E-08 |
| Slope | 0.28 | 0.11785113 | 2.375879 | 0.041507 |
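
For readers who want to see where these numbers come from, here is a rough sketch of a simple least-squares fit in plain Python. The function name `simple_ols` and the four data points are my own illustration, not the data from the example, and the P-value column is left out because it needs a t-distribution that the standard library does not provide.

```python
from math import sqrt

def simple_ols(x, y):
    """Fit y = a + b*x by least squares and return the main summary numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual sum of squares and the regression "Standard Error".
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_error = sqrt(sse / (n - 2))

    r_squared = 1 - sse / syy
    adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - 2)

    # Standard errors of the two coefficients.
    se_slope = std_error / sqrt(sxx)
    se_intercept = std_error * sqrt(1 / n + mean_x ** 2 / sxx)

    return {
        "intercept": intercept,
        "slope": slope,
        "se_intercept": se_intercept,
        "se_slope": se_slope,
        "t_intercept": intercept / se_intercept,
        "t_slope": slope / se_slope,
        "r_squared": r_squared,
        "adj_r_squared": adj_r_squared,
        "std_error": std_error,
        "observations": n,
    }

# Hypothetical data, only to show the call.
x_demo = [0.0, 1.0, 2.0, 3.0]
y_demo = [1.0, 3.0, 2.0, 4.0]
fit = simple_ols(x_demo, y_demo)
for name, value in fit.items():
    print(name, value)
```

As a quick check against the output above, each t Stat is just the coefficient divided by its standard error: 0.28 / 0.11785113 is about 2.3759, which matches the reported value for the slope.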

Guess what? The other three regression outputs are identical!

So are the four Y variables identical?