Wednesday, June 11, 2008

Summaries or graphs?

Herb Edelstein from Two Crows consulting introduced me to this neat example showing how graphs can be much more revealing than summary statistics. It is an age-old example by Anscombe (1973). I will show a slightly updated version of Anscombe's example, by Basset et al. (1986).

We have four datasets, each containing 11 pairs of X and Y measurements. All four datasets share the same X variable and differ only in their Y values.

Here are the summary statistics for each of the four Y variables (A, B, C, D):

            A          B          C          D
Average     20.95      20.95      20.95      20.95
Std         1.495794   1.495794   1.495794   1.495794


That's right - the means and standard deviations are all identical. Now let's go one step further and fit the four simple linear regression models Y = a + bX + noise. Remember, X is the same in all four datasets. Here is the output for the first dataset:
Regression Statistics
Multiple R           0.620844098
R Square             0.385447394
Adjusted R Square    0.317163771
Standard Error       1.236033081
Observations         11

             Coefficients   Standard Error   t Stat     P-value
Intercept    18.43          1.12422813       16.39347   5.2E-08
Slope        0.28           0.11785113       2.375879   0.041507

Guess what? The other three regression outputs are identical!

So are the four Y variables identical???


Well, here is the answer:


To top it off, Basset included one more dataset that has the exact same summary stats and regression estimates. Here is the scatterplot:


[You can find all the data for both Anscombe's and Basset et al.'s examples here]
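
If you want to check this kind of thing yourself, here is a minimal Python sketch. It does not use the Basset et al. data (which are not reproduced in this post); instead it uses the first three of Anscombe's original datasets, which also share a single X variable, with the values as they are commonly tabulated. The function name summarize and the variable names are my own choices for illustration.

```python
from statistics import mean, stdev

# Anscombe's (1973) datasets I-III as commonly tabulated; all three share X.
# (Assumption: these are NOT the Basset et al. values used in the post.)
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def summarize(x, y):
    """Return the mean and sample std of y, plus the least-squares fit of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx                     # b in Y = a + bX + noise
    intercept = y_bar - slope * x_bar     # a in Y = a + bX + noise
    r_squared = sxy ** 2 / (sxx * syy)    # squared correlation of x and y
    return {
        "mean_y": y_bar,
        "std_y": stdev(y),
        "intercept": intercept,
        "slope": slope,
        "r_squared": r_squared,
    }


for name, y in (("I", Y1), ("II", Y2), ("III", Y3)):
    rounded = {k: round(v, 2) for k, v in summarize(X, y).items()}
    print(name, rounded)
```

The three printed rows should agree to two decimal places, even though the Y values themselves are clearly different. Only a scatterplot of each Y against X shows how different.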

1 comment:

Galit Shmueli said...

Slides 3-4 from a presentation by Gramener show another neat example (with data from India) of the same phenomenon:
http://blog.gramener.com/84/workshop-on-visual-analytics-slides