Summaries or graphs?

Herb Edelstein from Two Crows consulting introduced me to this neat example showing how graphs can be much more revealing than summary statistics. It is an age-old example by Anscombe (1973). I will show a slightly updated version of Anscombe's example, by Basset et al. (1986).
We have four datasets, each containing 11 pairs of X and Y measurements. All four datasets share the same X variable and differ only in their Y values.

Here are the summary statistics for each of the four Y variables (A, B, C, D):

	A	B	C	D
Average	20.95	20.95	20.95	20.95
Std	1.495794	1.495794	1.495794	1.495794

That's right - the means and standard deviations are all identical. Now let's go one step further and fit the four simple linear regression models Y = a + bX + noise. Remember, the X is the same in all four datasets. Here is the output for the first dataset:

Regression Statistics
Multiple R	0.620844098
R Square	0.385447394
Adjusted R Square	0.317163771
Standard Error	1.236033081
Observations	11

	Coefficients	Standard Error	t Stat	P-value
Intercept	18.43	1.12422813	16.39347	5.2E-08
slope	0.28	0.11785113	2.375879	0.041507
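
For readers who want to see how the numbers in this output hang together, here is a minimal Python sketch that recomputes the derived quantities from the reported ones. It uses only the values printed above; the variable names are my own, and the adjusted R Square formula assumes the usual definition for a model with one predictor.

```python
from math import sqrt

# Values copied from the regression output above.
n = 11                      # Observations
r_square = 0.385447394      # R Square
intercept, se_intercept = 18.43, 1.12422813
slope, se_slope = 0.28, 0.11785113

# Each t statistic is the coefficient divided by its standard error.
t_intercept = intercept / se_intercept
t_slope = slope / se_slope

# With a single predictor, Multiple R is the square root of R Square.
multiple_r = sqrt(r_square)

# Adjusted R Square penalizes for the one predictor: n - 2 residual d.f.
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)

print(f"t (intercept)     = {t_intercept:.5f}")
print(f"t (slope)         = {t_slope:.6f}")
print(f"Multiple R        = {multiple_r:.9f}")
print(f"Adjusted R Square = {adj_r_square:.9f}")
```

The recomputed values should agree with the table up to rounding, which is a handy check whenever regression output has been copied by hand.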

Guess what? The other three regression outputs are identical!

So are the four Y variables identical???

Well, here is the answer:

To top it off, Basset et al. included one more dataset that has the exact same summary stats and regression estimates. Here is the scatterplot:

[You can find all the data for both Anscombe's and Basset et al.'s examples here]
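
If you would rather try this yourself without the linked file, the sketch below uses Anscombe's original (1973) quartet, which is small enough to type in. Note that this is not the Basset et al. data shown in this post: the values are on a different scale, and in Anscombe's original the fourth dataset has its own X values. The function name `summarize` and the dictionary layout are just illustrative choices.

```python
from math import sqrt
from statistics import mean, stdev

# Anscombe's (1973) original quartet, NOT the Basset et al. data from the post.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Mean and sample std of y, plus least-squares intercept, slope, and r."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_y": my,
        "std_y": stdev(y),
        "intercept": my - slope * mx,
        "slope": slope,
        "r": sxy / sqrt(sxx * syy),
    }

summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

The four printed rows agree to two decimals, yet a scatterplot of each (x, y) pair shows four very different patterns, which is exactly the point of the example.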

BzST | Business Analytics, Statistics, Teaching

Wednesday, June 11, 2008
