We have four datasets, each containing 11 pairs of X and Y measurements. All four datasets share the same X variable and differ only in their Y values.

Here are the summary statistics for each of the four Y variables (A, B, C, D):

| | A | B | C | D |
|---|---|---|---|---|
| Average | 20.95 | 20.95 | 20.95 | 20.95 |
| Std | 1.495794 | 1.495794 | 1.495794 | 1.495794 |

That's right: the means and standard deviations are all identical. Now let's go one step further and fit the four simple linear regression models Y = a + bX + noise. Remember, the X is the same in all four datasets. Here is the output for the first dataset:

| Regression Statistics | |
|---|---|
| Multiple R | 0.620844098 |
| R Square | 0.385447394 |
| Adjusted R Square | 0.317163771 |
| Standard Error | 1.236033081 |
| Observations | 11 |

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 18.43 | 1.12422813 | 16.39347 | 5.2E-08 |
| Slope | 0.28 | 0.11785113 | 2.375879 | 0.041507 |
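As a quick sanity check, several of the numbers in the regression output above can be re-derived from one another using the standard formulas for simple linear regression. The sketch below does that with values copied from the tables; the variable names are my own, and it assumes that "Std" is the sample standard deviation of Y (dividing by n - 1).

```python
import math

# Values copied from the summary and regression tables above.
n = 11                      # observations
sd_y = 1.495794             # "Std" of Y, assumed to be the sample SD (n - 1)
multiple_r = 0.620844098    # correlation between X and Y
intercept, se_intercept = 18.43, 1.12422813
slope, se_slope = 0.28, 0.11785113

# R Square is simply the squared correlation.
r_square = multiple_r ** 2

# Adjusted R Square for a model with one predictor.
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)

# Residual standard error ("Standard Error" in the regression statistics).
residual_se = sd_y * math.sqrt((1 - r_square) * (n - 1) / (n - 2))

# Each t statistic is the coefficient divided by its standard error.
t_intercept = intercept / se_intercept
t_slope = slope / se_slope

print(f"R Square          {r_square:.9f}")
print(f"Adjusted R Square {adj_r_square:.9f}")
print(f"Standard Error    {residual_se:.6f}")
print(f"t (intercept)     {t_intercept:.5f}")
print(f"t (slope)         {t_slope:.6f}")
```

Each printed value should agree with the corresponding table entry up to rounding. The point is that these quantities depend on the data only through a handful of summary numbers, so any dataset that reproduces those summaries reproduces the whole output.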

Guess what? The other three regression outputs are identical!

So are the four Y variables identical???

Well, here is the answer:

To top it off, Basset et al. included one more dataset that has exactly the same summary statistics and regression estimates. Here is the scatterplot:

[You can find all the data for both Anscombe's and Basset et al.'s examples here]
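The data behind the tables above are not reproduced in this post, so here is a minimal sketch of the same check using Anscombe's original quartet as a stand-in. Note that Anscombe's numbers differ from those shown above (his Y means are about 7.5, not 20.95), and his fourth dataset uses a different X, unlike the four datasets here. The helper `fit_line` is my own illustrative function, not part of any library.

```python
from statistics import mean, stdev

# Anscombe's quartet (stand-in data; NOT the data behind the tables above).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


# Collect mean, sample SD, intercept and slope for each dataset.
summary = {}
for name, (x, y) in quartet.items():
    a, b = fit_line(x, y)
    summary[name] = (mean(y), stdev(y), a, b)
    print(f"{name:>3}: mean={mean(y):.2f}  sd={stdev(y):.2f}  "
          f"intercept={a:.2f}  slope={b:.2f}")
```

The four rows should come out essentially identical to two decimal places, even though the scatterplots look nothing alike. Swapping in the data linked above should reproduce the tables in this post in the same way.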

## 1 comment:

Slides 3-4 from a presentation of Gramener show another neat example (with data from India) of the same phenomenon:

http://blog.gramener.com/84/workshop-on-visual-analytics-slides
