## Tuesday, June 10, 2008

### Summaries or graphs?

Herb Edelstein from Two Crows consulting introduced me to this neat example showing how graphs are much more revealing than summary statistics. This is an age-old example by Anscombe (1973). I will show a slightly updated version of Anscombe's example, by Basset et al. (1986):
We have four datasets, each containing 11 pairs of X and Y measurements. All four datasets share the same X variable and differ only in their Y values.

Here are the summary statistics for each of the four Y variables (A, B, C, D):
| | A | B | C | D |
|---|---|---|---|---|
| Average | 20.95 | 20.95 | 20.95 | 20.95 |
| Std | 1.495794 | 1.495794 | 1.495794 | 1.495794 |

That's right - the means and standard deviations are all identical. Now let's go one step further and fit the four simple linear regression models Y = a + bX + noise. Remember, the X is the same in all four datasets. Here is the output for the first dataset:
| Regression Statistics | |
|---|---|
| Multiple R | 0.620844098 |
| R Square | 0.385447394 |
| Adjusted R Square | 0.317163771 |
| Standard Error | 1.236033081 |
| Observations | 11 |

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 18.43 | 1.12422813 | 16.39347 | 5.2E-08 |
| slope | 0.28 | 0.11785113 | 2.375879 | 0.041507 |

Guess what? The other three regression outputs are identical!

So are the four Y variables identical???
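If you want to check this kind of claim yourself, here is a minimal Python sketch. Note the assumptions: it uses Anscombe's original 1973 quartet, not the Basset et al. data discussed above, so the numbers differ from the tables in this post (the Y means are about 7.5 rather than 20.95). Also, in Anscombe's original the fourth dataset has its own X values. The names `QUARTET`, `fit_line`, and `summarize` are made up for illustration.

```python
from statistics import mean, stdev

# Anscombe's original (1973) quartet -- an assumption for illustration,
# NOT the Basset et al. (1986) data used in the tables of this post.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    # In Anscombe's original, dataset IV has its own X values.
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, r_squared)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b, sxy ** 2 / (sxx * syy)


def summarize(x, y):
    """Collect the summary statistics and regression estimates for one dataset."""
    a, b, r2 = fit_line(x, y)
    return {"mean_y": mean(y), "sd_y": stdev(y),
            "intercept": a, "slope": b, "r2": r2}


for name, (x, y) in QUARTET.items():
    stats = summarize(x, y)
    print(name, {k: round(v, 2) for k, v in stats.items()})
```

The printed summaries agree to two decimal places across all four datasets, even though the Y lists themselves are clearly different. Only a plot of each dataset reveals how different they are.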

To top it off, Basset included one more dataset that has the exact same summary stats and regression estimates. Here is the scatterplot:

[You can find all the data for both Anscombe's and Basset et al.'s examples here]

### Resources for instructors of data mining courses in b-schools

With the increasing popularity of data mining courses being offered in business schools (at the MBA and undergraduate levels), a growing number of faculty are becoming involved. Instructors come from diverse backgrounds: statistics, information systems, machine learning, management science, marketing, and more.

Since our textbook Data Mining for Business Intelligence came out, I've received requests from many instructors to share materials, information, and other resources. At last, I have launched BZST Teaching, a forum for instructors teaching data mining in b-schools. The forum is open only to instructors, and it offers a host of materials and communication options. It is founded on the idea of sharing, with the expectation that members will contribute to the pool.