Tuesday, June 10, 2008

Summaries or graphs?

Herb Edelstein from Two Crows consulting introduced me to this neat example showing how graphs are much more revealing than summary statistics. This is an age-old example by Anscombe (1973). I will show a slightly updated version of Anscombe's example, by Basset et al. (1986):
We have four datasets, each containing 11 pairs of X and Y measurements. All four datasets share the same X variable and differ only in the Y values.

Here are the summary statistics for each of the four Y variables (A, B, C, D):

          A         B         C         D
Average   20.95     20.95     20.95     20.95
Std       1.495794  1.495794  1.495794  1.495794


That's right - the means and standard deviations are all identical. Now let's go one step further and fit the four simple linear regression models Y = a + bX + noise. Remember, the X is the same in all four datasets. Here is the output for the first dataset:
Regression Statistics
Multiple R          0.620844098
R Square            0.385447394
Adjusted R Square   0.317163771
Standard Error      1.236033081
Observations        11

            Coefficients  Standard Error  t Stat    P-value
Intercept   18.43         1.12422813      16.39347  5.2E-08
slope       0.28          0.11785113      2.375879  0.041507
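If you want to see how the numbers in this output hang together, here is a small Python sketch that rebuilds several of them from the others. It uses only values printed above (the Y standard deviation from the summary table, Multiple R, and the coefficient estimates with their standard errors); the variable names are my own, and the formulas are the standard ones for simple linear regression with one predictor.

```python
import math

# Values copied from the summary table and the regression output above.
n = 11                      # Observations
multiple_r = 0.620844098    # Multiple R
y_std = 1.495794            # Std of Y from the summary table
coefficients = {            # name: (estimate, standard error)
    "Intercept": (18.43, 1.12422813),
    "slope": (0.28, 0.11785113),
}

# R Square is the square of Multiple R.
r_square = multiple_r ** 2

# Adjusted R Square penalizes for the single predictor: (n - 1) / (n - 2).
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)

# Standard Error of the regression: sqrt(SSE / (n - 2)),
# where SSE = (1 - R^2) * SST and SST = (n - 1) * y_std^2.
std_error = y_std * math.sqrt((1 - r_square) * (n - 1) / (n - 2))

# Each t Stat is the estimate divided by its standard error.
t_stats = {name: est / se for name, (est, se) in coefficients.items()}

print(f"R Square          {r_square:.6f}")
print(f"Adjusted R Square {adj_r_square:.6f}")
print(f"Standard Error    {std_error:.6f}")
for name, t in t_stats.items():
    print(f"t Stat ({name}) {t:.5f}")
```

Notice that everything here is computed from a handful of totals (n, the spread of Y, and the correlation), so any dataset that matches on those totals will produce exactly the same output.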

Guess what? The other three regression outputs are identical!
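To try this yourself, here is a minimal Python sketch. Note the assumption: the Basset et al. data are not reproduced in this post, so the sketch uses Anscombe's original 1973 quartet instead, in which the first three datasets share one X and the fourth has its own X. The function name summarize and the dictionary layout are mine, for illustration only.

```python
from statistics import mean, stdev

# Anscombe's original quartet (1973), NOT the Basset et al. version above.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return Y's mean and sample std plus the least-squares fit Y = a + bX."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean": y_bar,
        "std": stdev(y),
        "intercept": y_bar - slope * x_bar,
        "slope": slope,
        "r_squared": sxy ** 2 / (sxx * syy),
    }


# Print one line of summaries per dataset, rounded to two decimals.
for name, (x, y) in QUARTET.items():
    s = summarize(x, y)
    print(name, {key: round(value, 2) for key, value in s.items()})
```

Rounded to two decimals, the four lines of output should agree with one another, even though the four sets of Y values are clearly different.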

So are the four Y variables identical???


Well, here is the answer:


To top it off, Basset included one more dataset that has the exact same summary stats and regression estimates. Here is the scatterplot:


[You can find all the data for both Anscombe's and Basset et al.'s examples here]

Resources for instructors of data mining courses in b-schools

With the increasing popularity of data mining courses being offered in business schools (at the MBA and undergraduate levels), a growing number of faculty are becoming involved. Instructors come from diverse backgrounds: statistics, information systems, machine learning, management science, marketing, and more.

Since our textbook Data Mining for Business Intelligence came out, I've received requests from many instructors to share materials, information, and other resources. At last, I have launched BZST Teaching, a forum for instructors teaching data mining in b-schools. The forum is open only to instructors, and a host of materials and communication options are available. It is founded on the idea of sharing, with the expectation that members will contribute to the pool.

If you are interested in joining, please email me directly.

Friday, June 06, 2008

Student network launched!

I recently launched a forum for the growing population of MBA students who are veterans of our data mining course. The goals of the forum are to foster networking and job-related communication (not too many data mining-savvy MBAs out there!), to share interesting data analytic stories, and to keep in touch (where are you today?).