Wednesday, June 11, 2008

Summaries or graphs?

Herb Edelstein from Two Crows consulting introduced me to this neat example showing how graphs are much more revealing than summary statistics. This is an age-old example by Anscombe (1973). I will show a slightly updated version of Anscombe's example, by Basset et al. (1986):
We have four datasets, each containing 11 pairs of X and Y measurements. All four datasets have the same X variable, and differ only in the Y values.

Here are the summary statistics for each of the four Y variables (A, B, C, D):

            A          B          C          D
Average     20.95      20.95      20.95      20.95
Std         1.495794   1.495794   1.495794   1.495794


That's right - the means and standard deviations are all identical. Now let's go one step further and fit the four simple linear regression models Y = a + bX + noise. Remember, the X is the same in all four datasets. Here is the output for the first dataset:
Regression Statistics
Multiple R           0.620844098
R Square             0.385447394
Adjusted R Square    0.317163771
Standard Error       1.236033081
Observations         11

             Coefficients   Standard Error   t Stat      P-value
Intercept    18.43          1.12422813       16.39347    5.2E-08
Slope        0.28           0.11785113       2.375879    0.041507

Guess what? The other three regression outputs are identical!

So are the four Y variables identical???


Well, here is the answer:


To top it off, Basset included one more dataset that has the exact same summary stats and regression estimates. Here is the scatterplot:


[You can find all the data for both Anscombe's and Basset et al.'s examples here]
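
If you want to check this kind of claim yourself, here is a minimal Python sketch. Note the assumptions: it does not use the Basset et al. values behind the tables above (those are in the linked data). Instead it uses the first three of Anscombe's original 1973 datasets, which share one X column. The function name `summarize` is just my choice for illustration.

```python
import math

# Anscombe (1973), datasets I-III, which share the same X values.
# These are NOT the Basset et al. values summarized in the post above.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ys = {
    "I":   [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II":  [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
}

def summarize(x, y):
    """Mean and sample std of y, plus the least-squares fit of y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean": mean_y,
        "std": math.sqrt(syy / (n - 1)),        # sample standard deviation
        "intercept": mean_y - slope * mean_x,
        "slope": slope,
        "r_square": sxy ** 2 / (sxx * syy),
    }

# Print the rounded summaries side by side for comparison.
for name, y in ys.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The rounded summaries agree across the three datasets even though the Y columns are clearly different lists of numbers. Only a scatterplot of each dataset exposes how different they are.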

Tuesday, June 10, 2008

Resources for instructors of data mining courses in b-schools

With the increasing popularity of data mining courses being offered in business schools (at the MBA and undergraduate levels), a growing number of faculty are becoming involved. Instructors come from diverse backgrounds: statistics, information systems, machine learning, management science, marketing, and more.

Since our textbook Data Mining for Business Intelligence came out, I've received requests from many instructors to share materials, information, and other resources. At last, I have launched BZST Teaching, a forum for instructors teaching data mining in b-schools. The forum is open only to instructors, and a host of materials and communication options are available. It is founded on the idea of sharing, with the expectation that members will contribute to the pool.

If you are interested in joining, please email me directly.

Friday, June 06, 2008

Student network launched!

I recently launched a forum for the growing population of MBA students who are veterans of our data mining course. The goal of the forum is to foster networking and job-related communications (not too many data mining-savvy MBAs out there!), to share interesting data analytic stories, and to keep in touch (where are you today?).

Sunday, June 01, 2008

Weighted nearest-neighbors

K-nearest neighbors (k-NN) is a simple yet often powerful classification / prediction method. The basic idea, for predicting a new observation, is to find the k most similar observations in terms of the predictor (X) values, and then let those k neighbors vote to determine the predicted class membership (or take the average of their Y values to predict a numerical outcome). Since this is such an intuitive method, I thought it would be useful to discuss two improvements that have been suggested by data miners. Both use weighting, but in different ways.

One intuitive improvement is to weight the neighbors by their proximity to the observation of interest. In other words, rather than giving each neighbor equal importance in the vote (or average), closer neighbors have a higher impact on the prediction.
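
Here is a minimal sketch of that idea in Python. The weighting scheme is an assumption on my part: each neighbor's vote counts as 1 / distance, which is one common choice among several. The function name `knn_classify`, the `weighted` flag, and the toy data are hypothetical and only for illustration.

```python
import math
from collections import defaultdict

def knn_classify(train_X, train_y, new_x, k=3, weighted=True, eps=1e-9):
    """Classify new_x by a vote among its k nearest training observations.

    With weighted=True each neighbor's vote is 1 / (distance + eps),
    an assumed inverse-distance scheme; otherwise every vote counts as 1.
    """
    neighbors = sorted(
        ((math.dist(row, new_x), label) for row, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    votes = defaultdict(float)
    for distance, label in neighbors:
        votes[label] += 1.0 / (distance + eps) if weighted else 1.0
    return max(votes, key=votes.get)

# Toy data: one "a" observation very close to the new point, two "b" far away.
train_X = [[1.0, 1.0], [4.0, 4.0], [4.5, 4.0]]
train_y = ["a", "b", "b"]
new_x = [1.5, 1.5]

print(knn_classify(train_X, train_y, new_x, k=3, weighted=False))
print(knn_classify(train_X, train_y, new_x, k=3, weighted=True))
```

With equal votes the two distant "b" neighbors outvote the single nearby "a". With inverse-distance weights the close neighbor dominates and the prediction flips to "a".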

A second way to use weighting to improve the predictive performance of k-NN is related to predictors: In ordinary k-NN, predictors are typically brought to the same scale by normalization, and then treated equally for the purpose of determining proximities of observations. An improvement is therefore to weight the predictors according to their predictive power, such that higher importance is given to more informative predictors. The question is how to assign the weights, or in other words, how to assign predictive power scores to the different predictors. There are a variety of papers out there suggesting different methods. The main approach is to use a different classification/prediction method that yields predictor importance measures (e.g., logistic regression), and then to use those measures in constructing the predictor weights within k-NN.
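
To make the mechanics concrete, here is a minimal Python sketch of k-NN with predictor weights built into the distance. It assumes the weights are already in hand; the values below are made up for illustration and are not derived from a logistic regression or any other model. The names `weighted_distance` and `knn_predict` and the toy data are hypothetical, and the predictors are assumed to be normalized already.

```python
import math
from collections import Counter

def weighted_distance(a, b, weights):
    """Euclidean distance in which predictor j contributes weights[j] * (a_j - b_j)^2."""
    return math.sqrt(sum(w * (ai - bi) ** 2 for w, ai, bi in zip(weights, a, b)))

def knn_predict(train_X, train_y, new_x, weights, k=3):
    """Majority vote among the k nearest neighbors under the weighted distance."""
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: weighted_distance(pair[0], new_x, weights),
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Toy data: the first predictor separates the classes, the second is mostly noise.
train_X = [[0.9, 0.9], [0.8, 0.1], [0.85, 0.95],
           [0.3, 0.5], [0.35, 0.45], [0.2, 0.55]]
train_y = ["yes", "yes", "yes", "no", "no", "no"]
new_x = [0.7, 0.5]

print(knn_predict(train_X, train_y, new_x, weights=[1.0, 1.0]))  # equal weights
print(knn_predict(train_X, train_y, new_x, weights=[1.0, 0.1]))  # assumed importance scores
```

With equal weights the noisy second predictor pulls the new observation toward the "no" group. Down-weighting it lets the informative first predictor determine the neighbors, and the prediction becomes "yes".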