Friday, April 28, 2006

p-values in LARGE datasets

We had an interesting discussion in our department today, the result of confining statisticians and non-statisticians in a maze-like building. Our colleague who calls himself "non-stat-guru" sent a query to us "stat-gurus" (his labels) regarding p-values in a model estimated from a very large dataset.

The problem: a certain statistical model was fit to 120,000 observations (that's right, n=120K), and all p-values for all predictors turned out to be highly statistically significant.

Why does this happen and what does it mean?
When the number of observations is very large, standard errors of estimates become very small. A simple example is the standard error of the mean, which equals std/sqrt(n): plug 1 million into that denominator! This means that the model has the power to detect even minuscule effects.

For instance, say we want to test whether the average population IQ is 100 (remember that IQ scores are actually calibrated so that the average is 100...). We take a sample of 1 million people, measure their IQ and compute the mean and standard deviation. The null hypothesis is

H0: population mean (mu) = 100
H1: mu NOT 100

The test statistic is: T = (sample mean - 100) / (sample std / sqrt(n))

With n=1,000,000, the denominator of the T statistic shrinks, so the statistic will be statistically significant even for a sample mean of 100.000000000001. But is such a difference practically significant? Of course not.
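As a quick sanity check, here is that computation in a few lines of Python (the sample mean and standard deviation are made up for illustration):

```python
import math

# Toy illustration: with n = 1,000,000, even a sample mean of 100.05 --
# practically identical to 100 -- produces a "significant" test statistic.
n = 1_000_000
sample_mean = 100.05
sample_std = 15.0  # IQ scores are scaled to have a std of about 15

t = (sample_mean - 100) / (sample_std / math.sqrt(n))
print(t)  # about 3.33, well beyond the usual 1.96 cutoff
```

A 0.05-point deviation in IQ is meaningless in practice, yet the test shouts "significant".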

The problem, in short, is that in large datasets statistical significance is likely to diverge from practical significance.

What can be done?

1. Assess the magnitude of the coefficients themselves and their interpretation; their practical significance might be low. For example, in a model for cigarette box demand in a neighborhood grocery store, such as demand = a + b*price, we might find a coefficient of b=0.000001 to be statistically significant (given enough observations). But what does it mean? An increase of $1 in price is associated with an average increase of 0.000001 in the number of cigarette boxes sold. Is this relevant?

2. Take a random sample and perform the analysis on that. You can use the remaining data to test the robustness of the model.
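To see point 1 in action, here is a small simulated sketch (all the numbers are made up): a practically negligible price coefficient comes out wildly "significant" once n is in the hundred-thousands.

```python
import numpy as np

# Simulated sketch: a slope of 0.002 boxes per $1 -- practically
# negligible -- is highly "significant" when n = 120,000.
rng = np.random.default_rng(0)
n = 120_000
price = rng.uniform(2.0, 6.0, n)
demand = 50 + 0.002 * price + rng.normal(0, 0.1, n)

# OLS slope and its standard error, computed directly
x = price - price.mean()
y = demand - demand.mean()
b = (x @ y) / (x @ x)
resid = y - b * x
se = np.sqrt((resid @ resid) / ((n - 2) * (x @ x)))
t_stat = b / se
print(b, t_stat)  # tiny slope, huge t-statistic
```

Refitting the same model on a random sample of, say, 1,000 of these observations (point 2) would leave the slope estimate just as tiny but strip away the artificial "significance".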

Next time before driving your car, make sure that your windshield was not replaced with a magnifying glass (unless you want to detect every ant on the road).

Wednesday, April 26, 2006

Symposium on Statistical Challenges in eCommerce Research

The second Symposium on Statistical Challenges in eCommerce will take place at the Carlson School of Management, University of Minnesota, May 22-23. For further details see the symposium website.

This symposium follows up on the inaugural event held at the R. H. Smith School of Management of the University of Maryland last year, which brought together almost 100 researchers from the fields of information systems, statistics, data mining, marketing, and more. It was a stimulating event with lots of energy.

Last year's event led to collaborations, discussions, and a special issue of the high-impact journal Statistical Science, which should be out in May or August.

There is still a little time left to submit an abstract (work in progress is welcome!)

Tuesday, April 18, 2006

Interactive visualization of data

Two interesting articles describe how interactive visualization tools can be used for deriving insight from large business datasets. Both describe tools developed by the Human-Computer Interaction Lab at the University of Maryland. For James Bond fans, this place reminds me of Q Branch -- they come up with amazingly cool visualization tools that save the day.

The first article, "Describing Business Intelligence Using Treemap Visualizations" by Ben Shneiderman, describes Treemap, a tool for visualizing hierarchical data. Know the "Map of the Market"? Guess where that came from!

The second article, "The Surest Path to Visual Discovery" by Stephen Few, describes Timesearcher, an interactive visualization tool for time series data, such as stock prices. In full disclosure, I've been involved in the development of the current version, adapting it to eCommerce-type data and in particular auction data such as those from eBay. Timesearcher2 is capable of displaying tightly coupled time series and cross-sectional data (ever seen anything like it?). That means each auction consists of a time series describing the bid history, but also cross-sectional data such as the seller's ID and rating, the item category, etc. For more details on the auction implementation, check out our paper Exploring Auction Databases Through Interactive Visualization.

You can't escape Bayes...

My students are currently studying for a quiz on classification. One of the classifiers that we talked about is the Naive Bayes classifier. On Saturday evening I received a terrific email from Jason Madhosingh, one of my students. He writes:

So, I'm taking a break from studying this evening by watching "numb3rs", a CBS crime drama where a mathematician uses math to help solve crimes (go figure). Of course... he brings us Bayes theorem. There truly is no escape!

And as a visualization junkie, I was also thrilled about his last comment "They also did a 3D scatterplot"!

I guess I'll have to check out this series - does anyone have it taped? From the CBS website it looks like I can even get early previews as a teacher...

Teachers can opt in to order a Teaching Kit including a specially designed classroom poster, and can view the new classroom activities coordinated with each show episode on this website, a week prior to the show.

Monday, April 10, 2006

Patenting predictive models?

A curious sentence in a short BusinessWeek report sent me hunting for clues. In Rep of a (Drug) Salesman, a consulting firm by the name of TargetRx "claims it can identify what really makes a sales rep effective".

From a press release on TargetRx's website I found the following:

Data collected from physicians via survey are then merged with actual prescribing and other behavioral data and analyzed using proprietary analytic methods to develop predictive models of physician prescribing behavior. The proprietary analytics are based in part on TargetRx's patent-pending Method and System for Analyzing the Effectiveness of Marketing Strategies. TargetRx received notice of allowance on this patent application from the U.S. Patent and Trademark Office in January 2006. The unique method of collecting and analyzing data enables TargetRx to predict prescribing changes as well as decompose prescribing to understand what specific aspects of the promotion, product, or physicians' interactions with patients and payors are causing changes.

I was not able to find any further details. The question is: what here is proprietary? What does "unique method of collecting and analyzing" mean? Conducting surveys is not new, so that's not it. Linking the two data sources (survey data and prescribing data) doesn't sound too hard, if there is a unique identifier for each doctor. Has TargetRx developed a new predictive method?

Apropos sales reps and prescribing doctors, an interesting paper1 by Rubin and Waterman, professors of statistics at Harvard and UPenn, discusses the use of "propensity scores" for evaluating causal effects such as marketing interventions. Unlike regression models, which cannot prove causality, propensity scores compare matched observations, where matching is based on the multivariate profile of each observation. The application in the paper is a model that ranks the doctors most likely to increase their prescribing after a sales rep visit. An ordinary regression model that compares the number of prescriptions of doctors who were visited by sales reps with those who were not cannot account for phenomena such as: sales reps prefer visiting high-prescribing doctors, because rep compensation is based on the number of prescriptions!

1 "Estimating the Causal Effects of Marketing Interventions Using Propensity Score Methodology", D B Rubin and R P Waterman (2006), Statistical Science, special issue on "Statistical Challenges and Opportunities in eCommerce", forthcoming.
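The confounding problem and the propensity-score remedy can be sketched in a small simulation. This is a hedged, made-up illustration of the general idea, not Rubin and Waterman's actual procedure: reps target high prescribers, so a naive visited-vs-not comparison grossly overstates the effect of a visit, while matching on an estimated propensity score gets much closer to the true lift.

```python
import numpy as np

# Simulated data: reps prefer high prescribers, so visits are confounded
# with the outcome. The true causal lift of a visit is 5 prescriptions.
rng = np.random.default_rng(0)
n = 2_000
baseline = rng.normal(100, 20, n)                   # pre-visit prescribing level
p_visit = 1 / (1 + np.exp(-(baseline - 100) / 10))  # reps target high prescribers
visited = (rng.random(n) < p_visit).astype(float)
outcome = baseline + 5 * visited + rng.normal(0, 5, n)

# Naive comparison: badly biased upward by the confounding
naive = outcome[visited == 1].mean() - outcome[visited == 0].mean()

# Estimate each doctor's propensity to be visited with a bare-bones
# logistic fit (Newton iterations on intercept + slope)
X = np.column_stack([np.ones(n), baseline])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (visited - p))
ps = 1 / (1 + np.exp(-X @ beta))

# Match each visited doctor to the unvisited doctor with the nearest score
treated = np.where(visited == 1)[0]
controls = np.where(visited == 0)[0]
matches = controls[np.abs(ps[treated][:, None] - ps[controls][None, :]).argmin(axis=1)]
matched = (outcome[treated] - outcome[matches]).mean()

print(round(naive, 1), round(matched, 1))  # naive estimate far exceeds matched one
```

With one covariate the propensity score is redundant, of course; its value in the paper's setting is collapsing a multivariate doctor profile into a single matchable number.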

Data mining and privacy

BusinessWeek touched upon a sensitive issue in the article If You're Cheating on Your Taxes. It's about federal and state agencies using data mining to find "the bad guys".

Although "data mining" is the term used in many of these stories, a more careful look reveals that hardly any advanced statistical/DM methods are involved. The issue is the linkage/matching of different data sources. At the Statistical Challenges & Opportunities in eCommerce symposium last year, Stephen Fienberg, a professor of statistics at CMU and an expert on disclosure limitation, showed a semi-futuristic movie about a pizza parlor using an array of linked datasets to "customize" a delivery call (from the American Civil Liberties Union website). He also wrote a paper on privacy and data mining1 that will come out soon in a special issue of the journal Statistical Science on the same topic (OK, I'll disclose that I co-edited it with Wolfgang Jank).

Another interesting document can be found on the American Statistical Association's website: FAQ Regarding the Privacy Implications of Data Mining.

The bottom line is that statistical/data mining methods and tools are not the evil here. In fact, in some cases statistical methods allow the exact opposite: disclosing data in a way that allows inference but conceals any information that might breach privacy. This area is called disclosure limitation, and it is studied mainly by statisticians, operations researchers, and computer scientists.

1 "Privacy and Confidentiality in an E-Commerce World: Data Mining, Data Warehousing, Matching, and Disclosure Limitation," S E Fienberg (2006), Statistical Science, special issue on "Statistical Challenges and Opportunities in eCommerce", forthcoming.

More on predictive vs. explanatory models

This week the predictive vs. explanatory modeling distinction came up on multiple occasions: first, in a study with an information systems colleague, where the goal is to build a predictive application for ranking the auctions most likely to transact; then, in an example I gave in class of modeling eBay data to distinguish competitive from non-competitive auctions; and then, in a bunch of conversations with students that followed.

The point that I want to make here, which I did not mention directly in my previous post on this subject, is that the set of PREDICTORS your model includes can be very different if the goal is explanatory vs. predictive. Here's the eBay example: we have data on a set of auctions from eBay (publicly available). For each auction there is information on the product's features (e.g., category, new/used), the seller's features (e.g., rating), and the auction's features (e.g., duration, opening price, closing price).

Explanatory goal: To determine factors that lead auctions to be competitive (i.e., receive more than 1 bid).

Predictive goal: To build a seller-side application that will predict the chances that his/her auction will be competitive.

In the explanatory task, we are likely to include the closing price, hypothesizing that (perhaps) lower-priced items are more likely to be competitive. However, in the predictive model we cannot include the closing price, because it is not known at the start of the auction! In other words, we are constrained to information that is available at the time of prediction.
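The contrast between the two predictor sets can be made concrete in a few lines (the variable names here are illustrative, not from an actual dataset):

```python
# Hypothetical predictor lists for the eBay auction example: the predictive
# model may only use what is known at auction start, so closing price is out.
explanatory_predictors = [
    "category", "new_or_used", "seller_rating",
    "duration", "opening_price",
    "closing_price",  # known only after the auction ends
]

available_at_auction_start = {
    "category", "new_or_used", "seller_rating", "duration", "opening_price",
}

predictive_predictors = [p for p in explanatory_predictors
                         if p in available_at_auction_start]
print(predictive_predictors)  # closing_price is filtered out
```

The same filtering logic applies to any predictive application: before fitting, ask of every candidate predictor, "will I actually have this value at prediction time?"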

Until now I have not found a focused published discussion of predictive vs. explanatory modeling. Statistics books tend to focus on explanatory models, whereas machine-learning sources focus on predictive modeling. Has anyone seen such a discussion?