Friday, April 28, 2006

p-values in LARGE datasets

We had an interesting discussion in our department today, the result of confining statisticians and non-statisticians in a maze-like building. Our colleague who called himself "non-stat-guru" sent a query to us "stat-gurus" (his labels) regarding p-values in a model that is estimated from a very large dataset.

The problem: a certain statistical model was fit to 120,000 observations (that's right, n=120K). And obviously, all p-values for all predictors turned out to be highly statistically significant.

Why does this happen and what does it mean?
When the number of observations is very large, standard errors of estimates become very small: a simple example is the standard error of the mean, which is equal to std/sqrt(n). Plug 1 million into that denominator! This means that the model has power to detect even minuscule changes.
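Here is a quick sketch of that shrinkage, using an IQ-like standard deviation of 15 (the numbers are purely illustrative):

```python
import math

# Standard error of the mean: std / sqrt(n).
# With std = 15, watch the standard error shrink as n grows.
std = 15
for n in (100, 10_000, 1_000_000):
    se = std / math.sqrt(n)
    print(f"n = {n:>9,}: standard error = {se:.4f}")
```

At n = 1,000,000 the standard error is a mere 0.015 IQ points, so even trivially small deviations from the null become "detectable".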

For instance, say we want to test whether the average population IQ is 100 (remember that IQ scores are actually calibrated so that the average is 100...). We take a sample of 1 million people, measure their IQ and compute the mean and standard deviation. The null hypothesis is

H0: population mean (mu) = 100
H1: mu NOT 100

The test statistic is: T = (sample mean - 100) / (sample std / sqrt(n))

With n=1,000,000, the sqrt(n) shrinks the denominator of the T statistic (the standard error), which inflates T and makes the test statistically significant for even a sample mean of 100.000000000001. But is such a difference practically significant??? Of course not.
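To see the inflation numerically, here is a small sketch with made-up sample means (again assuming std = 15):

```python
import math

# At n = 1,000,000, even a sample mean barely above 100 yields a huge
# T statistic, because sqrt(n) shrinks the standard error in the denominator.
n, std = 1_000_000, 15
for sample_mean in (100.5, 100.1, 100.05):
    t = (sample_mean - 100) / (std / math.sqrt(n))
    print(f"sample mean {sample_mean}: T = {t:.1f}")
```

Even a half-point difference, which nobody would call meaningful on an IQ scale, produces a T statistic far beyond any conventional cutoff.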

The problem, in short, is that in large datasets statistical significance is likely to diverge from practical significance.

What can be done?

1. Assess the magnitude of the coefficients themselves and what their interpretation is. Their practical significance might be low. For example, in a model for cigarette box demand in a neighborhood grocery store, such as demand = a + b price, we might find a coefficient of b=0.000001 to be statistically significant (if we have enough observations). But what does it mean? An increase of $1 in price is associated with an average increase of 0.000001 in the number of cigarette boxes sold. Is this relevant?

2. Take a random sample and perform the analysis on that. You can use the remaining data to test the robustness of the model.
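The subsample-plus-robustness idea can be sketched as follows. This is a toy example on simulated data (the dataset, sample size of 5,000, and simple least-squares slope are all illustrative assumptions, not a prescription):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical large dataset of (x, y) pairs with true slope 2.
n = 120_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=5.0, size=n)

# Fit on a modest random sample; keep the rest aside.
idx = rng.permutation(n)
sample, rest = idx[:5_000], idx[5_000:]

def slope(xs, ys):
    # Least-squares slope: cov(x, y) / var(x).
    return np.cov(xs, ys)[0, 1] / np.var(xs, ddof=1)

print("sample slope:", round(slope(x[sample], y[sample]), 2))

# Robustness check: refit on several disjoint chunks of the holdout
# and verify they tell the same story.
for chunk in np.array_split(rest, 5):
    print("holdout slope:", round(slope(x[chunk], y[chunk]), 2))
```

If the estimates from the holdout chunks wander far from the sampled fit, the model is less stable than its p-values suggest.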

Next time before driving your car, make sure that your windshield was not replaced with a magnifying glass (unless you want to detect every ant on the road).


Anonymous said...

Very nice article! The general idea of "practical relevance" versus "statistical significance" actually works the other way round as well: there can be influences that are not significant in a statistical sense because the variance is too high, but that are practically relevant. I had such a case in a simulated purification-plant experiment once. A look at some boxplots revealed a visible difference in the measures grouped by treatment vs. non-treatment, but in the regression model it was not significant at the 10% level.
So it is really time to have some more ideas on this topic!

Galit Shmueli said...

Thanks for the terrific response and example. Indeed the "statistical vs. practical significance" is an age-old issue. I think that your angle used to be the more popular one, and at least then you could recommend the collection of additional data to increase statistical power. These days we have TOO MUCH data. And then do we recommend getting rid of some data? (:

Corey Angst said...

Thanks for posting my question. Your explanation was very practical and also very informative - even for us non-stats-gurus. I also like your follow-up question. So do I nix 90k cases in this wonderful 120k dataset? If so, what is the magic number... i.e., do I keep culling until I finally reach non-significance? Just for argument's sake, I did run only 30k cases (random sample) and I still get lots of significance, but I actually show values that move away from p=.000.
Great discussion!

Galit Shmueli said...

Thanks Corey. The "magic" number can also be found by deciding what a practically significant effect is. Then you can compute the sample size that would detect differences of (at least) that magnitude.

This "power computation" gives you the minimal sample size that is needed to detect an effect of the given magnitude. It will give you an idea of the magnitude of n that gives "reasonable" p-values.

In the IQ example from my original post, if I wanted to detect changes from 100 that are larger than, say, 1 IQ point (rather than 0.000001 IQ points) with 95% confidence, then we want

1 / (std/sqrt(n)) > 2

because 2 gives approx 95% confidence. If the estimated std is 15 (it's IQ after all), then we'd need n > 900.
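The same back-of-the-envelope computation in Python, rearranging the inequality to solve for n (delta is the smallest shift we care to detect):

```python
import math

# To detect a shift of `delta` IQ points with a z-statistic of about 2
# (~95% confidence):
#   delta / (std / sqrt(n)) > 2   =>   n > (2 * std / delta) ** 2
std = 15
for delta in (1.0, 0.5, 0.1):
    n_min = (2 * std / delta) ** 2
    print(f"delta = {delta}: need n > {n_min:.0f}")
```

With delta = 1 IQ point this reproduces n > 900; halving the detectable shift quadruples the required sample size.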

Of course when you move to more complicated models it is harder to do this type of computation. There are some rules of thumb like n > 10 times the number of predictors in linear regression. Just from gut feeling, it sounds like 30,000 observations is still a ton for a standard model.

Finally, it's not that you are throwing away all the rest of the data. You can use it, like Wolfgang mentioned, to check robustness: run the model on several subsets of the data and see that you get the same story.