
Wednesday, September 07, 2011

Multiple testing with large samples

Multiple testing (or multiple comparisons) arises when multiple hypotheses are tested on the same dataset via statistical inference. If each test has false alert level α, then the combined false alert rate of testing k independent hypotheses (also called the "overall type I error rate") can be as large as 1-(1-α)^k, a quantity that approaches 1 quickly as the number of hypotheses k grows. This is a serious problem, and ignoring it can lead to false discoveries. See an earlier post with links to examples.
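To see how quickly this compounds, here is a minimal sketch in Python (the α of 5% and the values of k are chosen purely for illustration) that evaluates 1-(1-α)^k:

```python
# Overall false alert rate for k independent tests, each at level alpha:
# P(at least one false rejection) = 1 - (1 - alpha)^k
alpha = 0.05

for k in (1, 5, 10, 20, 50):
    overall_rate = 1 - (1 - alpha) ** k
    print(f"k = {k:2d} hypotheses -> overall false alert rate = {overall_rate:.2f}")
```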

There are various proposed corrections for multiple testing, the most basic principle being to reduce the individual α's. However, the various corrections suffer, in one way or another, from reduced statistical power (the probability of detecting a real effect). One important approach is to limit the number of hypotheses to be tested. None of this is new to statisticians, nor to some circles of researchers in other areas (a 2008 technical report by the US Department of Education nicely summarizes the issue and proposes solutions for education research).
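To illustrate the "reduce the individual α's" principle, here is a minimal Bonferroni-style sketch in Python; the p-values are hypothetical and used only for illustration.

```python
# Bonferroni correction: to keep the overall type I error at or below alpha,
# test each of the k hypotheses at the reduced level alpha / k.
alpha = 0.05
p_values = [0.001, 0.012, 0.030, 0.200, 0.450]  # hypothetical p-values

k = len(p_values)
threshold = alpha / k  # each test is now held to 0.01 instead of 0.05

for i, p in enumerate(p_values, start=1):
    decision = "reject" if p < threshold else "do not reject"
    print(f"Hypothesis {i}: p = {p:.3f} -> {decision} (threshold = {threshold:.3f})")
```

The price, of course, is power: effects whose p-values fall between 0.01 and 0.05 are no longer declared significant.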

"Large-Scale" = many measurements
The multiple testing challenge has become especially prominent in the analysis of microarray genomic data, where datasets contain measurements on many genes (k) for only a few people (n). In this new area, inference is used in a more exploratory fashion, rather than a confirmatory one. The literature on "large-k-small-n" problems has since grown considerably, including the recent book Large-Scale Inference by Bradley Efron.

And now I get to my (hopefully novel) point: empirical research in the social sciences is now moving into the era of "large n, same old k" datasets. This is what I call "large samples". With large datasets becoming more easily available, researchers test a few hypotheses using tens or hundreds of thousands of observations (such as lots of online auctions on eBay or many books on Amazon). Yet the focus has remained on confirmatory inference, where a set of hypotheses derived from a theoretical model is tested using data. What happens to multiple testing issues in this environment? My claim is that they are gone! Decrease α to your liking, and you will still have more statistical power than you can handle.
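As a rough illustration (a Python sketch with a made-up effect size, sample size, and α, not from any real study), power remains essentially 1 even at a very strict individual test level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n = 200_000        # "large n"
effect = 0.05      # a modest real effect (in standard-deviation units)
alpha = 0.0001     # a very strict individual significance level

n_sims = 200
rejections = 0
for _ in range(n_sims):
    sample = rng.normal(loc=effect, scale=1.0, size=n)
    t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
    rejections += (p_value < alpha)

print(f"Estimated power at alpha = {alpha}: {rejections / n_sims:.2f}")
```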

But wait, it's not so simple: With very large samples, the p-value challenge kicks in, such that we cannot use statistical significance to infer practically significant effects. Even if we decrease α to a tiny number, we'll still likely get lots of statistically-significant-but-practically-meaningless results.
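Here is a hedged sketch of that phenomenon (Python, with made-up numbers): two groups whose means differ by a practically negligible 2% of a standard deviation, yet the difference is overwhelmingly "significant" because n is huge.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n = 500_000  # observations per group
group_a = rng.normal(loc=0.00, scale=1.0, size=n)
group_b = rng.normal(loc=0.02, scale=1.0, size=n)  # a trivial difference of 0.02 SD

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"Observed mean difference: {group_b.mean() - group_a.mean():.4f}")
print(f"p-value: {p_value:.1e}")  # vanishingly small despite a meaningless effect
```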

The bottom line is that with large samples (large-n-same-old-k), the approach to analyzing data is totally different: there is no need to worry about multiple testing, which is so crucial in small samples. This is only one of many differences between small-sample and large-sample data analysis.


Wednesday, March 07, 2007

Multiple Testing

My colleague Ralph Russo often comes up with memorable examples for teaching complicated concepts. He recently sent me an Economist article called "Signs of the Times" that shows the absurd results that can be obtained if multiple testing is not taken into account.

Multiple testing arises when the same data are used simultaneously for testing many hypotheses. The problem is a huge inflation in the type I error (i.e., rejecting the null hypothesis in error). Even if each single hypothesis is tested at a low significance level (e.g., the infamous 5% level), the aggregate type I error becomes huge very fast. In fact, if we test k hypotheses that are independent of each other, each at significance level alpha, then the overall type I error is 1-(1-alpha)^k. That's right - it races toward 1 as k grows. For example, if we test 7 independent hypotheses at a 10% significance level, the overall type I error is 52%. In other words, even if none of these hypotheses are true, more than half of the time we will see at least one p-value below 10% and reject at least one hypothesis in error.
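A quick simulation sketch in Python (assuming independent tests of hypotheses that are all truly null, so each p-value is uniform on [0, 1]) reproduces the 52% figure:

```python
import numpy as np

rng = np.random.default_rng(2)

k = 7           # independent hypotheses, all of them truly null
alpha = 0.10    # individual significance level
n_sims = 100_000

# Under a true null hypothesis, a p-value is uniformly distributed on [0, 1].
p_values = rng.uniform(size=(n_sims, k))
share_with_false_rejection = (p_values < alpha).any(axis=1).mean()

print(f"Theoretical overall type I error: {1 - (1 - alpha) ** k:.2f}")
print(f"Simulated share of runs with at least one false rejection: {share_with_false_rejection:.2f}")
```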

In the Economist article, Dr. Austin tests a set of multiple absurd "medical" hypotheses (such as "people born under the astrological sign of Leo are 15% more likely to be admitted to hospital with gastric bleeding than those born under the other 11 signs"). He shows that some of these hypotheses are "supported by the data", if we ignore multiple testing.

There is a variety of solutions for multiple testing, some older (such as the classic Bonferroni correction) and some more recent (such as the False Discovery Rate). But most importantly, the issue should be recognized.
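For completeness, here is a short sketch of applying both corrections with the multipletests function from Python's statsmodels package; the p-values are invented for illustration.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from a batch of "astrological sign"-style hypotheses.
p_values = [0.002, 0.018, 0.030, 0.045, 0.080, 0.210, 0.500, 0.730]

for method, label in [("bonferroni", "Bonferroni"),
                      ("fdr_bh", "False Discovery Rate (Benjamini-Hochberg)")]:
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(label)
    for p, p_adj, r in zip(p_values, p_adjusted, reject):
        print(f"  p = {p:.3f} -> adjusted p = {p_adj:.3f}, reject: {bool(r)}")
```

Typically the FDR approach rejects more hypotheses than Bonferroni, trading a controlled proportion of false discoveries for higher power.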