Wednesday, September 07, 2011

Multiple testing with large samples

Multiple testing (or multiple comparisons) arises when multiple hypotheses are tested using the same dataset via statistical inference. If each test has false alert level α, then the combined false alert rate of testing k hypotheses (also called the "overall type I error rate") can be as large as 1-(1-α)^k (exponential in the number of hypotheses k). This is a serious problem and ignoring it can lead to false discoveries. See an earlier post with links to examples.

There are various proposed corrections for multiple testing, the most basic principle being reducing the individual α's. However, the various corrections suffer in this way or the other from reducing statistical power (the probability of detecting a real effect). One important approach is to limit the number of hypotheses to be tested. All this is not new to statisticians and also to some circles of researchers in other areas (a 2008 technical report by the US department of education nicely summarizes the issue and proposes solutions for education research).

"Large-Scale" = many measurements
The multiple testing challenge has become especially prominent in analyzing micro-array genomic data, where datasets have measurements on many genes (k) for a few people (n). In this new area, inference is used more in an exploratory fashion, rather than confirmatory. The literature on "large-k-small-n" problems has also grown considerably since, including a recent book Large-Scale Inference by Bradley Efron.

And now I get to my (hopefully novel) point: empirical research in the social sciences is now moving to the era of "large n and same old k" datasets. This is what I call "large samples". With large datasets becoming more easily available, researchers test few hypotheses using tens and hundreds of thousands of observations (such as lots of online auctions on eBay or many books on Amazon). Yet, the focus has remained on confirmatory inference, where a set of hypotheses that are derived from a theoretical model are tested using data. What happens to multiple testing issues in this environment? My claim is that they are gone! Decrease α to your liking, and you will still have more statistical power than you can handle.

But wait, it's not so simple: With very large samples, the p-value challenge kicks in, such that we cannot use statistical significance to infer practically significant effects. Even if we decrease α to a tiny number, we'll still likely get lots of statistically-significant-but-practically-meaningless results.

The bottom line is that with large samples (large-n-same-old-k), the approach to analyzing data is totally different: no need to worry about multiple testing, which is so crucial in small samples. This is only one among many other differences between small-sample and large-sample data analysis.


No comments: