
Monday, November 06, 2017

Statistical test for "no difference"

To most researchers and practitioners using statistical inference, the popular hypothesis testing universe consists of two hypotheses:
  • H0 is the null hypothesis of "zero effect"
  • H1 is the alternative hypothesis of "a non-zero effect"

The alternative hypothesis (H1) is typically what the researcher is trying to find: a different outcome for the treatment and control groups in an experiment, a regression coefficient that is non-zero, etc. Recently, several colleagues have independently asked me if there's a statistical way to show that an effect is zero, or that there's no difference between groups. Can we simply use the above setup? The answer is no. Can we simply reverse the hypotheses? Uh-uh, because the "equal" must be in H0.

Minitab has equivalence testing (from http://blog.minitab.com/blog/statistics-and-quality-data-analysis/equivalence-testing-for-quality-analysis-part-i-what-are-you-trying-to-prove)
Here's why: in the classic setup the hypotheses are stated about the population of interest, and we take a sample from that population to test them. In this non-symmetrical setup, H0 is assumed to be true unless the sample provides sufficient evidence otherwise. Hypothesis testing has its roots in Karl Popper's falsifiability principle, where a claim about the existence of an effect cannot be made unless a situation of no effect is first shown to be untenable. This is similar to a democratic justice system, where the defendant is presumed not guilty unless proven guilty. The burden of proof lies on the researcher/data. That's why we either reject H0 (when there is sufficient evidence against it) or fail to reject H0 (when there isn't). This setup is not designed to arrive at the conclusion that H0 is true.

In a 2013 Letter to the Editor of the Journal of Sports Sciences, titled Testing the null hypothesis: the forgotten legacy of Karl Popper?, Mick Wilkinson suggests that this setup is the opposite of what a researcher should be doing according to the scientific method, and in fact "Our work should remain driven by conjecture and attempted falsification such that it is always the null hypothesis that is tested. The write up of our studies should make it clear that we are indeed testing the null hypothesis and conforming to the established and accepted philosophical conventions of the scientific method." He therefore suggests the following sequence:

  1. null-hypothesis tests are carried out to first establish that a population effect is in fact unlikely to be zero
  2. a confidence-interval-based approach estimates what the magnitude of the effect might plausibly be
  3. the probability that the population effect exceeds an a priori smallest meaningful effect is calculated

While this provides a relevant criticism of the hypothesis testing paradigm, it does not directly provide a test of equivalence! The good news is that such a test exists: equivalence testing is a standard scenario in pharmacokinetics, arising, for example, when a pharmaceutical company wants to show that its generic drug is equivalent to the brand-name drug. This is termed bioequivalence. In other words, H1 is "the drugs are equivalent". The approach used there is the following:

  1. set an equivalence bound that defines the smallest clinically-meaningful effect size of interest
  2. calculate a confidence interval around the observed effect size (say, the difference between the mean outcomes of the generic and brand drugs)
  3. if the confidence interval falls entirely within the equivalence bounds, the groups are declared equivalent; otherwise equivalence cannot be claimed (see the sketch below)
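
Here is a minimal sketch of this confidence-interval approach in Python. The function name, the simulated data, and the 0.5-unit margin are my own illustrative choices, not taken from any particular package:

```python
import numpy as np
from scipy import stats

def equivalence_ci(x1, x2, margin, alpha=0.05):
    """Declare equivalence if the (1 - 2*alpha) confidence interval for the
    difference in means lies entirely within [-margin, +margin]."""
    diff = np.mean(x1) - np.mean(x2)
    v1, v2 = np.var(x1, ddof=1) / len(x1), np.var(x2, ddof=1) / len(x2)
    se = np.sqrt(v1 + v2)
    df = se**4 / (v1**2 / (len(x1) - 1) + v2**2 / (len(x2) - 1))  # Welch df
    tcrit = stats.t.ppf(1 - alpha, df)          # 90% CI when alpha = 0.05
    lo, hi = diff - tcrit * se, diff + tcrit * se
    return lo, hi, (-margin < lo) and (hi < margin)

# Simulated outcomes for a generic vs. a brand drug, margin of 0.5 units
rng = np.random.default_rng(1)
generic = rng.normal(10.1, 2, 200)
brand = rng.normal(10.0, 2, 200)
print(equivalence_ci(generic, brand, margin=0.5))
```
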
The Wikipedia article on Equivalence Test points out two additional interesting uses of equivalence testing:
  • Avoiding misinterpretation of large p-values in ordinary testing as evidence for H0: "Equivalence tests can be performed in addition to null-hypothesis significance tests. This might prevent common misinterpretations of p-values larger than the alpha level as support for the absence of a true effect."
  • The confidence interval used in equivalence testing can help distinguish between statistical significance and practical/clinical significance: if the interval includes/excludes the value 0, the effect is statistically insignificant/significant, while if it lies entirely within/entirely outside the equivalence bounds, the effect is practically insignificant/significant. The four options are shown in the figure.
Statistical vs. practical significance (from https://en.wikipedia.org/wiki/Equivalence_test)
How will sample size affect equivalence testing? We know that in ordinary hypothesis testing a sufficiently large sample will lead to detecting practically insignificant effects by generating a very small p-value, which is bad news for those relying on classic hypothesis testing! My colleague Foster Provost from NYU once challenged me: how could I trust a statistical method that breaks down with large samples? A poignant thought that eventually led to my co-authored paper Too Big To Fail: Large Samples and the p-value Problem (Lin et al., ISR 2013). What about equivalence tests? In equivalence testing, a very large sample behaves properly: with more data we get narrower confidence intervals (more certainty). A practically insignificant difference will therefore generate a narrow confidence interval that falls entirely within the equivalence bounds (= equivalence), while a practically significant difference will generate a narrow confidence interval that lies entirely beyond the equivalence bound (= non-equivalence).
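
As a quick illustration of this point (my own simulation, not taken from the sources above), the confidence interval for a mean difference narrows roughly with the square root of the sample size, so a practically insignificant difference of 0.1 against an assumed equivalence margin of 0.5 ends up comfortably inside the bounds once n is large:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_diff = 0.1                       # practically insignificant difference
for n in [100, 1_000, 100_000]:       # per-group sample sizes
    x1 = rng.normal(true_diff, 2, n)
    x2 = rng.normal(0.0, 2, n)
    diff = x1.mean() - x2.mean()
    se = np.sqrt(x1.var(ddof=1) / n + x2.var(ddof=1) / n)
    half = stats.norm.ppf(0.95) * se  # 90% CI half-width
    print(n, round(diff - half, 3), round(diff + half, 3))
```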




Monday, May 28, 2012

Linear regression for a binary outcome: is it Kosher?

Regression models are the most popular tool for modeling the relationship between an outcome and a set of inputs. Models can be used for descriptive, causal-explanatory, and predictive goals (but in very different ways! see Shmueli 2010 for more).

The family of regression models includes two especially popular members: linear regression and logistic regression (with probit regression more popular than logistic in some research areas). Common knowledge, as taught in statistics courses, is: use linear regression for a continuous outcome and logistic regression for a binary or categorical outcome. But why not use linear regression for a binary outcome? The two common answers are: (1) linear regression can produce predictions outside the [0,1] range, and hence "nonsense", and (2) inference based on the linear regression coefficients will be incorrect.

I admit that I bought into these "truths" for a long time, until I learned never to take any "statistical truth" at face value. First, let us realize that problem #1 relates to prediction and #2 to description and causal explanation. In other words, if issue #1 can be "fixed" somehow, then I might consider linear regression for prediction even if the inference is wrong (who cares about inference if I am only interested in predicting individual observations?). Similarly, if there is a fix for issue #2, then I might consider linear regression as a kosher inference mechanism even if it produces "nonsense" predictions.

The 2009 paper Linear versus logistic regression when the dependent variable is a dichotomy by Prof. Ottar Hellevik from the University of Oslo demystifies some of these issues. First, he gives some tricks that help avoid predictions outside the [0,1] range. He identifies a few factors that contribute to "nonsense predictions" by linear regression:

  • interactions that are not accounted for in the regression
  • non-linear relationships between a predictor and the outcome
The suggested remedy for these issues is to include interaction terms for the categorical variables and, if numerical predictors are involved, to bucket them into bins and include the bins as dummies plus interactions (a code sketch of this approach follows below). So, if the goal is predicting a binary outcome, linear regression can be modified and used.
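
Here is a minimal sketch of that binning-plus-dummies idea. The variable names, cut points, and simulated data are hypothetical, and statsmodels is just one of several ways to fit the model:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: binary outcome y, numeric predictor age, categorical group
rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "age": rng.uniform(18, 80, n),
    "group": rng.choice(["A", "B"], n),
})
p = 0.2 + 0.004 * df["age"] + 0.1 * (df["group"] == "B")
df["y"] = rng.binomial(1, p.clip(0, 1))

# Bin the numeric predictor, then include the bins as dummies plus interactions
df["age_bin"] = pd.cut(df["age"], bins=[18, 30, 45, 60, 80],
                       include_lowest=True).astype(str)
lpm = smf.ols("y ~ C(age_bin) * C(group)", data=df).fit()

# A fully saturated dummy model predicts cell proportions, so predictions
# cannot fall outside the [0, 1] range
pred = lpm.predict(df)
print(pred.min(), pred.max())
```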

Now to the inference issue. "The problem with a binary dependent variable is that the homoscedasticity assumption (similar variation on the dependent variable for units with different values on the independent variable) is not satisfied... This seems to be the main basis for the widely held opinion that linear regression is inappropriate with a binary dependent variable". Statistical theory tells us that violating the homoscedasticity assumption results in biased standard errors for the coefficients, and that the coefficient estimates are no longer the most precise (lowest-variance) ones. Yet the coefficients themselves remain unbiased, so with a sufficiently large sample they are "on target" and we need not worry: precision is not an issue in very large samples, and hence the on-target coefficients are just what we need.
I will add that another concern is that the normality assumption is violated: the residuals from a regression model on a binary outcome will not look very bell-shaped... Again, with a sufficiently large sample the residual distribution makes little difference, since the standard errors are so small anyway.
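
To see the "unbiased coefficients, shrinking standard errors" point in action, here is a small simulation of my own (not from Hellevik's paper): the data are generated from a true linear probability model, and OLS recovers the slope despite non-normal, heteroscedastic errors.

```python
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1 = 0.2, 0.05        # true linear-probability coefficients
n, reps = 10_000, 200

slopes = []
for _ in range(reps):
    x = rng.uniform(0, 10, n)
    y = rng.binomial(1, beta0 + beta1 * x)     # binary outcome, P(y=1) linear in x
    X = np.column_stack([np.ones(n), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)  # plain OLS fit
    slopes.append(b[1])

slopes = np.array(slopes)
print(slopes.mean())   # close to the true 0.05: the estimator is on target
print(slopes.std())    # small, and it shrinks further as n grows
```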

Chart from Hellevik (2009)
Hellevik's paper pushes the envelope further in an attempt to explore "how small can you go" with your sample before getting into trouble. He uses simulated data and compares the results from logistic and linear regression for fairly small samples. He finds that the differences are minuscule.

The bottom line: linear regression is kosher for prediction if you take a few steps to accommodate non-linear relationships (but of course it is not guaranteed to produce better predictions than logistic regression!). For inference, for a sufficiently large sample where standard errors are tiny anyway, it is fine to trust the coefficients, which are in any case unbiased.

Sunday, March 11, 2012

Big Data: The Big Bad Wolf?

"Big Data" is a big buzzword. I bet that sentiment analysis of news coverage, blog posts and other social media sources would show a strong positive sentiment associated with Big Data. What exactly is big data depends on who you ask. Some people talk about lots of measurements (what I call "fat data"), others of huge numbers of records ("long data"), and some talk of both. How much is big? Again, depends who you ask.

As a statistician who has (luckily) strayed into data mining, I initially had the traditional knee-jerk reaction of "just get a good sample and get it over with", but later recognized that "fitting the data to the toolkit" (or, "to a hammer everything looks like a nail") straight-jackets some great opportunities.

The LinkedIn group Advanced Business Analytics, Data Mining and Predictive Modeling reacted passionately to the question "What is the value of Big Data research vs. good samples?" posted by statistician and analytics veteran Michael Mout. Respondents have been mainly from industry - statisticians and data miners. I'd say that a sentiment analysis would come out mixed, but slightly negative at first ("at some level, big data is not necessarily a good thing"; "as statisticians, we need to point out the disadvantages of Big Data"). Over time, sentiment appears to be more positive, but nowhere near the huge Big Data excitement in the media.

I created a Wordle of the text in the discussion until today (size represents frequency). It highlights the main advantages and concerns of Big Data. Let me elaborate:
  • Big data permit the detection of complex patterns (small effects, high order interactions, polynomials, inclusion of many features) that are invisible with small data sets
  • Big data allow studying rare phenomena, where a small percentage of records contain an event of interest (fraud, security)
  • Sampling is still highly useful with big data (see also blog post by Meta Brown); with the ability to take lots of smaller samples, we can evaluate model stability, validity and predictive performance
  • Statistical significance and p-values become meaningless when statistical models are fitted to very large samples. It is then practical significance that plays the key role.
  • Big data support the use of algorithmic data mining methods that are good at feature selection. Of course, it is still necessary to use domain knowledge to avoid "garbage-in-garbage-out"
  • Such algorithms might be black-boxes that do not help understand the underlying relationship, but are useful in practice for predicting new records accurately
  • Big data allow the use of many non-parametric methods (statistical and data mining algorithms) that make far fewer assumptions about the data (such as independence of observations)
Thanks to social media, we're able to tap into many brains that have experience, expertise and... some preconceptions. The data collected from such forums can help us researchers focus our efforts on the needed theoretical investigation of Big Data, and help move from sentiments to theoretically-backed, practically-useful knowledge.

Wednesday, September 07, 2011

Multiple testing with large samples

Multiple testing (or multiple comparisons) arises when multiple hypotheses are tested on the same dataset via statistical inference. If each test has false alert level α, then the combined false alert rate of testing k hypotheses (also called the "overall type I error rate") can be as large as 1-(1-α)^k, which approaches 1 quickly as the number of hypotheses k grows. This is a serious problem, and ignoring it can lead to false discoveries. See an earlier post with links to examples.

There are various proposed corrections for multiple testing, the most basic principle being to reduce the individual α's. However, the various corrections all suffer, in one way or another, from reduced statistical power (the probability of detecting a real effect). One important approach is to limit the number of hypotheses to be tested. None of this is new to statisticians, nor to some circles of researchers in other areas (a 2008 technical report by the US Department of Education nicely summarizes the issue and proposes solutions for education research).
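
A quick numeric illustration of the formula, and of the most basic correction (Bonferroni, which divides α by the number of tests) - the numbers here are just an example:

```python
alpha, k = 0.05, 20

fwer = 1 - (1 - alpha) ** k        # chance of at least one false alert
print(round(fwer, 3))              # ~0.642 for 20 independent tests at alpha = 0.05

alpha_bonf = alpha / k             # Bonferroni-corrected per-test level
print(round(1 - (1 - alpha_bonf) ** k, 3))   # ~0.049: overall rate back near 0.05
```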

"Large-Scale" = many measurements
The multiple testing challenge has become especially prominent in analyzing micro-array genomic data, where datasets have measurements on many genes (k) for a few people (n). In this new area, inference is used more in an exploratory fashion, rather than confirmatory. The literature on "large-k-small-n" problems has also grown considerably since, including a recent book Large-Scale Inference by Bradley Efron.

And now I get to my (hopefully novel) point: empirical research in the social sciences is now moving to the era of "large n and same old k" datasets. This is what I call "large samples". With large datasets becoming more easily available, researchers test a few hypotheses using tens or hundreds of thousands of observations (such as lots of online auctions on eBay or many books on Amazon). Yet the focus has remained on confirmatory inference, where a set of hypotheses derived from a theoretical model is tested using data. What happens to multiple testing issues in this environment? My claim is that they are gone! Decrease α to your liking, and you will still have more statistical power than you can handle.
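
To get a feel for the claim, here is a back-of-the-envelope calculation of my own, using a normal approximation to the power of a two-sided two-sample test of means: even with α slashed to 0.0001, a tiny effect is detected almost surely when each group has hundreds of thousands of observations.

```python
from scipy import stats

def power_two_sample(d, n_per_group, alpha):
    """Approximate power of a two-sided two-sample z-test for a
    standardized effect size d with n_per_group observations per group."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    noncentrality = d * (n_per_group / 2) ** 0.5
    return 1 - stats.norm.cdf(z_crit - noncentrality)

# Tiny standardized effect, drastically reduced alpha, huge n: power is ~1
print(power_two_sample(d=0.02, n_per_group=500_000, alpha=0.0001))
```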

But wait, it's not so simple: With very large samples, the p-value challenge kicks in, such that we cannot use statistical significance to infer practically significant effects. Even if we decrease α to a tiny number, we'll still likely get lots of statistically-significant-but-practically-meaningless results.

The bottom line is that with large samples (large-n-same-old-k), the approach to analyzing data is totally different: no need to worry about multiple testing, which is so crucial in small samples. This is only one among many other differences between small-sample and large-sample data analysis.


Friday, June 17, 2011

Scatter plots for large samples

While huge datasets have become ubiquitous in fields such as genomics, large datasets are now also infiltrating research in the social sciences. Data from eCommerce sites, online dating sites, etc. are now collected as part of research in information systems, marketing and related fields. We can now find social science research papers with hundreds of thousands of observations and more.

A common type of research question in such studies is about the relationship between two variables. For example, how does the final price of an online auction relate to the seller's feedback rating? A classic exploratory tool for examining such questions (before delving into formal data analysis) is the scatter plot. In small sample studies, scatter plots are used for exploring relationships and detecting outliers.

Image from http://prsdstudio.com/ 
With large samples, however, the scatter plot runs into a few problems. With lots of observations, there is likely to be too much overlap between markers on the scatter plot, even to the point of insufficient pixels to display all the points.

Here are some large-sample strategies to make scatter plots useful:

  1. Aggregation: display groups of observations in a certain area on the plot as a single marker. Size or color can denote the number of aggregated observations (see the code sketch after this list).
  2. Small multiples: split the data into multiple scatter plots by breaking down the data into (meaningful) subsets. Breaking down the data by geographical location is one example. Make sure to use the same axis scales on all plots - this will be done automatically if your software allows "trellising".
  3. Sample: draw smaller random samples from the large dataset and plot them in multiple scatter plots (again, keep the axis scales identical on all plots).
  4. Zoom-in: examine particular areas of the scatter plot by zooming in.
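
A minimal sketch of strategies 1 and 3 using matplotlib and pandas; the column names, distributions, and sample sizes are made up purely for illustration:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical large dataset: auction price vs. seller feedback rating
rng = np.random.default_rng(3)
n = 500_000
df = pd.DataFrame({
    "feedback": rng.gamma(2, 500, n),
    "price": rng.gamma(2, 20, n),
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Strategy 1: aggregation - hexagonal binning, where color encodes counts
axes[0].hexbin(df["feedback"], df["price"], gridsize=40, cmap="Blues")
axes[0].set_title("Aggregated (hexbin)")

# Strategy 3: plot a small random sample, keeping the same axis limits
sample = df.sample(2_000, random_state=1)
axes[1].scatter(sample["feedback"], sample["price"], s=5, alpha=0.3)
axes[1].set_xlim(axes[0].get_xlim())
axes[1].set_ylim(axes[0].get_ylim())
axes[1].set_title("Random sample (n = 2,000)")

plt.tight_layout()
plt.show()
```
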
Finally, with large datasets it is useful to consider charts that are based on aggregation such as histograms and box plots. For more on visualization, see the Visualization chapter in Data Mining for Business Intelligence.

Monday, December 13, 2010

Discovering moderated relationships in the era of large samples

I am currently visiting the Indian School of Business (ISB) and enjoying their excellent library. As in my student days, I roam the bookshelves and discover books on topics that I know little, some, or a lot. Reading and leafing through a variety of books, especially across different disciplines, gives some serious points for thought.

As a statistician I have the urge to see how statistics is taught and used in other disciplines. I discovered an interesting book from the psychology literature by Herman Aguinis called Regression Analysis for Categorical Moderators. "Moderators" in statistician language are "interactions". However, when social scientists talk about moderated relationships or moderator variables, there is no symmetry between the two variables that create the interaction. For example, if X1=education level, X2=Gender, and Y=Satisfaction at work, then inclusion of the moderator X1*X2 would follow a direct hypothesis such as "education level affects satisfaction at work differently for women and for men."
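
In regression terms, a moderated relationship is simply an interaction term. Here is a minimal sketch with hypothetical variable names and simulated data, using a statsmodels formula:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: does education affect satisfaction differently by gender?
rng = np.random.default_rng(5)
n = 50_000                       # "large n", in the spirit of this post
df = pd.DataFrame({
    "education": rng.integers(10, 22, n),
    "gender": rng.choice(["F", "M"], n),
})
slope = np.where(df["gender"] == "F", 0.15, 0.05)   # moderated (gender-specific) slope
df["satisfaction"] = 2 + slope * df["education"] + rng.normal(0, 2, n)

# "education * C(gender)" expands to both main effects plus their interaction
model = smf.ols("satisfaction ~ education * C(gender)", data=df).fit()
print(model.params)   # the education:C(gender)[T.M] coefficient is the moderation effect
```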

Now to the interesting point: Aguinis stresses the scientific importance of discovering moderated relationships and opens the book with the quote:
"If we want to know how well we are doing in the biological, psychological, and social sciences, an index that will serve us well is how far we have advanced in our understanding of the moderator variables of our field."        --Hall & Rosenthal, 1991
Discovering moderators is important for understanding the bounds of generalizability as well as for leading to adequate policy recommendations. Yet, it turns out that "Moderator variables are difficult to detect even when the moderator test is the focal issue in a research study and a researcher has designed the study specifically with the moderator test in mind."

One main factor limiting the ability to detect moderated relationships (which tend to have small effects) is statistical power. Aguinis describes simulation studies showing this:
a small effect size was typically undetected when sample size was as large as 120, and ...unless a sample size of at least 120 was used, even ... medium and large moderating effects were, in general, also undetected.
This is bad news. But here is the good news: today, even researchers in the social sciences have access to much larger datasets! Clearly n=120 is in the past. Since this book came out in 2004, have there been large-sample studies of moderated relationships in the social sciences?

I guess that's where searching electronic journals is the way to go...