Tuesday, October 27, 2009

Testing directional hypotheses: p-values can bite

I've recently had interesting discussions with colleagues in Information Systems regarding testing directional hypotheses. Following their request, I'm posting about this apparently elusive issue.

In information systems research, the most common type of hypothesis is directional, i.e. the parameter of interest is hypothesized to go in a certain direction. An example would be testing the hypothesis that teenagers are more likely than older folks to use Facebook. Another example is the hypothesis that higher opening bids on eBay lead to higher final prices. In the Facebook example, the researcher would test the hypothesis by gathering data on Facebook usage in each age group and comparing the group averages; if the teenagers' average is sufficiently larger, then the hypothesis would be supported (at some significance level). In the eBay example, a researcher might collect information on many eBay auctions and fit a regression of final price on the opening bid (controlling for other factors). If the regression coefficient turns out to be sufficiently larger than zero, then the researcher could conclude that the hypothesized effect holds (let's put aside issues of causality for the moment).

More formally, for the Facebook hypothesis the test statistic would be a T statistic of the form
T = (teenager average - older-folks average) / standard error
The test statistic for the eBay example would also be a T statistic, of the form
T = opening-bid regression coefficient / standard error
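To make the mechanics concrete, here is a minimal sketch in Python of the Facebook-style comparison. The usage numbers are made up purely for illustration; the point is just how the T statistic is assembled from the group averages and the standard error.

import numpy as np

# hypothetical weekly Facebook usage (hours) per person in each group - made-up numbers
teen_usage = np.array([10.0, 12.0, 9.0, 14.0, 11.0, 13.0])
older_usage = np.array([7.0, 6.0, 9.0, 5.0, 8.0, 7.0])

diff = teen_usage.mean() - older_usage.mean()
# standard error of the difference in means (unpooled version)
se = np.sqrt(teen_usage.var(ddof=1) / len(teen_usage) +
             older_usage.var(ddof=1) / len(older_usage))
t_stat = diff / se
print(t_stat)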

Note an important point here: when stating a hypothesis as above (namely, "the alternative hypothesis"), there is always a null hypothesis that is the default. This null hypothesis is often not stated explicitly in Information Systems articles, but let's be clear: in directional hypotheses such as the ones above, the null hypothesis includes both the "no effect" and the "opposite-direction effect" scenarios. In the Facebook example, the null includes both the case that teenagers and older folks use Facebook equally and the case that teenagers use Facebook less than older folks. In the eBay example, the null includes both "opening bid doesn't affect final price" and "a higher opening bid lowers the final price".

Getting back to the T test statistics (or any other test statistic, for that matter): to evaluate whether T is sufficiently extreme to reject the null hypothesis (and support the researcher's hypothesis), information systems researchers typically use a p-value and compare it to some significance level. BUT computing the p-value must take into account the directionality of the hypothesis! The default p-value that you get from running a regression model in any standard software is for a non-directional hypothesis. To get the directional p-value, you either divide that p-value by 2, if the sign of the T statistic is in the "right" direction (positive if your hypothesis said positive; negative if your hypothesis said negative), or you use 1 - p-value/2 if it is not. In the first case, mistakenly using the software p-value means missing out on real effects (a loss of statistical power), while in the latter case you might infer an effect when there is none (or when the effect is actually in the opposite direction).
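Here is a rough illustration of that adjustment in Python using statsmodels. The auction data are simulated and the variable names are mine, so treat this as a sketch of the p-value arithmetic rather than a recipe.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
# hypothetical auction data: opening bid and final price (simulated)
opening_bid = rng.uniform(1, 50, size=200)
price = 5 + 0.8 * opening_bid + rng.normal(0, 10, size=200)

fit = sm.OLS(price, sm.add_constant(opening_bid)).fit()
t = fit.tvalues[1]            # T statistic for the opening-bid coefficient
p_two_sided = fit.pvalues[1]  # what the software reports (non-directional)

# hypothesis: coefficient > 0, so the null is "coefficient <= 0"
p_directional = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
print(p_directional)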

The solution to this confusion is to examine each hypothesis for its directionality (think about what the null hypothesis is), then construct the corresponding p-value carefully. Some tests in some software packages will allow you to specify the direction and will give you a "kosher" p-value. But in many cases, regression being one, most software will only spit out the non-directional p-value. Or just get a die-hard statistician on board.
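For instance, recent versions of scipy let you state the direction in a two-sample t-test through the alternative argument, so the returned p-value is already the directional one (again with made-up usage numbers):

import numpy as np
from scipy import stats

teen_usage = np.array([10.0, 12.0, 9.0, 14.0, 11.0, 13.0])
older_usage = np.array([7.0, 6.0, 9.0, 5.0, 8.0, 7.0])

# alternative='greater' tests H1: mean(teen_usage) > mean(older_usage)
t_stat, p_one_sided = stats.ttest_ind(teen_usage, older_usage, alternative='greater')
print(t_stat, p_one_sided)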

Which reminds me again why I don't like p-values. For lovers of confidence intervals, I promise to post about confidence intervals for directional hypotheses (what is the sound of a one-sided confidence interval?).


6 comments:

Elyas Akram said...

If you cringe at the thought/overuse of p-values, do you feel the same way about confidence intervals? Aren't they compatible, or is the issue more that people are prone to misusing p-values but, for some unexplained reason, reference confidence intervals properly?

Elyas Akram

Galit Shmueli said...

A confidence interval is not prone to the gap between statistical significance and practical significance, because you are forced to look at the magnitude of the coefficient. For instance, an interval such as (.1111111, .1111112) indicates that the population coefficient is most likely in this range. Whether this magnitude is meaningful or not depends on the application, but you can't ignore it as you might when you say "the p-value was 0, so the coefficient is statistically significant". Yes, confidence intervals are the way.
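(As a rough, self-contained illustration in Python: a regression fit reports such an interval directly, e.g. with statsmodels' conf_int. The data below are simulated and the slope of 0.8 is arbitrary.)

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 50, size=200)               # hypothetical predictor
y = 5 + 0.8 * x + rng.normal(0, 10, size=200)  # hypothetical response

fit = sm.OLS(y, sm.add_constant(x)).fit()
# 95% interval for the slope: judge the magnitude, not just "significant or not"
print(fit.conf_int(alpha=0.05)[1])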

Unknown said...

How come other professors are not encouraging students to use the confidence interval instead of p-values when doing hypothesis testing? Is this a relatively new finding?

Unknown said...

Could it be that, for the regression case, the "no effect" null hypothesis is tested by checking the p-value, and the "opposite directional effect" null hypothesis is tested by checking the sign of the beta together with the p-value?

If the p-value is significant but the beta is negative instead of positive, could that reject the "opposite directional effect" null hypothesis?

Galit Shmueli said...

Roxana - Sorry I missed your comment until now. But to respond to your question: While this problem with p-values is not a new issue, it has become much more prevalent due to the availability of large datasets. Most statistics textbooks (especially those written for the social sciences) were written in the mindset of too-little-data.

Although confidence intervals can be used in place of p-values, the research culture in many fields has been trained for years to use p-values. It's hard to change cultures...

Galit Shmueli said...

Ivan,
If you are very careful in specifying your null hypothesis, then at least you will be testing the correct hypothesis. If your null hypothesis in a regression setting is beta=0 (i.e., non-directional), then indeed the p-value that the software yields tests that hypothesis. If your null is directional and the resulting coefficient is statistically significant (according to the p-value) but with the sign in the opposite direction, then indeed that says quite strongly that the effect is in the opposite direction of your hypothesis. However, there are philosophical issues with constructing an opposite hypothesis test after finding out that the sign is in the opposite direction... The idea in hypothesis testing is to set the null according to a theory, and then to use the data and statistical test to test that theory.