
Monday, March 02, 2015

Psychology journal bans statistical inference; knocks down server

In its recent editorial, the journal Basic and Applied Social Psychology announced that it will no longer accept papers that use classical statistical inference. No more p-values, t-tests, or even... confidence intervals! 
"prior to publication, authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about ‘‘significant’’ differences or lack thereof, and so on)... confidence intervals also are banned from BASP"
Many statisticians would agree that it is high time to move on from p-values and statistical inference to practical significance, estimation, more elaborate non-parametric modeling, and resampling for avoiding assumption-heavy models. This is especially so now, when datasets are becoming larger and technology is able to measure more minute effects. 

In our 2013 paper "Too Big To Fail: Large Samples and the p-value Problem" we raise the serious issue of p-value-based decision making when using very large samples. Many have asked us for solutions that scale up p-values, but we haven't come across one that really works. Our focus was on detecting when your sample is "too large", and we emphasized the importance of focusing on effect magnitude and precision (please do report standard errors!).
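To see the problem in action, here is a toy simulation (my own sketch, not from the paper): a practically negligible difference of 0.02 standard deviations between two groups becomes "highly significant" once each group has half a million observations.

```python
# Sketch of the large-sample p-value problem: a negligible true
# difference in means yields a tiny p-value when n is huge.
import math
import random

def two_sample_z_pvalue(x, y):
    """Two-sided z-test for a difference in means (large-sample normal approximation)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    se = math.sqrt(vx / nx + vy / ny)
    z = (mx - my) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return mx - my, p

random.seed(1)
tiny_effect = 0.02          # negligible in practical terms (sd = 1)
n = 500_000                 # per group
group_a = [random.gauss(tiny_effect, 1) for _ in range(n)]
group_b = [random.gauss(0, 1) for _ in range(n)]
diff, p = two_sample_z_pvalue(group_a, group_b)
print(f"observed difference = {diff:.4f}, p-value = {p:.2e}")
```

The observed difference stays around 0.02 standard deviations, yet the p-value is vanishingly small: "significance" here reflects the sample size, not practical importance.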

Machine learners would probably advocate finally moving to predictive modeling and evaluation. Predictive power is straightforward to measure, although it isn't always what social science researchers are looking for.

But wait. What this editorial dictates is only half a revolution: it says what it will ban, but it does not offer a cohesive alternative beyond simple summary statistics. Focusing on effect magnitude is great for making results matter, but without standard errors or confidence intervals we know nothing about the uncertainty of the effect. Abandoning any metric that relies on "had the experiment been replicated" is dangerous and misleading. First, the hypothetical replication is a philosophical device, not an actual re-experimentation. Second, to test whether effects found in a sample generalize to a population of interest, we need the ability to replicate the results. Standard errors give some indication of how replicable the results are under the same conditions.
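One assumption-light way to report an effect together with its uncertainty is a bootstrap standard error, which needs none of the banned NHSTP machinery. A quick sketch with made-up data (the group sizes and true effect are mine, purely illustrative):

```python
# Report an effect magnitude with a bootstrap standard error:
# resample each group with replacement and look at the spread
# of the recomputed effect estimates.
import random
import statistics

random.seed(2)
treatment = [random.gauss(0.5, 1) for _ in range(200)]  # simulated data
control = [random.gauss(0.0, 1) for _ in range(200)]

observed = statistics.mean(treatment) - statistics.mean(control)

boot_diffs = []
for _ in range(2000):
    t = random.choices(treatment, k=len(treatment))  # resample with replacement
    c = random.choices(control, k=len(control))
    boot_diffs.append(statistics.mean(t) - statistics.mean(c))

se = statistics.stdev(boot_diffs)  # bootstrap standard error of the effect
print(f"effect = {observed:.3f}, bootstrap SE = {se:.3f}")
```

Reporting "effect = 0.5, SE = 0.1" tells the reader both how big the effect is and how much it would wobble under replication, which is exactly what a bare summary statistic hides.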

Controversial editorial leads to heavy traffic on journal server

BASP's revolutionary decision has been gaining attention outside of psychology (a great tactic to promote a journal!), so much so that at times it is difficult to reach the controversial editorial. Some statisticians have blogged about the decision, others are tweeting. This is a great way to open a discussion about empirical analysis in the social sciences, but we need to come up with alternatives that focus on uncertainty and the ultimate goal of generalization.

Thursday, March 29, 2007

Stock performance and CEO house size study

The BusinessWeek article "The CEO Mega-Mansion Factor" (April 2, 2007) definitely caught my attention -- two finance professors (Liu and Yermack) collected data on the house sizes of CEOs of S&P 500 companies in 2004. Their theory is "If home purchases represent a signal of commitment by the CEO, subsequent stock performance of the company should at least remain unchanged and possibly improve. Conversely, if home purchases represent a signal of entrenchment, we would expect stock performance to decline after the time of purchase." The article summarizes the results: "[they] found that 12% of [CEOs] lived in homes of at least 10,000 square feet, or a minimum of 10 acres. And their companies' stocks? In 2005 they lagged behind those of S&P 500 CEOs living in smaller houses by 7%, on average".

At this point I had to find out more details! I tracked down the research article, "Where are the shareholder's mansions? CEOs' home purchases, stock sales, and subsequent company performance", which contains further details about the data and the analysis. The authors describe the tedious job of assembling the house data from multiple databases, dealing with missing values and with discrepancies between sources. A few questions come to mind:
  1. A plot of the value of the CEO's residence vs. CEO tenure in office (both on a log scale) has a suspicious fan shape, indicating that the variability in residence value increases with CEO tenure. If this is true, the fitted regression line (with slope .15) is not an adequate model and its interpretation is not valid. A simple look at the residuals would give the answer.
  2. The exploratory step indicates a gap between the stock performance of CEOs with below-median house sizes and those with above-median houses. The question is whether this gap is due to chance or reflects a true difference. In order to test the statistical significance of the difference, the researchers had to define a "large" house. Because of missing values, they adopted the following rule:
    "We adopt a simple scheme for classifying a CEO’s residence as “large” if it has either 10,000 square feet of floor area or at least 10 acres of land. While this rule is somewhat ad hoc, it fits our data nicely by identifying about 15% of the sample residences as extremely large."
    Since this is an arbitrary cutoff, it is important to evaluate its effect on the results: what happens if other cutoffs are used? Is there a better way to combine the non-missing information into a better metric?
  3. The main statistical tests, which compare the stock performance across house types (above- vs. below-median market value; "large" vs. not-"large" homes), are a series of t-tests comparing means and Wilcoxon tests comparing medians. Of the 8 tests performed, only one yielded a p-value below 5%: a difference between the median stock performance of "large-home" CEOs and "not-large-home" CEOs. Recall that this is based on the arbitrary definition of a "large" home. In other words, the differences in stock performance do not appear to be strongly statistically significant. This might change if the sample sizes were increased -- a large number of observations was dropped due to missing values.
  4. Finally, another interesting point is how the model can be used. BusinessWeek quotes Yermack: "If [the CEO] buys a big mansion, sell the stock". Such a claim means that house size is predictive of stock performance. However, the model (as described in the research paper) was not constructed as a predictive model: there is no holdout set to evaluate predictive accuracy, and no predictive measures are mentioned. Finding a statistically significant relationship between house size and subsequent stock performance is not necessarily indicative of predictive power.
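The holdout evaluation that a "sell the stock" claim would require can be sketched as follows. The data here are simulated (not the authors' sample, and deliberately generated with no house-size effect): fit a regression of next-year return on house size in a training set, then compare its holdout error to a predict-the-mean baseline.

```python
# Holdout evaluation sketch: does house size predict stock return
# better than simply predicting the average return?
import random
import statistics

random.seed(3)
n = 400
# Hypothetical house sizes (sq ft) and next-year returns; in this
# simulation the returns are unrelated to house size by construction.
house_size = [random.lognormvariate(8.5, 0.5) for _ in range(n)]
stock_return = [random.gauss(0.07, 0.15) for _ in range(n)]

train = range(0, 300)   # fit on the first 300 CEOs
test = range(300, n)    # evaluate on the held-out 100

def ols(xs, ys):
    """Simple least-squares fit; returns (intercept, slope)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def rmse(preds, actual):
    return (sum((p - y) ** 2 for p, y in zip(preds, actual)) / len(actual)) ** 0.5

a, b = ols([house_size[i] for i in train], [stock_return[i] for i in train])
model_rmse = rmse([a + b * house_size[i] for i in test],
                  [stock_return[i] for i in test])
baseline = statistics.mean(stock_return[i] for i in train)
naive_rmse = rmse([baseline] * len(list(test)), [stock_return[i] for i in test])
print(f"holdout RMSE: model = {model_rmse:.3f}, mean-only baseline = {naive_rmse:.3f}")
```

When the predictor carries no real signal, the model's holdout RMSE sits right at the baseline's; a statistically significant in-sample coefficient would need to beat this kind of benchmark before it can justify trading advice.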