Thursday, December 23, 2010

No correlation -> no causation?

I found an interesting variation on the "correlation does not imply causation" mantra in the book Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences by Cohen et al. (apparently one of the statistics bibles in the behavioral sciences). The quote (p. 7) reads:
Correlation does not prove causation; however, the absence of correlation implies the absence of the existence of a causal relationship
Let's let the first part rest in peace. At first glance, the second part seems logical: if you find no correlation, how can there be causation? However, after further pondering I reached the conclusion that this logic is flawed, and that one might observe no correlation when in fact there is underlying causation. The reason is that causality is typically discussed at the conceptual level, while correlation is computed at the level of measurable data.

Where is Waldo?
Consider an example where causality is hypothesized at an unmeasurable conceptual level, such as "higher creativity leads to more satisfaction in life". Computing the correlation between "creativity" and "satisfaction" requires operationalizing these concepts into measurable variables, that is, identifying measurable variables that adequately represent these underlying concepts. For example, answers to survey questions regarding satisfaction in life might be used to operationalize "satisfaction", while a Rorschach test might be used to measure "creativity". This process of operationalization obviously does not lead to perfect measures, not to mention that data quality can be sufficiently low to produce no correlation even if there exists an underlying causal relationship.

In short, the absence of correlation can also imply that the underlying concepts are hard to measure, are inadequately measured, or that the quality of the measured data is too low (i.e., too noisy) for discovering a causal underlying relationship.
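To make this concrete, here is a minimal simulation sketch (Python, with the blog's creativity/satisfaction example, made-up noise levels, purely illustrative): a strong causal relationship at the concept level can yield an observed correlation near zero once both concepts are measured with enough error.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Unobservable concepts: creativity causally drives satisfaction
creativity = rng.normal(size=n)
satisfaction = 0.8 * creativity + rng.normal(scale=0.6, size=n)

# Operationalized measures: noisy proxies of the concepts (noise levels are made up)
creativity_score = creativity + rng.normal(scale=3.0, size=n)
satisfaction_score = satisfaction + rng.normal(scale=3.0, size=n)

print(np.corrcoef(creativity, satisfaction)[0, 1])              # strong correlation at the concept level
print(np.corrcoef(creativity_score, satisfaction_score)[0, 1])  # much weaker, near zero, at the measured-data level
```

The attenuation here is purely a measurement artifact; the causal mechanism in the simulation never changes.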

Monday, December 13, 2010

Discovering moderated relationships in the era of large samples

I am currently visiting the Indian School of Business (ISB) and enjoying their excellent library. As in my student days, I roam the bookshelves and discover books on topics that I know little, some, or a lot about. Reading and leafing through a variety of books, especially across different disciplines, gives serious food for thought.

As a statistician I have the urge to see how statistics is taught and used in other disciplines. I discovered an interesting book from the psychology literature by Herman Aguinis called Regression Analysis for Categorical Moderators. "Moderators" in statistician language are "interactions". However, when social scientists talk about moderated relationships or moderator variables, there is no symmetry between the two variables that create the interaction. For example, if X1=education level, X2=Gender, and Y=Satisfaction at work, then including the moderator X1*X2 would follow a directional hypothesis such as "education level affects satisfaction at work differently for women and for men."
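For readers who prefer code to terminology, a moderated regression is simply a regression with an interaction term. Here is a minimal sketch (simulated data, hypothetical coefficients and variable names) using Python's statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "education": rng.normal(size=n),       # X1: education level (standardized)
    "female": rng.integers(0, 2, size=n),  # X2: gender indicator
})
# Moderated relationship: education affects satisfaction differently for women and men
df["satisfaction"] = (0.2 * df["education"]
                      + 0.1 * df["female"]
                      + 0.3 * df["education"] * df["female"]
                      + rng.normal(size=n))

# "satisfaction ~ education * female" expands to both main effects plus the interaction
model = smf.ols("satisfaction ~ education * female", data=df).fit()
print(model.params)                       # coefficients, including the interaction
print(model.pvalues["education:female"])  # the moderator test is the test of this term
```

The asymmetry social scientists have in mind lives in the hypothesis, not in the model: the fitted interaction term itself is symmetric in the two variables.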

Now to the interesting point: Aguinis stresses the scientific importance of discovering moderated relationships and opens the book with the quote:
"If we want to know how well we are doing in the biological, psychological, and social sciences, an index that will serve us well is how far we have advanced in our understanding of the moderator variables of our field."        --Hall & Rosenthal, 1991
Discovering moderators is important for understanding the bounds of generalizability as well as for leading to adequate policy recommendations. Yet, it turns out that "Moderator variables are difficult to detect even when the moderator test is the focal issue in a research study and a researcher has designed the study specifically with the moderator test in mind."

One main factor limiting the ability to detect moderated relationships (which tend to have small effects) is statistical power. Aguinis describes simulation studies showing this:
a small effect size was typically undetected when sample size was as large as 120, and ...unless a sample size of at least 120 was used, even ... medium and large moderating effects were, in general, also undetected.
This is bad news. But here is the good news: today, even researchers in the social sciences have access to much larger datasets! Clearly n=120 is a thing of the past. Since this book came out in 2004, have there been large-sample studies of moderated relationships in the social sciences?
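For intuition on the power problem, here is a rough simulation sketch (assumed effect sizes and error variance, not the Aguinis simulations themselves) comparing how often a small interaction effect is detected at n=120 versus a much larger sample:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def interaction_power(n, beta_int=0.1, reps=500, alpha=0.05, seed=0):
    """Share of simulated studies in which the interaction term is significant."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        x = rng.normal(size=n)           # continuous predictor
        g = rng.integers(0, 2, size=n)   # categorical moderator
        y = 0.3 * x + 0.2 * g + beta_int * x * g + rng.normal(size=n)
        df = pd.DataFrame({"x": x, "g": g, "y": y})
        p = smf.ols("y ~ x * g", data=df).fit().pvalues["x:g"]
        hits += p < alpha
    return hits / reps

print(interaction_power(n=120))   # low power for a small moderating effect
print(interaction_power(n=5000))  # much higher power with a "large-sample era" n
```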

I guess that's where searching electronic journals is the way to go...

Tuesday, November 16, 2010

November Analytics magazine on BI

A bunch of interesting articles about business analytics and predictive analytics from a managerial point of view, in the November issue of INFORMS Analytics magazine.

Sunday, November 14, 2010

Data visualization in the media: Interesting video

A colleague who knows my fascination with data visualization pointed me to a recent interesting video created by Geoff McGhee on Journalism in the Age of Data. In this 8-part video, he interviews media people who create visualizations for their websites at the New York Times, Washington Post, CNBC, and more. It is interesting to see their view of why interactive visualization might be useful to their audience, and how it is linked to "good journalism".

Also interviewed are a few visualization interface developers (e.g., IBM's Many Eyes designers) as well as infographics experts and participants at the major Infographics conference in Pamplona, Spain. The line between beautiful visualizations (art) and effective ones is discussed in Part IV ("too sexy for its own good" - Gert Nielsen) - see also John Grimwade's article.


Journalism in the Age of Data from Geoff McGhee on Vimeo.

The videos can be downloaded as a series of 8 podcasts, for those with narrower bandwidth.

Wednesday, November 10, 2010

ASA's magazine: Excel's default charts

Being in Bhutan this year, I have requested the American Statistical Association (ASA) and INFORMS to mail the magazines that come with my membership to Bhutan. Although I can access the magazines online, I greatly enjoy receiving the issues by mail (even if a month late) and leafing through them leisurely. Not to mention the ability to share them with local colleagues who are seeing these magazines for the first time!

Now to the data-analytic reason for my post: the main article in the August 2010 issue of AMSTAT News (the ASA's magazine), "Fellow Award: Revisited (Again)," presented an "update to previous articles about counts of fellow nominees and awardees." The article comprised many tables and line charts. While charts are a great way to present a data-based story, the charts in this article were of low quality (see image below). Apparently, the authors used Excel 2003's defaults, which have poor graphic qualities and too much chart junk: a dark grey background, horizontal gridlines, line colors not very suitable for black-and-white printing (such as the print issue), a redundant combination of line color and marker shape, and redundant decimals on several of the y-axis labels.


AMSTAT News is the flagship magazine of the ASA, so I hope that the editors will scrutinize the graphics and data visualizations used in its articles, and perhaps offer authors access to powerful data visualization software such as TIBCO Spotfire, Tableau, or SAS JMP. Major newspapers such as the New York Times and Washington Post now produce high-quality visualizations. Statistics magazines mustn't fall behind!
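Even free plotting libraries avoid most of these defaults. As a minimal sketch (made-up counts, Python/matplotlib, just to illustrate the specific fixes): white background, no gridlines, black lines distinguished by line style for black-and-white printing, and integer y-axis labels.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

# Made-up counts, only to contrast with Excel 2003's defaults
years = list(range(2001, 2011))
nominees = [38, 41, 40, 45, 47, 52, 55, 53, 58, 62]
awardees = [20, 22, 21, 24, 25, 27, 29, 28, 31, 33]

fig, ax = plt.subplots(facecolor="white")
ax.plot(years, nominees, color="black", linestyle="-", label="Nominees")
ax.plot(years, awardees, color="black", linestyle="--", label="Awardees")

ax.set_facecolor("white")                              # no dark grey plot background
ax.grid(False)                                         # no horizontal gridlines
ax.yaxis.set_major_locator(MaxNLocator(integer=True))  # counts need no decimals
ax.set_xlabel("Year")
ax.set_ylabel("Count")
ax.legend(frameon=False)
plt.show()
```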

Thursday, September 30, 2010

Neat data mining competition; strange rule?

I received notice of an upcoming data mining competition run by the Direct Marketing Association: predict the sales volume of magazines at 10,000 newsstands, using real data provided by CMP and Experian. The goal is officially stated as:
The winner will be the contestant who is able to best predict final store sales given the number of copies placed (draw) in each store. (Best will be defined as the root mean square error between the predicted and final sales.)
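In other words, the scoring metric is plain RMSE; a minimal sketch (hypothetical sales numbers):

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error between predicted and final sales."""
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

print(rmse([12, 8, 30], [10, 9, 25]))  # made-up store sales
```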
Among the usual competition rules about obtaining the data, evaluation criteria, etc. I found an odd rule stating: P.S. PARTICIPANTS MAY NOT INCLUDE ANY OTHER EXTERNAL VARIABLES FOR THE CHALLENGE. [caps are original]

It is surprising that contestants are not allowed to supplement the competition data with other, possibly relevant, information! In fact, "business intelligence" is often achieved by combining unexpected pieces of information. Clearly, the only type of information that should be allowed is information that is available at the time of prediction. For instance, although weather is likely to affect sales, it is a coincident indicator and would itself require forecasting in order to be included as a predictor. Hence, the weather at the time of sale should not be used, but perhaps the weather forecast can (the time lag between the time of prediction and the time of the forecast must, of course, be practical).

For details and signing up, see http://www.hearstchallenge.com.

Saturday, September 04, 2010

Forecasting stock prices? The new INFORMS competition

The 2010 INFORMS Data Mining Contest is underway. This time the goal is to predict 5-minute stock prices. That's right - forecasting stock prices! In my view, the meta-contest is going to be the most interesting part. By meta-contest I mean looking beyond the winning result (what method, what prediction accuracy)  and examining the distribution of prediction accuracies across all the contestants, how the winner is chosen, and most importantly, how the winning result will be interpreted in terms of concluding about the predictability level of stocks.

Why is a stock prediction competition interesting? Because according to the Efficient Market Hypothesis (EMH), stocks and other traded assets follow random walks (no autocorrelation between consecutive price jumps). In other words, they are unpredictable. Even if there is a low level of autocorrelation, the bid-offer spread and transaction costs make stock predictions worthless. I've been fascinated with how quickly and drastically the Wikipedia page on the Efficient Market Hypothesis has changed in recent years (see the page history). The proponents of the EMH seem to be competing with its opponents in revising the page. As of today, the opponents are ahead in terms of editing the page -- perhaps the recent crisis is giving them an advantage.
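To see what "random walk" means operationally, here is a small simulation sketch (simulated prices, not real market data): the price level is trivially autocorrelated, while the price changes -- the thing one would actually need to predict -- show essentially zero autocorrelation.

```python
import numpy as np

rng = np.random.default_rng(42)
returns = rng.normal(scale=0.001, size=5000)  # simulated 5-minute price changes
prices = 100 + np.cumsum(returns)             # random-walk price series

def lag1_autocorr(x):
    """Correlation between consecutive observations."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(lag1_autocorr(prices))   # close to 1: the price level looks "predictable"
print(lag1_autocorr(returns))  # close to 0: the jumps are not
```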

The contest's evaluation page explains that the goal is to forecast whether the stock price will increase or decrease in the next time period. Entries will then be evaluated in terms of the average AUC (area under the ROC curve). Framing the problem as binary prediction and using the AUC to evaluate the results adds a further challenge: the average AUC has various flaws as a measure of predictive accuracy. In a recent article in the journal Machine Learning, the well-known statistician Prof. David Hand shows that, in addition to other deficiencies, "...the AUC uses different misclassification cost distributions for different classifiers."

In any case, among the many participants in the competition there is going to be a winner. And that winner will have the highest prediction accuracy for that stock, at least in the sense of average AUC. No uncertainty about that. But will that mean that the winning method is the magic bullet for traders? Most likely not. Or, at least, I would not be convinced until I saw the method consistently outperform a random walk across a large number of stocks and different time periods. For one, I would want to see the distribution of results of the entire set of participants and compare it to a naive classifier to evaluate how "lucky" the winner was.
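To illustrate the "luck" concern, here is a simulation sketch (coin-flip price moves and purely random submissions, no real data): with enough contestants submitting noise, the best AUC can look respectable even though no one has any predictive ability.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_periods = 200       # test-set time periods for one stock
n_contestants = 1000  # contestants submitting essentially random scores

y = rng.integers(0, 2, size=n_periods)  # up/down moves of an unpredictable stock
aucs = [roc_auc_score(y, rng.random(n_periods)) for _ in range(n_contestants)]

print(np.mean(aucs))  # around 0.5, as expected for guessing
print(np.max(aucs))   # the "winner" sits well above 0.5 by luck alone
```

The gap between the simulated winner and the naive 0.5 baseline reflects selection of the maximum, not skill, which is exactly why the full distribution of contestants' results matters.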

The competition page reads: "The results of this contest could have a big impact on the finance industry." I find that quite scary, given the limited scope of the data, the evaluation metric, and the focus on the top results rather than the entire distribution.