Thursday, September 30, 2010

Neat data mining competition; strange rule?

I received notice of an upcoming data mining competition run by the Direct Marketing Association. The task is to predict magazine sales volume at 10,000 newsstands, using real data provided by CMP and Experian. The goal is officially stated as:
The winner will be the contestant who is able to best predict final store sales given the number of copies placed (draw) in each store. (Best will be defined as the root mean square error between the predicted and final sales.)
Among the usual competition rules about obtaining the data, evaluation criteria, etc., I found an odd rule stating: P.S. PARTICIPANTS MAY NOT INCLUDE ANY OTHER EXTERNAL VARIABLES FOR THE CHALLENGE. [caps in original]

It is surprising that contestants are not allowed to supplement the competition data with other, possibly relevant, information! In fact, "business intelligence" is often achieved by combining unexpected pieces of information. Clearly, the only type of information that should be allowed is information that is available at the time of prediction. For instance, although weather is likely to affect sales, it is a coincident indicator and would itself have to be forecast before it could be used as a predictor. Hence, the weather at the time of sale should not be used, but perhaps the weather forecast could be (the lag between the time of prediction and the period being forecast must, of course, be practical).
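To make the "available at the time of prediction" point concrete, here is a minimal sketch of how external data could be merged in without leakage. The tables and column names (store_id, issued_date, forecast_temp, etc.) are entirely hypothetical and not part of the competition data; the only point is that we join in a forecast only if it was already issued when predictions are due, rather than the realized weather.

```python
import pandas as pd

# Hypothetical data: stores/dates for which sales must be predicted,
# and weather forecasts, each stamped with the date it was issued.
predictions_due = pd.DataFrame({
    "store_id": [1, 2],
    "sale_date": pd.to_datetime(["2010-10-15", "2010-10-15"]),
})
forecasts = pd.DataFrame({
    "store_id": [1, 1, 2],
    "target_date": pd.to_datetime(["2010-10-15"] * 3),
    "issued_date": pd.to_datetime(["2010-10-10", "2010-10-14", "2010-10-14"]),
    "forecast_temp": [18.0, 21.0, 16.5],
})

prediction_date = pd.Timestamp("2010-10-12")  # when predictions must be submitted

# Keep only forecasts already issued at prediction time (no leakage of
# realized or future information), then take the most recent one per store.
usable = (forecasts[forecasts["issued_date"] <= prediction_date]
          .sort_values("issued_date")
          .groupby("store_id", as_index=False)
          .last())

features = predictions_due.merge(
    usable[["store_id", "forecast_temp"]], on="store_id", how="left")
print(features)
```

In this made-up example, store 2's only forecast was issued after the prediction date, so it simply gets no weather feature; that is the price of keeping the predictor legitimate.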

For details and signing up, see http://www.hearstchallenge.com.

Saturday, September 04, 2010

Forecasting stock prices? The new INFORMS competition

Image from www.lumaxart.com
The 2010 INFORMS Data Mining Contest is underway. This time the goal is to predict 5-minute stock prices. That's right - forecasting stock prices! In my view, the meta-contest is going to be the most interesting part. By meta-contest I mean looking beyond the winning result (what method, what prediction accuracy) and examining the distribution of prediction accuracies across all the contestants, how the winner is chosen, and, most importantly, how the winning result will be interpreted in terms of what it implies about the predictability of stocks.

Why is a stock prediction competition interesting? Because according to the Efficient Market Hypothesis (EMH), stocks and other traded assets follow random walks (no autocorrelation between consecutive price changes). In other words, they are unpredictable. Even if there is a low level of autocorrelation, the bid-offer spread and transaction costs make stock predictions worthless. I've been fascinated with how quickly and drastically the Wikipedia page on the Efficient Market Hypothesis has changed in recent years (see the page history). The proponents of the EMH seem to be competing with its opponents in revising the page. As of today, the opponents are ahead in terms of editing the page -- perhaps the recent crisis is giving them an advantage.
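To illustrate what "random walk" means in practice, here is a small simulation of my own (not related to the contest data): generate independent returns, build a price path from them, and check that the lag-1 autocorrelation of the returns is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a random walk: price changes (returns) are independent noise.
returns = rng.normal(loc=0.0, scale=0.01, size=10_000)
prices = 100 * np.exp(np.cumsum(returns))  # a price path built from the returns

# Lag-1 autocorrelation of the returns; under a random walk it should be ~0.
lag1_autocorr = np.corrcoef(returns[:-1], returns[1:])[0, 1]
print(f"lag-1 autocorrelation of returns: {lag1_autocorr:.4f}")

# Predicting the direction of the next change from the current one
# therefore does no better than a coin flip.
same_sign = np.mean(np.sign(returns[1:]) == np.sign(returns[:-1]))
print(f"fraction of consecutive changes with the same sign: {same_sign:.3f}")
```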

The contest's evaluation page explains that the goal is to forecast whether the stock price will increase or decrease in the next time period. Entries will then be evaluated in terms of the average AUC (area under the ROC curve). Defining the problem as a binary prediction problem and using the AUC to evaluate the results adds a further challenge: the average AUC has various flaws as a measure of predictive accuracy. In a recent article in the journal Machine Learning, the well-known statistician Prof. David Hand shows that, in addition to other deficiencies, "...the AUC uses different misclassification cost distributions for different classifiers."
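For readers unfamiliar with the metric, here is a minimal sketch of how an AUC would be computed for this kind of binary up/down prediction task. The labels and scores below are made up; scikit-learn's roc_auc_score is one standard implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up example: 1 = price went up in the next interval, 0 = it did not.
actual_direction = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# A contestant submits a score (e.g., predicted probability of an increase).
predicted_score = np.array([0.7, 0.4, 0.35, 0.8, 0.6, 0.3, 0.55, 0.45])

# AUC = probability that a randomly chosen "up" interval receives a higher
# score than a randomly chosen "down" interval; 0.5 corresponds to guessing.
print("AUC:", roc_auc_score(actual_direction, predicted_score))  # 0.75 here
```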

In any case, among the many participants in the competition there is going to be a winner. And that winner will have the highest prediction accuracy for that stock, at least in the sense of average AUC. No uncertainty about that. But will that mean that the winning method is the magic bullet for traders? Most likely not. Or, at least, I would not be convinced until I saw the method consistently outperform a random walk across a large number of stocks and different time periods. For one, I would want to see the distribution of results across the entire set of participants and compare it to that of a naive classifier, to evaluate how "lucky" the winner was.
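To see why the winner's headline AUC can reflect luck rather than skill, here is a small simulation of my own (the numbers of contestants and intervals are made up): many "participants" who guess completely at random, with the best of them declared the winner.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

n_intervals = 500      # number of 5-minute intervals being predicted (made up)
n_participants = 300   # number of contestants (made up)

# True up/down labels for the test period.
actual = rng.integers(0, 2, size=n_intervals)

# Every participant submits purely random scores: no skill at all.
aucs = np.array([
    roc_auc_score(actual, rng.random(n_intervals))
    for _ in range(n_participants)
])

print(f"mean AUC of random guessers: {aucs.mean():.3f}")  # about 0.5
print(f"best ('winning') AUC:        {aucs.max():.3f}")   # noticeably above 0.5
```

The best of several hundred random guessers will beat 0.5 by a comfortable margin, which is exactly why the full distribution of contestants' results matters more than the single winning number.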

The competition page reads: "The results of this contest could have a big impact on the finance industry." I find that quite scary, given the limited scope of the data, the evaluation metric, and the focus on the top results rather than the entire distribution.