
Saturday, September 04, 2010

Forecasting stock prices? The new INFORMS competition

Image from www.lumaxart.com
The 2010 INFORMS Data Mining Contest is underway. This time the goal is to predict stock prices at 5-minute intervals. That's right - forecasting stock prices! In my view, the meta-contest is going to be the most interesting part. By meta-contest I mean looking beyond the winning result (what method, what prediction accuracy) and examining the distribution of prediction accuracies across all the contestants, how the winner is chosen, and most importantly, how the winning result will be interpreted as evidence about the predictability of stocks.

Why is a stock prediction competition interesting? Because according to the Efficient Market Hypothesis (EMH), stocks and other traded assets are random walks (no autocorrelation between consecutive price jumps). In other words, they are unpredictable. Even if there is a low level of autocorrelation, the bid-offer spread and transaction costs make stock predictions worthless. I've been fascinated with how quickly and drastically the Wikipedia page on the Efficient Market Hypothesis has changed in recent years (see the page history). The proponents of the EMH seem to be competing with its opponents in revising the page. As of today, the opponents are ahead in terms of editing the page -- perhaps the recent crisis is giving them an advantage.

The contest's evaluation page explains that the goal is to forecast whether the stock price will increase or decrease in the next time period. Entries will then be evaluated by their average AUC (area under the ROC curve). Defining the problem as binary prediction and using the AUC to evaluate the results introduces a further challenge: the average AUC has various flaws as a measure of predictive accuracy. In a recent article in the journal Machine Learning, the well-known statistician Prof David Hand shows that, in addition to other deficiencies, "...the AUC uses different misclassification cost distributions for different classifiers."
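
For concreteness, here is a minimal sketch of how an average AUC across several stocks could be computed. It is written in Python and assumes scikit-learn is available; the stock labels and toy data are made up for illustration, not contest data.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy data: for each of three hypothetical stocks, the actual direction of the
# next 5-minute move (1 = up, 0 = down) and a model's predicted probability of
# an upward move.
actual = {s: rng.integers(0, 2, size=200) for s in ["stock_A", "stock_B", "stock_C"]}
predicted = {s: rng.uniform(0, 1, size=200) for s in actual}

# AUC per stock, then the simple average used as the overall score
aucs = {s: roc_auc_score(actual[s], predicted[s]) for s in actual}
average_auc = np.mean(list(aucs.values()))
print(aucs)
print("average AUC:", round(average_auc, 3))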

In any case, among the many participants in the competition there is going to be a winner. And that winner will have the highest prediction accuracy for that stock, at least in the sense of average AUC. No uncertainty about that. But will that mean that the winning method is the magic bullet for traders? Most likely not. Or, at least, I would not be convinced until I saw the method consistently outperform a random walk across a large number of stocks and different time periods. For one, I would want to see the distribution of results of the entire set of participants and compare it to a naive classifier to evaluate how "lucky" the winner was.
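
As a back-of-the-envelope illustration of the "luck" issue, here is a small simulation showing how high the best AUC among many purely random classifiers can get. The number of entries and of scored periods are arbitrary assumptions (Python, scikit-learn):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n_contestants = 500   # assumed number of entries
n_periods = 250       # assumed number of scored 5-minute periods

# One fixed sequence of actual up/down moves
actual = rng.integers(0, 2, size=n_periods)

# Each "contestant" submits pure noise as predictions
best_auc = max(
    roc_auc_score(actual, rng.uniform(0, 1, size=n_periods))
    for _ in range(n_contestants)
)
print("best AUC among", n_contestants, "random entries:", round(best_auc, 3))
# Even with zero predictive ability, the winning AUC lands well above 0.5.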

The competition page reads: "The results of this contest could have a big impact on the finance industry." I find that quite scary, given the limited scope of the data, the evaluation metric, and the focus on the top results rather than the entire distribution.

Tuesday, March 10, 2009

What R-squared is (and is not)

R-squared (aka "coefficient of determination", or for short, R2) is a popular measure used in linear regression to assess the strength of the linear relationship between the inputs and the output. In a model with a single input, R2 is simply the squared correlation coefficient between the input and output.
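
That equivalence is easy to verify numerically. Here is a quick sketch (Python with numpy and scikit-learn; the simulated data are arbitrary):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 3 + 2 * x + rng.normal(size=100)    # a linear relationship plus noise

# R2 reported by a simple linear regression of y on x
r2 = LinearRegression().fit(x.reshape(-1, 1), y).score(x.reshape(-1, 1), y)

# Squared correlation coefficient between x and y
r2_from_corr = np.corrcoef(x, y)[0, 1] ** 2

print(round(r2, 6), round(r2_from_corr, 6))    # the two values match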

If you examine a few textbooks in statistics or econometrics, you will find several definitions of R2. The most common definition is "the percent of variation in the output (Y) explained by the inputs (X's)". Another definition is "a measure of predictive power" (check out Wikipedia!). And finally, R2 is often called a goodness-of-fit measure. Try a quick Google search of "R-squared" and "goodness of fit". I even discovered this Journal of Econometrics article entitled An R-squared measure of goodness of fit for some common nonlinear regression models.

The first definition is correct, although it might sound overly complicated to a non-statistical ear. Nevertheless, it is correct.

As to R2 being a predictive measure, this is an unfortunately popular misconception. There are several problems with R2 that make it a poor predictive accuracy measure:
  1. R2 always increases as you add inputs, whether they contain useful information or not. This technical inflation in R2 is usually overcome by using an alternative metric (R2-adjusted), which penalizes R2 for the number of inputs included (see the sketch after this list).

  2. R2 is computed from a given sample of data that was used for fitting the linear regression model. Hence, it is "biased" towards those data and is therefore likely to be over-optimistic in measuring the predictive power of the model on new data. This is part of a larger issue related to performance evaluation metrics: the best way to assess the predictive power of a model is to test it on new data. To see more about this, check out my recent working paper "To Explain or To Predict?"
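
Here is a minimal simulation illustrating both points: in-sample R2 keeps creeping up as pure-noise inputs are added, while R2-adjusted and R2 computed on new data do not reward them. The sample sizes and number of noise inputs are arbitrary choices (Python, scikit-learn):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n_train, n_test, n_inputs = 100, 100, 21

# One genuinely informative input (column 0) plus 20 pure-noise inputs
X_train = rng.normal(size=(n_train, n_inputs))
X_test = rng.normal(size=(n_test, n_inputs))
y_train = 2 * X_train[:, 0] + rng.normal(size=n_train)
y_test = 2 * X_test[:, 0] + rng.normal(size=n_test)

for k in (1, 5, 10, 20):    # number of inputs used in the model
    model = LinearRegression().fit(X_train[:, :k], y_train)
    r2_in = model.score(X_train[:, :k], y_train)     # in-sample R2
    r2_new = model.score(X_test[:, :k], y_test)      # R2 on new data
    r2_adj = 1 - (1 - r2_in) * (n_train - 1) / (n_train - k - 1)   # adjusted R2
    print(k, "inputs:", round(r2_in, 3), round(r2_adj, 3), round(r2_new, 3))
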
Finally, the popular labeling of R2 as a goodness-of-fit measure is, in fact, incorrect. This was pointed out to me by Prof. Jeff Simonoff from NYU. R2 does not measure how well the data fit the linear model but rather how strong the linear relationship is. Jeff calls it a strength-of-fit measure.

Here's a cool example (thanks Jeff!): If you simulate two columns of uncorrelated normal variables and then fit a regression to the resulting pairs (call them X and Y), you will get a very low R2 (practically zero). This indicates that there is no linear relationship between X and Y. However, the model being fitted is actually a regression of Y on X with a slope of zero. In that sense, the data do fit the zero-slope model very well, yet R2 tells us nothing of this good fit.
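
A quick way to reproduce that example (Python; the sample size and seed are arbitrary):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
x = rng.normal(size=1000)
y = rng.normal(size=1000)    # generated independently of x, so the true slope is zero

model = LinearRegression().fit(x.reshape(-1, 1), y)
r2 = model.score(x.reshape(-1, 1), y)
print("estimated slope:", round(model.coef_[0], 3), " R2:", round(r2, 4))
# The slope estimate is near zero and R2 is practically zero, even though
# the zero-slope model describes these data very well.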

Tuesday, October 07, 2008

Sensitivity, specificity, false positive and false negative rates

I recently had an interesting discussion with a few colleagues in Korea regarding the definition of false positive and false negative rates and their relation to sensitivity and specificity. Apparently there is real confusion out there, and if you search the web you'll find conflicting information. So let's sort this out:

Let's assume we have a dataset of bankrupt and solvent firms. We now want to evaluate the performance of a certain model for predicting bankruptcy. Clearly here, the important class is "bankrupt", as the consequences of misclassifying bankrupt firms as solvent are heavier than misclassifying solvent firms as bankrupt. We organize the data in a confusion matrix (aka classification matrix) that crosses actual firm status with predicted status (generated by the model). Say this is the matrix:

                      Predicted bankrupt    Predicted solvent
Actual bankrupt              201                    85
Actual solvent                25                  2689

In our textbook Data Mining for Business Intelligence we treat the four metrics as two pairs, {sensitivity, specificity} and {false positive rate, false negative rate}, each pair measuring a different aspect. Sensitivity and specificity measure the ability of the model to correctly detect the important class (=sensitivity) and its ability to correctly rule out the unimportant class (=specificity). This definition is apparently not controversial. In the example, the sensitivity would be 201/(201+85) = the proportion of bankrupt firms that the model accurately detects. The model's specificity here is 2689/(2689+25) = the proportion of solvent firms that the model accurately "rules out".

Now to the controversy: We define the false positive rate as the proportion of non-important class cases incorrectly classified as important, among all cases predicted as important. In the example the false positive rate would be 25/(201+25). Similarly, the false negative rate is the proportion of important class cases incorrectly classified as non-important, among all cases predicted as non-important (=85/(85+2689)). My colleagues, however, disagreed with this definition. According to their definition, false positive rate = 1-specificity, and false negative rate = 1-sensitivity.
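
For concreteness, here is a small computation of both pairs of metrics, using the counts from the confusion matrix above, under the two competing definitions of the false positive and false negative rates (Python):

# Counts from the confusion matrix above
tp = 201     # bankrupt firms predicted bankrupt
fn = 85      # bankrupt firms predicted solvent
fp = 25      # solvent firms predicted bankrupt
tn = 2689    # solvent firms predicted solvent

sensitivity = tp / (tp + fn)        # 201/286   ≈ 0.703
specificity = tn / (tn + fp)        # 2689/2714 ≈ 0.991

# Definitions in our textbook: conditioned on the predicted class
fp_rate_textbook = fp / (fp + tp)   # 25/226    ≈ 0.111
fn_rate_textbook = fn / (fn + tn)   # 85/2774   ≈ 0.031

# My colleagues' definitions: conditioned on the actual class
fp_rate_alt = 1 - specificity       # 25/2714   ≈ 0.009
fn_rate_alt = 1 - sensitivity       # 85/286    ≈ 0.297

print(sensitivity, specificity)
print(fp_rate_textbook, fn_rate_textbook, fp_rate_alt, fn_rate_alt)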

And indeed, if you search the web you will find conflicting definitions of false positive and negative rates. However, I claim that our definitions are the correct ones. A nice explanation of the difference between the two pairs of metrics is given on p.37 of Chatterjee et al.'s textbook A Casebook for a First Course in Statistics and Data Analysis (a very neat book for beginners, with all ancillaries on Jeff Simonoff's page):

Consider... HIV testing. The standard test is the Wellcome Elisa test. For any diagnostic test...
(1) sensitivity = P(positive test result | person is actually HIV positive)
(2) specificity = P(negative test result | person is actually not HIV positive)

... the sensitivity of the Elisa test is approximately .993 (so only .7% of people who are truly HIV positive would have a negative test result), while the specificity is approximately .9999 (so only .01% of the people who are truly HIV-negative would have a positive test result).

That sounds pretty good. However, these are not the only numbers to consider when evaluating the appropriateness of random testing. A person who tests positive is interested in a different conditional probability: P(person is actually HIV-positive | a positive test result). That is, what proportion of people who test positive actually are HIV positive? If the incidence of the disease is low, most positive results could be false positives.
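
To see why a low incidence matters, here is a quick Bayes'-rule calculation. The sensitivity and specificity are taken from the quote; the prevalence of 1 in 10,000 is an assumed value for illustration only (Python):

sensitivity = 0.993     # P(positive test | actually HIV positive), from the quote
specificity = 0.9999    # P(negative test | actually not HIV positive), from the quote
prevalence = 0.0001     # assumed: 1 in 10,000 people tested are actually HIV positive

# Bayes' rule: P(actually HIV positive | positive test result)
p_positive_test = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
ppv = sensitivity * prevalence / p_positive_test
print(round(ppv, 2))    # about 0.50: roughly half the positive results are false positives
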
My colleague Lele at UMD also pointed out that this confusion has caused some havoc in the field of Education as well. Here is a paper that proposes to go as far as creating two separate confusion matrices and using lower and upper case notations to avoid the confusion!

Convinced?