Thursday, February 28, 2008

Forecasting with econometric models

Here's another interesting example where explanatory and predictive tasks lead to different models: econometric models. These are essentially regression models of the form:

Y(t) = beta0 + beta1 Y(t-1) + beta2 X(t) + beta3 X(t-1) + beta4 Z(t-1) + noise

An example would be forecasting Y(t) = consumer spending at time t, where the input variables can be consumer spending in previous time periods and/or other information that is available at time t or earlier.

In economics, when Y(t) is the state of the economy at time t, there is a distinction between three types of variables (aka "indicators"): leading, coincident, and lagging. Leading indicators are those that change before the economy changes (e.g., the stock market); coincident indicators change during the period when the economy changes (e.g., GDP); and lagging indicators change after the economy changes (e.g., unemployment). (See about.com.)

This distinction is especially revealing when we consider the difference between building an econometric model for the purpose of explaining vs. forecasting. For explaining, you can have both leading and coincident variables as inputs. However, if the purpose is forecasting, the values of coincident variables at time t are not yet known when the forecast is made, so including them requires one to forecast them before they can be used to forecast Y(t). An alternative is to lag those variables and include them only in leading-indicator format, e.g., X(t-1) rather than X(t).
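To make the difference concrete, here is a minimal sketch in Python (using NumPy) of the two versions of the regression above. Everything in it is an assumption for illustration: the data are simulated, the coefficients are arbitrary, and the function names (build_design, fit_ols, forecast_next) are hypothetical. The explanatory model uses X(t); the forecasting model drops it and keeps only lagged inputs.

```python
import numpy as np

def build_design(y, x, z, purpose):
    """Build the regression inputs for times t = 1..n-1.

    purpose="explain":  columns are 1, Y(t-1), X(t), X(t-1), Z(t-1)
    purpose="forecast": columns are 1, Y(t-1), X(t-1), Z(t-1)
                        (X(t) is dropped: it is unknown when forecasting Y(t))
    """
    y, x, z = (np.asarray(v, dtype=float) for v in (y, x, z))
    target = y[1:]                       # Y(t)
    cols = [np.ones(len(target)), y[:-1]]  # intercept, Y(t-1)
    if purpose == "explain":
        cols.append(x[1:])               # coincident input X(t)
    elif purpose != "forecast":
        raise ValueError("purpose must be 'explain' or 'forecast'")
    cols += [x[:-1], z[:-1]]             # lagged inputs X(t-1), Z(t-1)
    return np.column_stack(cols), target

def fit_ols(A, target):
    """Ordinary least squares coefficients."""
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef

def forecast_next(coef, y, x, z):
    """One-step-ahead forecast of the next Y from the last observed values."""
    return float(coef @ np.array([1.0, y[-1], x[-1], z[-1]]))

# Simulated data (illustrative only; coefficients are made up)
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
z = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = (1.0 + 0.5 * y[t - 1] + 0.8 * x[t] + 0.3 * x[t - 1]
            + 0.4 * z[t - 1] + rng.normal(scale=0.1))

A_explain, target = build_design(y, x, z, "explain")
A_forecast, _ = build_design(y, x, z, "forecast")

coef_explain = fit_ols(A_explain, target)    # uses X(t)
coef_forecast = fit_ols(A_forecast, target)  # lagged inputs only

next_y = forecast_next(coef_forecast, y, x, z)
print("explanatory coefficients:", np.round(coef_explain, 2))
print("forecasting coefficients:", np.round(coef_forecast, 2))
print("forecast of next Y:", round(next_y, 2))
```

Note that only the forecasting model can be applied at the end of the series: the explanatory model would need the next value of X, which has not been observed yet.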

I found a neat example of a leading indicator on thefreedictionary.com: The "Leading Lipstick Indicator" is based on the theory that a consumer turns to less-expensive indulgences, such as lipstick, when she (or he) feels less than confident about the future. Therefore, lipstick sales tend to increase during times of economic uncertainty or a recession. This term was coined by Leonard Lauder (chairman of Estee Lauder), who consistently found that during tough economic times, his lipstick sales went up. Believe it or not, the indicator has been quite a reliable signal of consumer attitudes over the years. For example, in the months following the Sept 11 terrorist attacks, lipstick sales doubled.

Tuesday, February 26, 2008

Data mining competition season

Those who've been following my postings probably recall "competition season" when all of a sudden there are multiple new interesting datasets out there, each framing a business problem that requires the combination of data mining and creativity.

Two such competitions are the SAS Data Mining Shootout and the 2008 Neural Forecasting Competition (NN5). The SAS problem concerns revenue management for an airline that wants to improve its customer satisfaction. The NN5 competition is about forecasting cash withdrawals from ATMs.

Here are the similarities between the two competitions: they both provide real data and reasonably realistic business problems. Now to a more interesting similarity: they both have time series forecasting tasks. From a recent survey on the popularity of types of data mining techniques, it appears that time series are becoming more and more prominent. They also both require registration in order to get access to the data (I didn't compare their terms of use, but that's another interesting comparison), and welcome any type of modeling. Finally, they are both tied to a conference, where competitors can present their results and methods.

What would be really nice is if, as in KDD, the winners' papers were published online and made publicly available.