Thursday, March 01, 2007

Lots of real time series data!

I love data-mining and statistics competitions - they always provide great real data! However, the big difference between a gold mine and "just some data" is whether the description of the data and of their context is complete. This reflects, in my opinion, the difference between "data mining for the purpose of data mining" and "data mining for business analytics" (or any other field of interest, such as engineering or biology).

Last year, BICUP2006 posted an interesting dataset on bus ridership in Santiago de Chile. Although there was a reasonable description of the data (number of passengers at a bus station at 15-minute intervals), there was no information on the actual context of the problem. The goal of the competition was to accurately forecast the 3 days following the given data. Although this has its challenges, the main question is whether a method that accurately predicts these 3 days would be useful for the Santiago Transportation Bureau, or anyone else outside of the competition. For instance, the training data covered 3 weeks, with a pronounced weekday/weekend effect. However, the prediction set includes only 3 weekdays. A method that accurately predicts weekdays might suffer on weekends. It is therefore imperative to state the final goal of the analysis. Will this forecaster be used to assist in bus scheduling on weekdays only? during rush hours only? How accurate do the forecasts need to be for practical use? Maybe a really simple model predicts accurately enough for the purpose at hand.
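To make the weekday/weekend concern concrete, here is a minimal Python sketch. It is not the competition data or any contestant's method: the passenger counts are synthetic, the function names are my own, and the "really simple model" is a seasonal naive forecast (repeat the last day, or the last week). The point is only that scoring weekdays and weekends separately can reveal a weakness that a weekday-only test set would hide.

```python
from statistics import mean

SLOTS_PER_DAY = 96  # number of 15-minute intervals in one day


def synthetic_counts(n_days):
    """Made-up passenger counts (an assumption for illustration only).

    Day 0 is a Monday; days 5 and 6 of each week are the weekend.
    """
    series = []
    for day in range(n_days):
        is_weekend = day % 7 >= 5
        for slot in range(SLOTS_PER_DAY):
            base = 48 - abs(slot - 48)  # triangular profile peaking at midday
            series.append(base * (0.3 if is_weekend else 1.0))
    return series


def seasonal_naive(history, horizon, period):
    """Forecast by repeating the last `period` observations."""
    last = history[-period:]
    return [last[i % period] for i in range(horizon)]


def mae_by_day_type(actual, forecast, first_day_index):
    """Mean absolute error, reported separately as (weekday, weekend)."""
    weekday_err, weekend_err = [], []
    for i, (a, f) in enumerate(zip(actual, forecast)):
        day = first_day_index + i // SLOTS_PER_DAY
        bucket = weekend_err if day % 7 >= 5 else weekday_err
        bucket.append(abs(a - f))
    return mean(weekday_err), mean(weekend_err)


series = synthetic_counts(24)
train_days = 17  # training data ends on a Wednesday
train = series[: train_days * SLOTS_PER_DAY]
test = series[train_days * SLOTS_PER_DAY:]  # the following 7 days

# "Repeat yesterday": fine on weekdays here, poor on weekends.
daily = seasonal_naive(train, len(test), SLOTS_PER_DAY)
daily_weekday_mae, daily_weekend_mae = mae_by_day_type(test, daily, train_days)

# "Repeat last week": captures the weekday/weekend pattern.
weekly = seasonal_naive(train, len(test), 7 * SLOTS_PER_DAY)
weekly_weekday_mae, weekly_weekend_mae = mae_by_day_type(test, weekly, train_days)

print("daily naive  (weekday, weekend):", daily_weekday_mae, daily_weekend_mae)
print("weekly naive (weekday, weekend):", weekly_weekday_mae, weekly_weekend_mae)
```

On this toy series the "repeat yesterday" forecast looks perfect if it is judged on weekdays alone, which is exactly why the intended use of the forecaster matters when choosing the evaluation set.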

Another such instance is the upcoming NN3 Forecasting Competition (as part of the 2007 International Symposium on Forecasting). The dataset includes 111 time series, varying in length (about 40-140 time points). However, I have not found any description of either the data or the context. In reality we would always know at least the time frequency: are these measurements taken every second? minute? day? month? year? This information is obviously important for determining factors like seasonality and which methods are appropriate.
To download the data and see a few examples, you will need to register your email.
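Without a stated frequency, the best one can do is guess the seasonal period from the numbers themselves. The Python sketch below shows one rough heuristic, picking the lag with the highest sample autocorrelation. It is my own illustration on a synthetic series, not something provided by the NN3 organizers, and the function names are hypothetical. It can easily be fooled by trend or by short series, which is precisely why a proper data description is preferable.

```python
import math


def autocorrelation(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x)
    num = sum((x[t] - m) * (x[t + lag] - m) for t in range(n - lag))
    return num / denom


def guess_period(x, max_lag):
    """Return the lag in 2..max_lag with the highest autocorrelation.

    Lag 1 is skipped because smooth series are strongly correlated
    with their immediate past regardless of any seasonality.
    """
    return max(range(2, max_lag + 1), key=lambda lag: autocorrelation(x, lag))


# Synthetic series (an assumption for illustration): 120 points, cycle of 12.
toy_series = [math.sin(2 * math.pi * t / 12) for t in range(120)]
guessed = guess_period(toy_series, max_lag=24)
print("guessed seasonal period:", guessed)
```

Even when such a guess comes out as 12, it does not say whether the data are monthly with a yearly cycle or something else entirely; that still has to come from whoever collected the data.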

An example of a gold mine is the T-competition, which concentrates on forecasting transportation data. In addition to the large number of series (of varying lengths and at frequencies ranging from daily to yearly), there is a solid description of what each series is, along with the actual dates of measurement. They even include a set of seasonal indexes for each series. The data come from an array of transportation measurements in both Europe and North America.
