Thursday, September 30, 2010

Neat data mining competition; strange rule?

I received notice of an upcoming data mining competition by the Direct Marketing Association. The task is to predict sales volume of magazines at 10,000 newsstands, using real data provided by CMP and Experian. The goal is officially stated as:
The winner will be the contestant who is able to best predict final store sales given the number of copies placed (draw) in each store. (Best will be defined as the root mean square error between the predicted and final sales.)
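
To make the evaluation criterion concrete, here is a minimal sketch of root mean square error in Python. The function name and the three sales figures are made up for illustration and are not taken from the competition data; a lower score means a better prediction.

```python
import math

def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical numbers for three stores (not from the competition data).
predicted_sales = [12, 7, 30]
final_sales = [10, 7, 34]
score = rmse(predicted_sales, final_sales)
print(round(score, 3))
```

Because the errors are squared before averaging, a single store with a large miss weighs more heavily on the score than several stores with small misses.
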
Among the usual competition rules about obtaining the data, evaluation criteria, etc., I found an odd rule stating: P.S. PARTICIPANTS MAY NOT INCLUDE ANY OTHER EXTERNAL VARIABLES FOR THE CHALLENGE. [caps are original]

It is surprising that contestants are not allowed to supplement the competition data with other, possibly relevant, information! In fact, "business intelligence" is often achieved by combining unexpected pieces of information. Clearly, the only external information that should be allowed is information that is available at the time of prediction. For instance, although weather is likely to affect sales, it is a coincident indicator: it is observed only at the time of sale, so it must itself be forecasted before it can be included as a predictor. Hence, the weather at the time of sale should not be used, but perhaps the weather forecast can (the time lag between the time of prediction and the period being forecasted must, of course, be practical).
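
The availability rule can be sketched as a simple filter: keep a candidate predictor only if its value is known on or before the day the prediction is made. The variable names and dates below are hypothetical and chosen only to mirror the weather example; this is an illustration, not part of the competition rules.

```python
from datetime import date

def usable_predictors(candidates, prediction_date):
    """Keep only variables whose values are known by prediction_date."""
    return [
        name
        for name, available_on in candidates.items()
        if available_on <= prediction_date
    ]

# Hypothetical dates: when each variable's value becomes known.
prediction_date = date(2010, 10, 1)
candidates = {
    "draw": date(2010, 9, 30),              # copies placed, known in advance
    "weather_forecast": date(2010, 10, 1),  # forecast issued on prediction day
    "actual_weather": date(2010, 10, 15),   # observed only at the time of sale
}
usable = usable_predictors(candidates, prediction_date)
print(usable)
```

In this sketch the forecast passes the filter while the actual weather does not, since the latter is observed only after the prediction has to be made.
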

For details and signing up, see http://www.hearstchallenge.com.
