Tuesday, February 14, 2006

Data partitioning

A central initial step in data mining is to partition the data into two or three parts. The first part is called the training set, the second is the validation set, and if there is a third, it is usually called the test set.
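
Here is a rough sketch of such a partition in Python. It assumes the data sit in a pandas DataFrame read from a hypothetical file, and uses scikit-learn's train_test_split twice; the 50/30/20 proportions are only illustrative.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("mydata.csv")  # hypothetical file name

    # First carve out the training set (50% of the rows).
    train, rest = train_test_split(df, train_size=0.5, random_state=1)

    # Split the remainder into validation (30%) and test (20%) sets.
    validation, test = train_test_split(rest, train_size=0.6, random_state=1)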

The purpose of data partitioning is to enable evaluation of a model's predictive performance. In contrast to an explanatory goal, where we want to fit the data as closely as possible, good predictive models are those that have high predictive accuracy on new data. Now, if we fit a model to data, then obviously the "tighter" the model, the better it will predict those data. But what about new data? How well will the model predict those?

Predictive models are different from explanatory models in various respects, but let's focus only on performance evaluation here. The usual indications of good model fit, such as a high R-squared or a low standard error of estimate, measure how well the model fits the data it was estimated from; they do not measure predictive accuracy.

So how does partitioning help measure predictive performance? The training set is first used to fit a model (also called "training the model"). The validation set is then used to evaluate the model's performance on new data that it did not "see". At this stage we compare the model's predictions for the validation data to the actual values and use various metrics to quantify predictive accuracy.
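
Continuing the sketch above, the following illustrates the two roles of the partitions, using a linear regression purely as a stand-in model; the outcome column y, the predictor columns x1 and x2, and RMSE as the accuracy metric are all just assumptions for illustration.

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score

    X_train, y_train = train[["x1", "x2"]], train["y"]
    X_valid, y_valid = validation[["x1", "x2"]], validation["y"]

    model = LinearRegression().fit(X_train, y_train)   # "train the model"

    # Goodness of fit: R-squared on the data the model has already seen.
    print("training R-squared:", r2_score(y_train, model.predict(X_train)))

    # Predictive accuracy: compare predictions for the unseen validation
    # data with the actual values, here via RMSE.
    rmse = mean_squared_error(y_valid, model.predict(X_valid)) ** 0.5
    print("validation RMSE:", rmse)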

Sometimes, we actually use the validation set to tweak the original model. In other words, after seeing how the model performed on the validation data, we might go back and change the model. In that case we are "using" our validation data, and the model is no longer blind to them. This is where the third partition, the test set, comes in handy. The final evaluation of predictive performance is then achieved by applying the model (which is based on the training data and tweaked using the validation data) to the test data that it never "saw".
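
One way such "tweaking" might look in code, again continuing the sketch above: choose between two hypothetical candidate models based on their validation accuracy, and only then report the chosen model's accuracy on the untouched test set.

    from sklearn.tree import DecisionTreeRegressor

    candidates = {
        "linear": LinearRegression(),
        "tree": DecisionTreeRegressor(max_depth=3, random_state=1),
    }

    # Fit every candidate on the training data, then pick the one with the
    # lowest error on the validation data (the "tweaking" step).
    fitted = {name: m.fit(X_train, y_train) for name, m in candidates.items()}
    best = min(fitted,
               key=lambda n: mean_squared_error(y_valid, fitted[n].predict(X_valid)))

    # Final, honest evaluation on data the chosen model never "saw".
    X_test, y_test = test[["x1", "x2"]], test["y"]
    test_rmse = mean_squared_error(y_test, fitted[best].predict(X_test)) ** 0.5
    print(best, "test RMSE:", test_rmse)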
