Friday, February 12, 2010

Over-fitting analogies

To explain the danger of model over-fitting in prediction to data mining newcomers, I often use the following analogy:
Say you are at the tailor's, who will be sewing an expensive suit (or dress) for you. The tailor takes your measurements and asks whether you'd like the suit to fit you exactly, or whether there should be some "wiggle room". What would you choose?
The answer is, "it depends on how you plan to use the suit". If you are getting married in a few days, then a close fit is probably desirable. In contrast, if you plan to wear the suit to work over the next few years, you'd most likely want some "wiggle room". The latter case is similar to prediction, where you want to make sure the model accommodates new records (your body's measurements during the next few years) that are not identical to the current data. Hence, you want to avoid over-fitting. The wedding scenario is similar to models built for causal explanation, where you do want the model to fit the data at hand well (back to the explanation vs. prediction distinction).
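For readers who like to see the numbers, here is a minimal sketch of the "exact fit vs. wiggle room" choice. Everything in it is an assumption made up for illustration: the data are simulated from a straight line plus noise, the "exact fit" model is a 1-nearest-neighbor predictor that memorizes the training records, and the "wiggle room" model averages the 15 nearest neighbors. The helper names (knn_predict, mse, make_data) are hypothetical and not taken from any library.

```python
import random

def knn_predict(train, x, k):
    """Average the y-values of the k training points whose x is closest."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of k-nearest-neighbor predictions on `data`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

def make_data(rng, n):
    """Simulated records: y = 0.5 * x plus Gaussian noise (an assumed toy process)."""
    points = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        points.append((x, 0.5 * x + rng.gauss(0, 1)))
    return points

rng = random.Random(0)              # fixed seed so the run is repeatable
train = make_data(rng, 300)         # today's measurements
validation = make_data(rng, 1000)   # new records the model has not seen

# k=1 is the suit cut to fit exactly; k=15 leaves some wiggle room.
train_mse_exact = mse(train, train, k=1)
train_mse_loose = mse(train, train, k=15)
val_mse_exact = mse(train, validation, k=1)
val_mse_loose = mse(train, validation, k=15)

print("training error:   exact fit", round(train_mse_exact, 3),
      "| wiggle room", round(train_mse_loose, 3))
print("validation error: exact fit", round(val_mse_exact, 3),
      "| wiggle room", round(val_mse_loose, 3))
```

The exact-fit model reproduces the training data perfectly (its training error is zero), yet it is the looser model that should do better on the validation records, because it has not tailored itself to the noise in one particular sample.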

I just found some nice terminology, by Bruce Ratner (GenIQ.net), explaining the idea of over-fitting:
A model is built to represent a training data... not to reproduce a training data. [Otherwise], a visitor from the validation data will not feel at home. The visitor encounters an uncomfortable fit in the model because s/he probabilistically does not look like a typical data-point from the training data. Thus, the misfit visitor takes a poor prediction.

1 comment:

Anu Gupta said...

This makes so much sense when tied to the revenue management segments of the airline and hotel industries. Quantifying the value added by revenue management in these industries means making predictions and then going back to check how far off each prediction was from what actually happened. Over-fitting the data will lead to low-accuracy predictions, diminishing the value added by many practices (not only revenue management and pricing) in various industries today. Hence, this is a very important issue for the many businesses that rely on data mining.