Monday, March 20, 2006

Data mining for prediction vs. explanation

A colleague of mine and I have an ongoing discussion about what data mining is about. In particular, he claims that data mining focuses on predictive tasks, whereas I see the use of data mining methods for either prediction or explanation.

Predictive vs. Explanatory Tasks
The distinction between predictive and explanatory tasks is not always easy to draw. Of course, in both cases the goal is "future actionable results". In general, an explanatory goal (also called characterizing or profiling) is when we're interested in learning about factors that affect a certain outcome. In contrast, a predictive goal (also called classification, when we have a categorical response) is when we care less about "insights" and more about accurate predictions of future observations.

In some cases it is very easy to identify an explanatory task, because it simply doesn't make any sense to predict the outcome. For example, we might be looking at performance measures of male and female CEOs, in an attempt to understand the differences (in terms of performance) between the genders. In this case, we would most likely not try to predict the gender of a new observation based on their performance measures...

In most cases, however, it is harder to decide what the task is, mainly because problem descriptions are ambiguous. People will say "I'd like you to use these data to see if we can understand the factors that lead to defaulting on a loan, so that we can predict for new applicants their chance to default". Since it is critical for the analysis to know which type of task (explanatory vs. predictive) is required, I believe the only way out is to try to squeeze it out of the domain person with questions such as "what do you plan to do with the analysis results?" or "what if I found that..."

Here's an example: A marketing manager at a cable company hires your services to analyze their customer data in order "to know the household-characteristics that distinguish subscribers to premium channels (such as HBO) from non-subscribers." Is this a predictive goal? explanatory? It depends! Here are possible scenarios of what they will use the analysis for:
1. To market the premium channel plan in a new market only to people who are more likely to subscribe (predictive)
2. To re-negotiate the cable company's contract with HBO based on what they find (explanatory) -- this scenario is courtesy of a current MBA student

How will the analysis differ?
Depending on whether the goal is predictive or explanatory, the analysis process will take different avenues. This includes which performance measures are devised and used (e.g., to partition or not to partition? use an R-squared or a MAPE?), which types of methods are employed (a k-nearest-neighbor model won't have much explanatory power), etc.
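To make the "R-squared or MAPE, partition or not" contrast concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the data are simulated, the model is a simple linear regression, and the 60/40 split and the helper names (fit_line, r_squared, mape) are hypothetical choices, not a prescribed procedure.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(ys, preds):
    """Share of the variance in ys accounted for by preds."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def mape(ys, preds):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((y - p) / y) for y, p in zip(ys, preds)) / len(ys)

# Simulated data (an assumption): y = 2 + 3x + noise
rng = random.Random(0)
xs = [rng.uniform(1, 10) for _ in range(200)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]

# Explanatory flavor: fit on all the data and look at goodness of fit
a_all, b_all = fit_line(xs, ys)
r2_full = r_squared(ys, [a_all + b_all * x for x in xs])

# Predictive flavor: fit on a training partition, score on held-out data
train_x, valid_x = xs[:120], xs[120:]
train_y, valid_y = ys[:120], ys[120:]
a, b = fit_line(train_x, train_y)
valid_mape = mape(valid_y, [a + b * x for x in valid_x])

print(f"R-squared on full data: {r2_full:.3f}")
print(f"MAPE on validation set: {valid_mape:.1f}%")
```

The same fitted line serves both purposes here; what changes is the data it is evaluated on and the measure used to judge it.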


Ravi Bapna said...

so is overfitting more of a concern when trying to predict, or should we be worried about it also in an explanatory setting?

corollary: is there a scenario where you would partition the data in the process of developing an explanatory model, or does that only come into play in a prediction setting?

ravi b.

Galit Shmueli said...

That's a very good point. In short, yes, it is more of a concern in predictive modeling because it might be harder to spot. But here's a longer answer:

In an explanatory task, overfitting means finding relationships that are just due to noise. But we do have some statistical tools to avoid that: since we use methods that shed light on the relationship between the response and predictors (e.g., regression), overfitting can be avoided by

(1) examining measures such as adjusted R-squared or p-values of additional predictors, which penalize model complexity in an attempt to avoid overfitting (better known, perhaps, as the principle of parsimony, or Occam's Razor).

(2) using domain knowledge, where possible, to decide whether the model is overfitting. An example is a person's name showing up as a significant predictor of their income...
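As an illustration of point (1), the standard adjusted R-squared formula is 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below applies it to two hypothetical models; the R-squared values, n, and p are made-up numbers chosen only to show the penalty at work.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 30  # hypothetical sample size

# Hypothetical small model: 3 predictors, R-squared = 0.50
small = adjusted_r2(0.50, n, p=3)

# Hypothetical large model: 7 extra predictors buy only 0.01 more R-squared
large = adjusted_r2(0.51, n, p=10)

print(round(small, 3))  # 0.442
print(round(large, 3))  # 0.252
```

Plain R-squared would favor the larger model (0.51 vs. 0.50), while the adjusted version favors the smaller one.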

A major issue is the data size, which is usually much bigger in predictive tasks. With very large samples even negligible effects become statistically significant, which renders standard statistical tests (and p-values) practically meaningless.
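A small sketch of why sample size matters for p-values: the same tiny correlation is tested at two sample sizes. The correlation of 0.01 and the sample sizes are hypothetical, and the p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable only for fairly large n.

```python
import math

def approx_p_value(r, n):
    """Approximate two-sided p-value for a sample correlation r from n observations.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and treats t as standard normal.
    """
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return math.erfc(abs(t) / math.sqrt(2))

r = 0.01  # hypothetical, practically negligible correlation

p_small = approx_p_value(r, 100)        # about 0.92: nowhere near significant
p_large = approx_p_value(r, 1_000_000)  # far below 0.001: "highly significant"

print(p_small, p_large)
```

The effect is equally negligible in both cases; only the sample size changed.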

In contrast, in predictive modeling we use performance measures such as MAPE and RMSE rather than R-squared or p-values, because the end goal is to accurately predict new observations (especially when the dataset is large). If we overfit, we'll likely get low predictive accuracy. Given that many predictive methods do not shed any light on the relationship between the response and predictors (e.g., k-nearest-neighbors or Naive Bayes), you cannot easily use domain knowledge to avoid overfitting.
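Here is a rough sketch of how overfitting shows up in a method like k-nearest-neighbors, where there are no coefficients to inspect. The data are simulated (a noisy sine curve), and the split sizes and the values of k are arbitrary assumptions. With k=1 each training point is its own nearest neighbor, so the training error is zero, which says nothing about accuracy on new observations.

```python
import math
import random

def knn_predict(train_x, train_y, x, k):
    """Average the responses of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def rmse(ys, preds):
    """Root mean squared error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

# Simulated data (an assumption): y = sin(x) + noise
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(300)]
ys = [math.sin(x) + rng.gauss(0, 0.5) for x in xs]
train_x, train_y = xs[:200], ys[:200]
valid_x, valid_y = xs[200:], ys[200:]

train_rmse, valid_rmse = {}, {}
for k in (1, 15):
    train_rmse[k] = rmse(train_y, [knn_predict(train_x, train_y, x, k) for x in train_x])
    valid_rmse[k] = rmse(valid_y, [knn_predict(train_x, train_y, x, k) for x in valid_x])
    print(f"k={k:2d}  training RMSE={train_rmse[k]:.3f}  validation RMSE={valid_rmse[k]:.3f}")
```

The gap between training and validation error for k=1 is the symptom to look for; a larger k will typically narrow it.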

Finally, predictive models are often deployed in an automated way: once the model is built, it is deployed multiple times to score (large) databases. Overfitting is then a real concern.

Data partitioning is used especially in predictive modeling because of the need for a platform to evaluate predictive accuracy. In an explanatory task you might partition the data to evaluate model robustness: does the same model fit the training and validation sets? Do the parameters drastically change? Is it influenced by outliers? etc.
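One way to sketch the robustness check for an explanatory model: fit the same model separately on each partition and compare the estimated parameters. The data below are simulated, the model is a simple linear regression, and the half-and-half split is an arbitrary assumption.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Simulated data (an assumption): y = 2 + 3x + noise
rng = random.Random(2)
xs = [rng.uniform(0, 10) for _ in range(400)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]

# Fit the same model on each partition
a_train, b_train = fit_line(xs[:200], ys[:200])
a_valid, b_valid = fit_line(xs[200:], ys[200:])

print(f"training:   intercept={a_train:.2f}, slope={b_train:.2f}")
print(f"validation: intercept={a_valid:.2f}, slope={b_valid:.2f}")
print(f"slope difference: {abs(b_train - b_valid):.3f}")
```

If the estimates differ drastically between partitions, that is a hint that the model is picking up noise or is driven by a few unusual observations.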

Galit Shmueli said...

Thanks Hans-Joerg for the philosophical angle - this definitely further supports the distinction.