Monday, March 20, 2006

Data mining for prediction vs. explanation

A colleague of mine and I have an ongoing discussion about what data mining is about. In particular, he claims that data mining focuses on predictive tasks, whereas I see the use of data mining methods for either prediction or explanation.

Predictive vs. Explanatory Tasks
The distinction between predictive and explanatory tasks is not always easy to draw. Of course, in both cases the goal is "future actionable results". In general, an explanatory goal (also called characterizing or profiling) is when we're interested in learning about the factors that affect a certain outcome. In contrast, a predictive goal (also called classification, when we have a categorical response) is when we care less about "insights" and more about accurate predictions of future observations.

In some cases it is very easy to identify an explanatory task, because it simply doesn't make any sense to predict the outcome. For example, we might be looking at performance measures of male and female CEOs, in an attempt to understand the differences (in terms of performance) between the genders. In this case, we would most likely not try to predict the gender of a new observation based on their performance measures...

In most cases, however, it is harder to decide what the task is, mainly because problem descriptions are ambiguous. People will say "I'd like you to use these data to see if we can understand the factors that lead to defaulting on a loan, so that we can predict for new applicants their chance to default". Since it is crucial for the analysis to know which type of task (explanatory vs. predictive) is required, I believe the only way out is to try and squeeze it out of the domain person with questions such as "what do you plan to do with the analysis results?" or "what if I found that..."

Here's an example: A marketing manager at a cable company hires you to analyze their customer data in order "to know the household-characteristics that distinguish subscribers to premium channels (such as HBO) from non-subscribers." Is this a predictive goal or an explanatory one? It depends! Here are possible scenarios of what they will use the analysis for:
1. To market the premium channel plan in a new market only to people who are more likely to subscribe (predictive)
2. To re-negotiate the cable company's contract with HBO based on what they find (explanatory) -- this scenario is courtesy of a current MBA student

How will the analysis differ?
Depending on whether the goal is predictive or explanatory, the analysis process will take different avenues. This includes which performance measures are devised and used (e.g., to partition the data or not? to use an R-squared or a MAPE?), which types of methods are employed (a k-nearest-neighbor model, for example, won't have much explanatory power), etc.
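
To make the "partition or not, R-squared or MAPE" contrast concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the data are synthetic, the helper names (fit_line, r_squared, mape) are hypothetical, and the 60/40 split is arbitrary. It is one way the two workflows might look, not a prescription.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def r_squared(ys, preds):
    """Share of the variance in ys accounted for by preds."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def mape(ys, preds):
    """Mean absolute percentage error, in percent (ys must be nonzero)."""
    return 100 * sum(abs((y - p) / y) for y, p in zip(ys, preds)) / len(ys)

# Synthetic data (an assumption): y = 5 + 2x plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(1, 10) for _ in range(200)]
ys = [5 + 2 * x + rng.gauss(0, 1) for x in xs]

# Explanatory use: fit on ALL the data, then look at the coefficient
# and at how well the model fits the data at hand (R-squared).
a_full, b_full = fit_line(xs, ys)
r2_full = r_squared(ys, [a_full + b_full * x for x in xs])
print("explanatory: slope =", round(b_full, 3), " R-squared =", round(r2_full, 3))

# Predictive use: partition, fit on the training part only, and judge
# the model by its error on observations it has never seen (MAPE).
n_train = int(0.6 * len(xs))
x_train, y_train = xs[:n_train], ys[:n_train]
x_hold, y_hold = xs[n_train:], ys[n_train:]
a_tr, b_tr = fit_line(x_train, y_train)
mape_holdout = mape(y_hold, [a_tr + b_tr * x for x in x_hold])
print("predictive: holdout MAPE (%) =", round(mape_holdout, 2))
```

The model is the same in both halves; what changes is the question asked of it. The explanatory half reads the slope and the in-sample fit, while the predictive half ignores the slope and only asks how far off the predictions are on held-out data.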