Monday, April 21, 2008
Are explaining and predicting the same? An age-old debate in the philosophy of science goes back to Hempel & Oppenheim's 1948 paper, which equated the logical structure of predicting and explaining, claiming that they are in effect the same, except that in explaining the phenomenon has already happened, whereas in predicting it has not yet occurred. It was later recognized that the two are in fact very different.
When it comes to statistical modeling, how do the two differ? Do we model data differently when the goal is to explain rather than to predict? In a recent paper co-authored with Otto Koppius from Erasmus University, we show how the two goals lead to different choices at every step of the modeling process.
Let's take the argument to an extreme: can a wrong model lead to correct predictions? Here's an interesting example: although we know that the ancient Ptolemaic astronomical model, which postulates that the universe revolves around the Earth, is wrong, it turns out that this model generated very good predictions of planetary motion, speed, brightness, and size, as well as eclipse times. The predictions are easy to compute and accurate enough that they still serve as engineering approximations today, and were even used in navigation until not so long ago.
So how does a wrong model produce good predictions? It's all about the difference between causality and association. A "correct" model is one that identifies the causal structure. But for a good predictive model all we need are good associations!
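To see this concretely, here is a minimal simulation sketch in Python (the variables, coefficients, and data-generating process are all made up for illustration): a confounder z drives both x and y, so a regression of y on x gets the causal story completely wrong, yet the association is strong enough to predict well out of sample.

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up data-generating process: z causes both x and y;
    # x has NO causal effect on y (the true effect is zero).
    n = 10_000
    z = rng.normal(size=n)
    x = 2 * z + rng.normal(scale=0.1, size=n)
    y = 3 * z + rng.normal(scale=0.1, size=n)

    # Causally "wrong" model: predict y from x alone.
    train, test = slice(0, n // 2), slice(n // 2, n)
    slope, intercept = np.polyfit(x[train], y[train], deg=1)
    y_hat = slope * x[test] + intercept

    # The association between x and y (both driven by z) yields
    # accurate predictions despite the wrong causal structure.
    rmse = np.sqrt(np.mean((y[test] - y_hat) ** 2))
    print(f"test RMSE: {rmse:.2f} vs. sd(y): {y[test].std():.2f}")

The regression "finds" a strong effect of x on y where none exists causally, and still predicts y to within a small fraction of its standard deviation.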
Tuesday, April 15, 2008
Are conditional probabilities intuitive?
Somewhere in the early '90s, I started out as a teaching assistant for the "intro to probability" course. Before introducing conditional probabilities, I recall presenting the students with the "Let's Make a Deal" problem, which was supposed to show them that their intuition is often wrong, and that they should therefore learn the laws of probability, especially conditional probability and Bayes' Rule. This little motivational game was highlighted in last week's NYT with an extremely cool interactive interface: welcome to the Monty Hall Problem!
The problem is nicely described in Wikipedia:
Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?
The initial thought that crosses one's mind is "it doesn't matter whether you switch or not" (i.e., a probability of 1/2 that the car is behind each of the two closed doors). It turns out that switching is the optimal strategy: if you switch, the probability of winning the car is 2/3, but if you stay it is only 1/3.
How can this be? Note that the door the host opens is chosen such that it has a goat behind it. In other words, new information comes in once the door is opened. The idea behind the solution is to condition on the information that the opened door had a goat, and therefore to look at event pairs such as "goat-then-car" and "goat-then-goat". In probability language, we move from P(car behind door 1) to P(car behind door 1 GIVEN goat behind door 3).
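A quick Monte Carlo check makes the 2/3 tangible. Here is a minimal sketch in Python (having the host always open the first eligible door is a simplifying assumption; it does not change the win rates):

    import random

    def play(switch: bool, trials: int = 100_000) -> float:
        """Return the win rate under the stay/switch strategy."""
        wins = 0
        for _ in range(trials):
            car = random.randrange(3)    # door hiding the car
            pick = random.randrange(3)   # contestant's initial pick
            # Host opens a door that is neither the pick nor the car.
            opened = next(d for d in range(3) if d != pick and d != car)
            if switch:
                # Move to the one remaining closed door.
                pick = next(d for d in range(3) if d != pick and d != opened)
            wins += pick == car
        return wins / trials

    print(f"stay:   {play(switch=False):.3f}")  # ~0.333
    print(f"switch: {play(switch=True):.3f}")   # ~0.667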
The Tierney Lab, by NYT blogger John Tierney, writes about the psychology behind the deception in this game. [Thanks to Thomas Lotze for pointing me to this posting!] He quotes a paper by Fox & Levav (2004) that gets to the core of why people are deceived:
People seem to naturally solve probability puzzles by partitioning the set of possible events {Door 1; Door 2; Door 3}, editing out the possibilities that can be eliminated (the door that was revealed by the host), and counting the remaining possibilities, treating them as equally likely (each of two doors has a 1/2 probability of containing the prize).
In other words, they ignore the host. And then comes the embarrassing part: MBAs who had taken a probability course were asked, and they too got it wrong. The authors conclude with a suggestion to teach probability differently:
We suggest that introductory probability courses shouldn’t fight this but rather play to these natural intuitions by starting with an explanation of probability in terms of interchangeable events and random sampling.
What does this mean? My interpretation is to use trees when teaching conditional probabilities. Looking at a tree for the Monty Hall game (assuming you initially choose door 1) shows the asymmetry of the different options and the effect of the car's location relative to your initial choice. I agree that trees are a much more intuitive way to compute and understand conditional probabilities. But I'm not sure how to show Bayes' Rule pictorially in an intuitive way. Ideas, anyone?
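In the meantime, here is the computation that the door-1 tree encodes, written out as Bayes' Rule (assuming the host opens door 3, and flips a fair coin between doors 2 and 3 when the car really is behind door 1):

P(\text{car 1} \mid \text{opens 3})
  = \frac{P(\text{opens 3} \mid \text{car 1})\,P(\text{car 1})}
         {\sum_{d=1}^{3} P(\text{opens 3} \mid \text{car } d)\,P(\text{car } d)}
  = \frac{(1/2)(1/3)}{(1/2)(1/3) + (1)(1/3) + (0)(1/3)}
  = \frac{1}{3}.

The complementary probability, P(car 2 | opens 3) = 2/3, is exactly the switching win rate that the tree (and the simulation above) shows.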
Wednesday, April 02, 2008
Data Mining Cup 2008 releases data today
Although the call for this competition has been out for a while on KDnuggets.com, today is the day the data and the task description are released. This data mining competition is aimed at students. The prize (participation in KDD 2008, the world's largest international conference for knowledge discovery and data mining, August 24-27, 2008, in Las Vegas) probably won't sound that attractive to students, so I'd say the real prize is cracking the problem and winning!
An interesting related story that I recently heard from Chris Volinsky of the BellKor team (currently in first place) is the high level of collaboration that competing teams have been exhibiting during the Netflix Prize. Although you'd think the $1 million would be a sufficient incentive not to share, it turns out that the fun of the challenge leads teams to collaborate and share ideas! You can see some of this collaboration on the NetflixPrize Forum.
Thursday, March 06, 2008
Mining voters
While the presidential candidates are still doing their dances, it's interesting to see how they use data mining to improve their standing: the candidates apparently hire companies that mine their voter databases in order to "micro-target" voters via ads and the like. See this blog posting on The New Republic -- courtesy of former student Igor Nakshin. Note also the comment about the existence of various such companies catering to the different candidates.
It would be interesting to test the impact of this "mining" on actual candidate voting and to compare the different tools. But how can this be done in an objective manner without the companies actually sharing their data? That would fall in the area of "privacy-preserving data mining".
New data repository by UN
As more government and other agencies move online, some actually make their data publicly available. Adi Gadwale, one of my dedicated ex-students, sent a note about a neat new data repository made publicly available by the UN, called UNdata. You can read more about it in the UN News bulletin or go directly to the repository at http://data.un.org
The interface is definitely easy to navigate, with lots of time series for different countries on many types of measurements. This is a good source of data for supplementing other existing datasets (much as one would use US census data to supplement demographic information).
Another interesting data repository is TRAC. Its mission is to obtain and provide all information that should be public under the Freedom of Information Act. It has data on many US agencies. Some data are free for download, but to get access to all the neat stuff you (or your institution) need a subscription.
Thursday, February 28, 2008
Forecasting with econometric models
Here's another interesting example of how explanatory and predictive tasks lead to different models: econometric models. These are essentially regression models of the form:
Y(t) = beta0 + beta1 Y(t-1) + beta2 X(t) + beta3 X(t-1) + beta4 Z(t-1) + noise
An example would be forecasting Y(t) = consumer spending at time t, where the input variables can be consumer spending in previous time periods and/or other information available at time t or earlier.
In economics, when Y(t) is the state of the economy at time t, a distinction is made between three types of variables (aka "indicators"): leading, coincident, and lagging. Leading indicators change before the economy changes (e.g., the stock market); coincident indicators change while the economy changes (e.g., GDP); and lagging indicators change after the economy changes (e.g., unemployment) -- see about.com.
This distinction becomes especially important when we consider the difference between building an econometric model for explaining versus forecasting. For explaining, you can have both leading and coincident variables as inputs. However, if the purpose is forecasting, including coincident variables requires forecasting them first, before they can be used to forecast Y(t). The alternative is to lag those variables and include them only in leading-indicator format, as in the sketch below.
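Here is a minimal sketch in Python (pandas + statsmodels) of the two specifications; the series names, coefficients, and data-generating process are all synthetic stand-ins, not real economic data:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic stand-ins: y = consumer spending, x = a coincident
    # indicator, z = a leading indicator (all made up for illustration).
    rng = np.random.default_rng(1)
    n = 200
    x = rng.normal(size=n).cumsum()
    z = rng.normal(size=n).cumsum()
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = 0.6 * y[t - 1] + 0.3 * x[t] + 0.2 * z[t - 1] + rng.normal()

    df = pd.DataFrame({"y": y, "x": x, "z": z})
    df["y_lag1"] = df["y"].shift(1)
    df["x_lag1"] = df["x"].shift(1)
    df["z_lag1"] = df["z"].shift(1)
    df = df.dropna()

    # Explanatory model: the coincident x(t) may appear on the right.
    explain = smf.ols("y ~ y_lag1 + x + x_lag1 + z_lag1", data=df).fit()

    # Forecasting model: only inputs known at forecast time, so the
    # coincident x(t) is dropped and enters only through its lag.
    forecast = smf.ols("y ~ y_lag1 + x_lag1 + z_lag1", data=df).fit()

The explanatory fit recovers the contemporaneous coefficient on x, but it cannot produce forecasts unless x(t) is itself forecast first; the second model gives up that coefficient in exchange for inputs that are all available when the forecast is made.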
I found a neat example of a leading indicator on thefreedictionary.com: The "Leading Lipstick Indicator"
is based on the theory that a consumer turns to less-expensive indulgences, such as lipstick, when she (or he) feels less than confident about the future. Therefore, lipstick sales tend to increase during times of economic uncertainty or a recession. This term was coined by Leonard Lauder (chairman of Estee Lauder), who consistently found that during tough economic times, his lipstick sales went up. Believe it or not, the indicator has been quite a reliable signal of consumer attitudes over the years. For example, in the months following the Sept 11 terrorist attacks, lipstick sales doubled.
Tuesday, February 26, 2008
Data mining competition season
Those who've been following my postings probably recall "competition season", when all of a sudden there are multiple new interesting datasets out there, each framing a business problem that requires a combination of data mining and creativity.
Two such competitions are the SAS Data Mining Shootout and the 2008 Neural Forecasting Competition (NN5). The SAS problem concerns revenue management for an airline that wants to improve its customer satisfaction. The NN5 competition is about forecasting cash withdrawals from ATMs.
Here are the similarities between the two competitions: both provide real data and reasonably realistic business problems. Now to a more interesting similarity: both involve time series forecasting tasks. Judging by a recent survey on the popularity of data mining techniques, time series methods are becoming more and more prominent. Both competitions also require registration to get access to the data (I didn't compare their terms of use, but that's another interesting comparison) and welcome any type of modeling. Finally, both are tied to a conference where competitors can present their results and methods.
What would be really nice is if the winners' papers were published online and made publicly available, as is done at KDD.