Saturday, October 18, 2008

Microsoft and the financial downfall

One of the misleading features of Microsoft Office software is that it gives users the illusion that they are in control of what's visible and what's hidden to readers of their files. One example is copy-pasting from an Excel sheet into a Word or PowerPoint document. If you then double-click on the embedded piece you'll see... the Excel file! It is automatically embedded within the Word/PowerPoint file. A few years ago, after teaching this to MBAs, a student came back the following week all excited, telling me how he had just detected fraudulent reporting to his company by a contractor. He simply double-clicked on a pasted Excel chart within the contractor's report written in Word. The embedded Excel file told all the contractor's secrets.

A solution is to use "Paste Special > As Picture". But that only helps if you know about it!

Another such feature is Excel's "hidden" fields. You can "hide" certain areas of your Excel spreadsheet, but don't be surprised if those areas are not really hidden: it turns out that Barclays Capital just fell into this trap in its proposal to buy the collapsed investment bank Lehman Brothers. This week's article Lehman Excel snafu could cost Barclays dear tells the story of how "a junior law associate at Cleary Gottlieb Steen & Hamilton LLP converted an Excel file into a PDF format document... Some of these details on various trading contracts were marked as hidden because they were not intended to form part of Barclays' proposed deal. However, this "hidden" distinction was ignored during the reformatting process so that Barclays ended up offering to take on an additional 179 contracts as part of its bankruptcy buyout deal".

The moral:
(1) if you have secrets, don't keep them in Microsoft Office.
(2) if you convert your secrets from Microsoft to something safer (like PDF), check the result of the conversion carefully!

Tuesday, October 07, 2008

Sensitivity, specificity, false positive and false negative rates

I recently had an interesting discussion with a few colleagues in Korea regarding the definition of false positive and false negative rates and their relation to sensitivity and specificity. Apparently there is real confusion out there, and if you search the web you'll find conflicting information. So let's sort this out:

Let's assume we have a dataset of bankrupt and solvent firms. We now want to evaluate the performance of a certain model for predicting bankruptcy. Clearly here, the important class is "bankrupt", as the consequences of misclassifying bankrupt firms as solvent are heavier than misclassifying solvent firms as bankrupt. We organize the data in a confusion matrix (aka classification matrix) that crosses actual firm status with predicted status (generated by the model). Say this is the matrix:


                     Predicted bankrupt   Predicted solvent   Total
Actual bankrupt             201                   85            286
Actual solvent               25                 2689           2714
Total                       226                 2774           3000

In our textbook Data Mining for Business Intelligence we treat the four metrics as two pairs, {sensitivity, specificity} and {false positive rate, false negative rate}, each pair measuring a different aspect. Sensitivity and specificity measure the ability of the model to correctly detect the important class (=sensitivity) and its ability to correctly rule out the unimportant class (=specificity). This definition is apparently not controversial. In the example, the sensitivity would be 201/(201+85) = the proportion of bankrupt firms that the model accurately detects. The model's specificity here is 2689/(2689+25) = the proportion of solvent firms that the model accurately "rules out".

Now to the controversy: We define the false positive rate as the proportion of cases predicted as important (positive) that actually belong to the non-important class, i.e., false positives among all predicted positives. In the example the false positive rate would be 25/(201+25). Similarly, the false negative rate is the proportion of cases predicted as non-important that actually belong to the important class, i.e., false negatives among all predicted negatives (=85/(85+2689)). My colleagues, however, disagreed with this definition. According to their definition, false positive rate = 1-specificity, and false negative rate = 1-sensitivity.
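To make the two pairs of measures concrete, here is a small Python sketch (my own illustration, not from the book) that computes all four from the counts in the matrix above:

# Confusion matrix counts from the example above (important class = "bankrupt")
TP = 201   # bankrupt firms predicted as bankrupt
FN = 85    # bankrupt firms predicted as solvent
FP = 25    # solvent firms predicted as bankrupt
TN = 2689  # solvent firms predicted as solvent

sensitivity = TP / (TP + FN)          # 201/286  ~ 0.70
specificity = TN / (TN + FP)          # 2689/2714 ~ 0.99

# our definitions: condition on the *predicted* class
false_positive_rate = FP / (TP + FP)  # 25/226  ~ 0.11
false_negative_rate = FN / (FN + TN)  # 85/2774 ~ 0.03

# my colleagues' definitions: condition on the *actual* class
alt_fpr = 1 - specificity             # ~ 0.01
alt_fnr = 1 - sensitivity             # ~ 0.30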

And indeed, if you search the web you will find conflicting definitions of false positive and negative rates. However, I claim that our definitions are the correct ones. A nice explanation of the difference between the two pairs of metrics is given on p.37 of Chatterjee et al.'s textbook A Casebook for a First Course in Statistics and Data Analysis (a very neat book for beginners, with all ancillaries on Jeff Simonoff's page):

Consider... HIV testing. The standard test is the Wellcome Elisa test. For any diagnostic test...
(1) sensitivity = P(positive test result | person is actually HIV positive)
(2) specificity = P(negative test result | person is actually not HIV positive)

... the sensitivity of the Elisa test is approximately .993 (so only .7% of people who are truly HIV-positive would have a negative test result), while the specificity is approximately .9999 (so only .01% of the people who are truly HIV-negative would have a positive test result).

That sounds pretty good. However, these are not the only numbers to consider when evaluating the appropriateness of random testing. A person who tests positive is interested in a different conditional probability: P(person is actually HIV-positive | a positive test result). That is, what proportion of people who test positive actually are HIV-positive? If the incidence of the disease is low, most positive results could be false positives.
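Here is that last conditional probability computed explicitly, in a short Python sketch; the sensitivity and specificity are the values quoted above, while the prevalence values are assumptions I picked for illustration:

sensitivity = 0.993
specificity = 0.9999

for prevalence in (0.001, 0.0001, 0.00001):   # assumed incidence rates
    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    ppv = sensitivity * prevalence / p_positive   # P(actually HIV-positive | positive test)
    print(prevalence, round(ppv, 2))
# prints roughly 0.91, 0.50, 0.09 -- the rarer the disease, the more of the
# positive results are false positives, even with an excellent test.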
My colleague Lele at UMD pointed out that this confusion has caused some havoc in the field of Education as well. Here is a paper that proposes to go as far as creating two separate confusion matrices and using lower- and upper-case notation to avoid the confusion!

Convinced?

Monday, September 22, 2008

Dr. Doom and data mining

Last month The New York Times featured an article about "Dr. Doom", economics professor Nouriel Roubini: "Roubini, a respected but formerly obscure academic, has become a major figure in the public debate about the economy: the seer who saw it coming."

This article caught my statistician's eye due to its description of "data" and "models". While economists in the article portray Roubini as not using data and econometric models, a careful read shows that he actually does use data and models, but perhaps unusual data and unusual models!

Here are two interesting quotes:
“When I weigh evidence,” he told me, “I’m drawing on 20 years of accumulated experience using models” — but his approach is not the contemporary scholarly ideal in which an economist builds a model in order to constrain his subjective impressions and abide by a discrete set of data.
Later on, Roubini is quoted:
"After analyzing the markets that collapsed in the ’90s, Roubini set out to determine which country’s economy would be the next to succumb to the same pressures."
This might not be data mining per se, but note that Roubini's approach is at heart similar to the data mining approach: looking at unusual data (here, taking an international view rather than focusing only on national data) and finding patterns within them that predict economic downfalls. In a standard data mining framework we would of course also include all those markets that have not collapsed, and then set up the problem as a "direct marketing" problem: who is most likely to fall?

A final note: As a strong believer in the difference between the goals of explaining and forecasting, I think that econometricians should stop limiting their modeling to explanatory, causality-based models. Good forecasters might not be revealing in terms of causality, but in many cases their forecasts will be far more accurate than those from explanatory models!

Wednesday, September 03, 2008

Data conversion and open-source software

Recently I was trying to open a data file that was created in the statistical software SPSS. SPSS is widely used in the social sciences (a competitor to SAS), and appears to have some foothold here in Bhutan. With the slow and erratic internet connection here in Bhutan, though, I've failed again and again to use the software through our school's portal. Finding the local SPSS representative seemed a bit surreal, and so I went off trying to solve the problem in another way.

First stop: Googling "convert .sav to .csv" led me nowhere. SPSS and SAS both have an annoying "feature" of keeping data in file formats that are very hard to convert. A few software packages now import data from SAS databases, but I was unable to find one that will import from SPSS. This led me to a surprising finding: PSPP. Yes, that's right: PSPP, previously known as FIASCO, is an open-source "free replacement for the proprietary program, SPSS." The latest version even boasts a graphical user interface. Another interesting feature is described as "Fast statistical procedures, even on very large data sets."

My problem hasn't been solved yet, because downloading PSPP and the required Cygwin software poses a challenge with my narrow bandwidth... Thus, I cannot report on the usefulness of PSPP. I'd be interested in hearing from others who have tested/used it!
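For anyone hitting the same wall today with better bandwidth, one route I have not tested myself is the Python package pyreadstat, a much more recent tool than anything I had available here; a minimal sketch, assuming the SPSS file is called survey.sav:

import pyreadstat

# read the SPSS file: returns a pandas DataFrame plus the variable metadata
df, meta = pyreadstat.read_sav("survey.sav")
df.to_csv("survey.csv", index=False)   # plain CSV, readable anywhere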

Monday, August 25, 2008

Simpson's Paradox in Bhutan

This year I am on academic sabbatical, hence the lower rate of postings. Moreover, postings this year might have an interesting twist, since I am in Bhutan volunteering at an IT institute. As part of the effort, I am conducting workshops on various topics at the interface of IT and data analysis. IT is very much in its infancy here in Bhutan, which makes me assess and use IT very differently than I am used to.

My first posting is about Simpson's Paradox arising in a Bhutanese context (I will post separately on Simpson's Paradox in the future): The Bhutan Survey of Standards of Living, conducted by the Bhutan National Statistics Bureau, reports statistics on family size, gender of the head of family, and rural/urban location. Let's consider whether family planning policies should target female- and male-headed families separately or not. I was able to assemble the following pivot table from their online report:

[Pivot table: average household size by gender of household head and urban/rural location]

Now, note the column marginals, where it appears that the average household size is practically identical for female-headed (4.9985) and male-headed (5.027) households. If you only sliced the data by the gender of the head of family, you might conclude that the same family planning policy should be used in both cases. Now, examine the figures broken down by urban/rural: Female-headed households are on average smaller than male-headed households in both urban and rural areas! Thus, family planning policies seem to need stronger (or at least different) targeting at male-headed households!
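Just to make the arithmetic of such a reversal concrete, here is a tiny Python sketch with made-up cell values (the report's actual urban/rural figures are not reproduced here): female-headed households are smaller in both locations, yet the overall means come out equal because a larger share of female-headed households is rural, where families are bigger.

# (mean household size, number of households) -- hypothetical numbers
female = {"urban": (3.8, 200), "rural": (5.3, 800)}
male   = {"urban": (4.2, 500), "rural": (5.8, 500)}

def overall_mean(group):
    people     = sum(size * count for size, count in group.values())
    households = sum(count for _, count in group.values())
    return people / households

print(overall_mean(female))  # 5.0 -- yet smaller than male-headed in BOTH locations
print(overall_mean(male))    # 5.0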

If you are not familiar with Simpson's Paradox you might be puzzled. I will write about the inner workings of this so-called paradox in the near future. Until then, check out Wikipedia...

Wednesday, June 11, 2008

Summaries or graphs?

Herb Edelstein from Two Crows Consulting introduced me to this neat example showing how graphs can be much more revealing than summary statistics. It is an age-old example by Anscombe (1973); I will show a slightly updated version of Anscombe's example, by Basset et al. (1986):
We have four datasets, each containing 11 pairs of X and Y measurements. All four datasets have the same X variable, and only differ on the Y values.

Here are the summary statistics for each of the four Y variables (A, B, C, D):

            A           B           C           D
Average     20.95       20.95       20.95       20.95
Std         1.495794    1.495794    1.495794    1.495794


That's right - the means and standard deviations are all identical. Now let's go one step further and fit the four simple linear regression models Y = a + bX + noise. Remember, the X is the same in all four datasets. Here is the output for the first dataset:
Regression Statistics
Multiple R           0.620844098
R Square             0.385447394
Adjusted R Square    0.317163771
Standard Error       1.236033081
Observations         11

             Coefficients   Standard Error   t Stat     P-value
Intercept    18.43          1.12422813       16.39347   5.2E-08
Slope        0.28           0.11785113       2.375879   0.041507

Guess what? The other three regression outputs are identical!

So are the four Y variables identical???


Well, here is the answer:

[Scatterplots of Y vs. X for the four datasets A-D]

To top it off, Basset included one more dataset that has the exact same summary stats and regression estimates. Here is the scatterplot:

[Scatterplot of Y vs. X for the fifth dataset]

[You can find all the data for both Anscombe's and Basset et al.'s examples here]
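If you'd like to verify the punch line yourself, here is a short Python sketch using Anscombe's original 1973 quartet (the table above is from Basset et al.'s variant, whose raw data I don't reproduce here):

import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4   = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = {
    "I":   [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II":  [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV":  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}
xs = {"I": x123, "II": x123, "III": x123, "IV": x4}

for name in ys:
    x, y = np.array(xs[name], float), np.array(ys[name], float)
    slope, intercept = np.polyfit(x, y, 1)
    print(f"{name}: mean={y.mean():.2f} sd={y.std(ddof=1):.2f} "
          f"fit: y = {intercept:.2f} + {slope:.3f}x  r={np.corrcoef(x, y)[0, 1]:.3f}")
# all four lines print mean 7.50, sd 2.03, y = 3.00 + 0.500x, r = 0.816 --
# only the scatterplots give the datasets away.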

Tuesday, June 10, 2008

Resources for instructors of data mining courses in b-schools

With the increasing popularity of data mining courses being offered in business schools (at the MBA and undergraduate levels), a growing number of faculty are becoming involved. Instructors come from diverse backgrounds: statistics, information systems, machine learning, management science, marketing, and more.

Since our textbook Data Mining for Business Intelligence came out, I've received requests from many instructors to share materials, information, and other resources. At last, I have launched BZST Teaching, a forum for instructors teaching data mining in b-schools. The forum is open only to instructors, and a host of materials and communication options are available. It is founded on the idea of sharing, with the expectation that members will contribute to the pool.

If you are interested in joining, please email me directly.

Friday, June 06, 2008

Student network launched!

I recently launched a forum for the growing population of MBA students who are veterans of our data mining course. The goal of the forum is to foster networking and job-related communication (not too many data mining-savvy MBAs out there!), to share interesting data-analytic stories, and to keep in touch (where are you today?).

Sunday, June 01, 2008

Weighted nearest-neighbors

K-nearest neighbors (k-NN) is a simple yet often powerful classification/prediction method. The basic idea, for predicting a new observation, is to find the k most similar observations in terms of the predictor (X) values, and then let those k neighbors vote to determine the predicted class membership (or take the average of their Y values to predict a numerical outcome). Since this is such an intuitive method, I thought it would be useful to discuss two improvements that have been suggested by data miners. Both use weighting, but in different ways.

One intuitive improvement is to weight the neighbors by their proximity to the observation of interest. In other words, rather than giving each neighbor equal importance in the vote (or average), closer neighbors have higher impact on the prediction.
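As a minimal sketch of distance weighting (my own illustration, not from our textbook), scikit-learn's k-NN implementation supports it directly through the weights argument; the built-in dataset here is just a stand-in:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for w in ("uniform", "distance"):   # equal votes vs. votes weighted by 1/distance
    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=7, weights=w))
    knn.fit(X_train, y_train)
    print(w, knn.score(X_test, y_test))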

A second way to use weighting to improve the predictive performance of k-NN relates to the predictors: In ordinary k-NN the predictors are typically brought to the same scale by normalization, and then treated equally for the purpose of determining proximities between observations. An improvement is therefore to weight the predictors according to their predictive power, so that more informative predictors are given higher importance. The question is how to assign the weights, or in other words, how to assign predictive-power scores to the different predictors. There are a variety of papers out there suggesting different methods. The main approach is to use a different classification/prediction method that yields predictor importance measures (e.g., logistic regression), and then to use those measures to construct the predictor weights within k-NN.
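Here is one possible sketch of predictor weighting (one choice among the many proposed in the literature, not a canonical method): use the absolute coefficients of a logistic regression on the standardized predictors as importance scores, stretch each predictor axis by its score, and then run ordinary k-NN on the rescaled data.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

scaler = StandardScaler().fit(X_train)
Z_train, Z_test = scaler.transform(X_train), scaler.transform(X_test)

# importance scores from a different model (here: logistic regression)
w = np.abs(LogisticRegression(max_iter=5000).fit(Z_train, y_train).coef_.ravel())

knn = KNeighborsClassifier(n_neighbors=7).fit(Z_train * w, y_train)
print(knn.score(Z_test * w, y_test))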

Tuesday, May 13, 2008

Why zipcodes take over trees

A few weeks ago I went up to West Point to present a talk at their 2008 Statistics Workshop. Another speaker was Professor Wei-Yin Loh from the University of Wisconsin. He gave a very interesting talk that touched upon an interesting aspect of classification and regression trees: selection bias. Because splits in trees are constructed by trying out all possible variables at all possible values, a variable with lots and lots of categories (e.g., zipcode) will likely get selected! Professor Loh developed his own tree software, GUIDE, which overcomes this issue. The principle is to first choose which predictor to split on (based on chi-square tests of independence), and only after a predictor is chosen is the search for the best split carried out.
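Here is a rough sketch of the predictor-selection idea in Python (my simplification for illustration, not Loh's actual GUIDE algorithm): choose the split variable by chi-square p-values first, and only then search that one variable for the best split, so a 50-level zipcode factor no longer wins merely because it offers more candidate splits.

import pandas as pd
from scipy.stats import chi2_contingency

def choose_split_variable(df: pd.DataFrame, target: str, max_bins: int = 4) -> str:
    """Pick the predictor most associated with the class via chi-square tests."""
    pvalues = {}
    for col in df.columns.drop(target):
        x = df[col]
        if pd.api.types.is_numeric_dtype(x):
            x = pd.qcut(x, q=max_bins, duplicates="drop")  # bin numeric predictors
        table = pd.crosstab(x, df[target])
        pvalues[col] = chi2_contingency(table)[1]          # p-value of the test
    return min(pvalues, key=pvalues.get)   # smallest p-value wins; split search comes next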

Monday, April 21, 2008

Good predictions by wrong model?

Are explaining and predicting the same? An age-old debate in the philosophy of science started with Hempel & Oppenheim's 1948 paper, which equates the logical structure of predicting and explaining (saying that in effect they are the same, except that in explaining the phenomenon has already happened while in predicting it has not yet occurred). Later on it was recognized that the two are in fact very different.

When it comes to statistical modeling, how are the two different? Do we model data differently when the goal is to explain than to predict? In a recent paper co-authored with Otto Koppius from Erasmus University, we show how modeling is different in every step.

Let's take the argument to an extreme: Can a wrong model lead to correct predictions? Well, here's an interesting example: Although we know that the ancient Ptolemaic astronomical model, which postulates that the universe revolves around the Earth, is wrong, it turns out that this model generated very good predictions of planetary motion, speed, brightness, and size, as well as eclipse times. The predictions are easy to compute and sufficiently accurate that they still serve today as engineering approximations and were even used in navigation until not so long ago.

So how does a wrong model produce good predictions? It's all about the difference between causality and association. A "correct" model is one that identifies the causality structure. But for a good predictive model all we need are good associations!

Tuesday, April 15, 2008

Are conditional probabilities intuitive?

Somewhere in the early '90s I started as a teaching assistant for the "intro to probability" course. Before introducing conditional probabilities, I recall presenting the students with the "Let's Make a Deal" problem, which was supposed to show them that their intuition is often wrong and therefore they should learn the laws of probability, especially conditional probability and Bayes' Rule. This little motivation game was highlighted in last week's NYT with an extremely cool interactive interface: welcome to the Monty Hall Problem!

The problem is nicely described in Wikipedia:
Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?

The initial thought that crosses one's mind is "it doesn't matter if you switch or not" (i.e. probability of 1/2 that the car is behind each of the two closed doors). Turns out that switching is the optimal strategy: if you switch there's a probability of 2/3 to win the car, but if you stay it's only 1/3.

How can this be? Note that the door that the host opens is chosen such that it has a goat behind it. In other words, some new information comes in once the door gets opened. The idea behind the solution is to condition on the information that the opened door had a goat behind it, and therefore we look at event pairs such as "goat-then-car" and "goat-then-goat". In probability language, we move from P(car behind door 1) to P(car behind door 1 GIVEN goat behind door 3).
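If the conditioning argument still feels slippery, a quick simulation settles it; this is a small Python sketch of my own (assuming the host always opens a goat door and always offers the switch):

import random

def play(switch, n=100_000):
    wins = 0
    for _ in range(n):
        car, pick = random.randrange(3), random.randrange(3)
        # the host opens a door that is neither your pick nor the car
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / n

print("stay:  ", play(switch=False))   # about 1/3
print("switch:", play(switch=True))    # about 2/3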

The Tierney Lab, by NYT's blogger John Tierney, writes about the psychology behind the deception in this game. [Thanks to Thomas Lotze for pointing me to this posting!] He quotes a paper by Fox & Levav (2004) that gets to the core of why people get deceived:
People seem to naturally solve probability puzzles by partitioning the set of possible events {Door 1; Door 2; Door 3}, editing out the possibilities that can be eliminated (the door that was revealed by the host), and counting the remaining possibilities, treating them as equally likely (each of two doors has a ½ probability of containing the prize).
In other words, they ignore the host. And then comes the embarrassing part: MBAs who took a probability course were asked, and they too got it wrong. The authors conclude with a suggestion to teach probability differently:
We suggest that introductory probability courses shouldn’t fight this but rather play to these natural intuitions by starting with an explanation of probability in terms of interchangeable events and random sampling.


What does this mean? My interpretation is to use trees when teaching conditional probabilities. Looking at a tree for the Monty Hall game (assuming that you initially choose door 1) shows the asymmetry of the different options and the effect of the car location relative to your initial choice. I agree that trees are a much more intuitive and easy way to compute and understand conditional probabilities. But I'm not sure how to pictorially show Bayes' Rule in an intuitive way. Ideas anyone?

Wednesday, April 02, 2008

Data Mining Cup 2008 releases data today

Although the call for this competition has been out for a while on KDnuggets.com, today is the day when the data and the task description are released. This data mining competition is aimed at students. The prizes might not sound that attractive to students (participation in KDD 2008, "the world's largest international conference for Knowledge Discovery and Data Mining", August 24-27, 2008 in Las Vegas), so I'd say the real prize is cracking the problem and winning!

An interesting related story that I recently heard from Chris Volinsky of the BellKor team (which is currently in first place) is the high level of collaboration that competing teams have been exhibiting during the Netflix Prize. Although you'd think the $1 million would be a sufficient incentive for not sharing, it turns out that the fun of the challenge leads teams to collaborate and share ideas! You can see some of this collaboration on the NetflixPrize Forum.

Thursday, March 06, 2008

Mining voters

While the presidential candidates are still doing their dances, it's interesting to see how they use data mining to improve their standing: The candidates apparently use companies that mine their voter databases in order to "micro-target" voters via ads and the like. See this blog posting on The New Republic -- courtesy of former student Igor Nakshin. Note also the comment about the existence of various such companies that cater to the different candidates.

It would be interesting to test the impact of this "mining" on actual voting and to compare the different tools. But how can this be done in an objective manner without the companies actually sharing their data? That would fall into the area of "privacy-preserving data mining".

New data repository by UN

As more government and other agencies move "online", some actually make their data publicly available. Adi Gadwale, one of my dedicated ex-students, sent a note about a neat new data repository made publicly available by the UN, called UNdata. You can read more about it in the UN News bulletin or go directly to the repository at http://data.un.org

The interface is definitely easy to navigate. There are lots of time series for the different countries on many types of measurements. This is a good source of data that can be used to supplement other existing datasets (much as one would use US census data to supplement demographic information).

Another interesting data repository is TRAC. Its mission is to obtain and provide all information that should be public under the Freedom of Information Act. It has data on many US agencies. Some data are free for download, but to get access to all the neat stuff you (or your institution) need a subscription.

Thursday, February 28, 2008

Forecasting with econometric models

Here's another interesting example where explanatory and predictive tasks create different models: econometric models. These are essentially regression models of the form:

Y(t) = beta0 + beta1 Y(t-1) + beta2 X(t) + beta3 X(t-1) + beta4 Z(t-1) + noise

An example would be forecasting Y(t)= consumer spending at time t, where the input variables can be consumer spending in previous time periods and/or other information that is available at time t or earlier.

In economics, when Y(t) is the state of the economy at time t, there is a distinction between three types of variables (aka "indicators"): Leading, coincident, and lagging variables. Leading indicators are those that change before the economy changes (e.g. the stock market); coincident indicators change during the period when the economy changes (e.g., GDP), and lagging indicators change after the economy changes (e.g., unemployment). -- see about.com.

This distinction is especially revealing when we consider the difference between building an econometric model for the purpose of explaining vs. forecasting. For explaining, you can have both leading and coincident variables as inputs. However, if the purpose is forecasting, the inclusion of coincident variables requires one to forecast them before they can be used to forecast Y(t). An alternative is to lag those variables and include them only in leading-indicator format.
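As a small illustration of the forecasting version of such a model, here is a Python sketch on synthetic data of my own making (so the variable names and coefficients are assumptions for the example): Y(t) is regressed only on lagged values, i.e., on information that is actually available when the forecast is made.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x, z = rng.normal(size=n), rng.normal(size=n)   # leading indicators
y = np.zeros(n)
for t in range(1, n):                           # Y depends only on lagged information
    y[t] = 1 + 0.6 * y[t-1] + 0.5 * x[t-1] + 0.3 * z[t-1] + rng.normal(scale=0.5)

df = pd.DataFrame({"Y": y, "X": x, "Z": z})
lags = pd.DataFrame({"Y_lag1": df["Y"].shift(1),
                     "X_lag1": df["X"].shift(1),
                     "Z_lag1": df["Z"].shift(1)}).dropna()

model = sm.OLS(df["Y"].loc[lags.index], sm.add_constant(lags)).fit()
print(model.params)   # recovers roughly (1, 0.6, 0.5, 0.3); no coincident variables needed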

I found a neat example of a leading indicator on thefreedictionary.com: The "Leading Lipstick Indicator"
is based on the theory that a consumer turns to less-expensive indulgences, such as lipstick, when she (or he) feels less than confident about the future. Therefore, lipstick sales tend to increase during times of economic uncertainty or a recession. This term was coined by Leonard Lauder (chairman of Estee Lauder), who consistently found that during tough economic times his lipstick sales went up. Believe it or not, the indicator has been quite a reliable signal of consumer attitudes over the years. For example, in the months following the Sept 11 terrorist attacks, lipstick sales doubled.

Tuesday, February 26, 2008

Data mining competition season

Those who've been following my postings probably recall "competition season" when all of a sudden there are multiple new interesting datasets out there, each framing a business problem that requires the combination of data mining and creativity.

Two such competitions are the SAS Data Mining Shootout and the 2008 Neural Forecasting Competition. The SAS problem concerns revenue management for an airline that wants to improve its customer satisfaction. The NN5 competition is about forecasting cash withdrawals from ATMs.

Here are the similarities between the two competitions: they both provide real data and reasonably real business problems. Now to a more interesting similarity: they both have time series forecasting tasks. From a recent survey on the popularity of types of data mining techniques, it appears that time series are becoming more and more prominent. They also both require registration in order to get access to the data (I didn't compare their terms of use, but that's another interesting comparison), and welcome any type of modeling. Finally, they are both tied to a conference, where competitors can present their results and methods.

What would be really nice is if, as in KDD, the winners' papers were published online and made publicly available.

Monday, January 28, 2008

Consumer surplus in eBay

A paper that we wrote on "Consumer surplus in online auctions" was recently accepted to the leading journal Information Systems Research. Reuters interviewed us about the paper (Study shows eBay buyers save billions of dollars), which is of special interest these days due to the change in CEO at eBay. Although the economic implications of the paper are interesting and important, the neat methodology is a highlight in itself. So here's what we did:

Consumer surplus is the difference between what a consumer pays and what s/he was willing to pay for an item. eBay can measure the consumer surplus generated in its auctions because it runs second-price auctions. This means that the highest bidder wins, but pays only the second-highest bid. [I'm always surprised to find out that many people, including eBay users, do not know this!]

So generally speaking, eBay has the info on both what a winner paid and what s/he was willing to bid (if we assume that the highest bid reflects the winner's true willingness-to-pay). Adding up all the differences between the highest and second-highest bids, say over a certain year, would then (under some assumptions) give the total consumer surplus generated on eBay in that year. The catch is that eBay makes public all bids in an auction except the highest bid! This is where we came in: we used a website that bids on eBay users' behalf during the last seconds of an auction (called a sniping agent). At the time, this website belonged to our co-author Ravi Bapna, who was the originator of this cool idea. For those users who won an eBay auction, we then had the highest bid!
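In code, the per-auction computation is trivial once the full bid vector is known; here is a tiny Python sketch with made-up bids (the bias adjustments in the paper are, of course, a different story):

def consumer_surplus(bids):
    """Winner's willingness to bid minus the price paid (the second-highest bid)."""
    ordered = sorted(bids)
    return ordered[-1] - ordered[-2]

auctions = [[12.50, 31.00, 27.50], [5.00, 9.99], [102.50, 100.00, 87.00, 95.00]]
print(round(sum(consumer_surplus(b) for b in auctions), 2))   # 3.5 + 4.99 + 2.5 = 10.99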

In short, the beauty of this paper is in its novel use of technology for quantifying an economic value. [Not to mention the intricate statistical modeling to measure and adjust for different biases]. See our paper for details.

Friday, January 25, 2008

New "predictive tools" from Fair Issac

An interesting piece in the Star Tribune, Fair Isaac hopes its new tools lessen lenders' risk of defaults, was sent to me by former student Erik Anderson. Fair Isaac is apparently updating its method for computing FICO scores for 2008. According to the article, "in the next few weeks [Fair Isaac] will roll out a suite of tools designed to predict future default risk". The emphasis is on predicting. In other words, given a database of past credit reports, a model is developed for predicting default risk.

I would be surprised if this is a new methodology. Trying to decipher what really is new is very hard. Erik pointed out the following paragraph (note the huge reported improvement):

"The new tools include revamping the old credit-scoring formula so that it penalizes consumers with a high debt load more than the earlier version. The update, dubbed FICO 08, should increase predictive strength by 5 to 15 percent, according to Fair Isaac's vice president of scoring, Tom Quinn."

So what is new in the 2008 predictor? The inclusion of a new debt-load variable? A different binning of debt into categories? A different way of incorporating debt into the model? A new model altogether? Or maybe simply that the model, based on the most recent data, now includes a parameter estimate for debt load that is much higher than in models based on earlier data.

Wednesday, January 16, 2008

Data Mining goes to Broadway!

Data mining is all about being creative. At one of the recent data mining conferences I recall receiving a T-shirt from one of the vendors with the print "Data Mining Rocks!"

Maybe data mining does have the groove: A data mining class at U Fullerton (undergrad business students), instructed by Ofir Turel, has created "Data Mining - The Musical". Check it out for some wild lyrics.

Cycle plots for time series

In his most recent newsletter, Stephen Few from PerceptualEdge presents a short and interesting article on cycle plots (by Naomi Robbins). These are plots for visualizing time series that bring out both the cyclical and the trend components of a series. Cycle plots were invented by Cleveland, Dunn, and Terpenning in 1978, and they are definitely useful and easy to interpret; yet I have not seen them integrated into any visualization tool. The closest implementation that I've seen (aside from creating them yourself or using one of the macros suggested in the article) is Spotfire DXP's hierarchies. A hierarchy enables one to define time scales embedded within other time scales, such as "day within week within month within year". One can then plot the time series at any level of the hierarchy, thereby supporting the visualization of trends and cycles at different time scales.
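For anyone who wants to roll their own, here is a bare-bones Python/matplotlib sketch (my own construction, on made-up monthly data): each month gets a short line showing its values across the years, plus a horizontal mark at the month's average, so the seasonal cycle and the within-month trend are visible at once.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
years, months = 6, 12
data = (rng.normal(100, 3, (years, months))                    # noise
        + np.linspace(0, 15, years)[:, None]                   # upward trend over the years
        + 10 * np.sin(np.arange(months) * 2 * np.pi / 12))     # seasonal cycle

fig, ax = plt.subplots()
for m in range(months):
    x = m + np.linspace(-0.3, 0.3, years)                      # spread the years within the month
    ax.plot(x, data[:, m], color="steelblue", lw=1)            # within-month trend across years
    ax.hlines(data[:, m].mean(), m - 0.3, m + 0.3, colors="black")  # month average (the cycle)
ax.set_xticks(range(months))
ax.set_xticklabels(list("JFMAMJJASOND"))
ax.set_xlabel("Month")
ax.set_ylabel("Value")
plt.show()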