One of the misleading features of Microsoft Office software is that it gives users the illusion that they control what is visible and what is hidden from readers of their files. One example is copy-pasting from an Excel sheet into a Word or PowerPoint document. If you then double-click on the pasted piece you'll see... the Excel file! It is automatically embedded within the Word/PowerPoint file. A few years ago, after I taught this to MBAs, a student came back the following week all excited, telling me how he had just detected fraudulent reporting to his company by a contractor. He simply double-clicked on a pasted Excel chart within the contractor's report, written in Word. The embedded Excel file told all the contractor's secrets.
A solution is to use "Paste Special > As Picture", but only if you know this option exists!
Another such feature is Excel's "hidden" fields. You can "hide" certain areas of your Excel spreadsheet, but don't be surprised if those areas are not really hidden: it turns out that Barclays Capital just fell into this trap in its proposal to buy the collapsed investment bank Lehman Brothers. This week's article Lehman Excel snafu could cost Barclays dear tells the story of how "a junior law associate at Cleary Gottlieb Steen & Hamilton LLP converted an Excel file into a PDF format document... Some of these details on various trading contracts were marked as hidden because they were not intended to form part of Barclays' proposed deal. However, this "hidden" distinction was ignored during the reformatting process so that Barclays ended up offering to take on an additional 179 contracts as part of its bankruptcy buyout deal".
The moral:
(1) if you have secrets, don't keep them in Microsoft Office.
(2) if you convert your secrets from Microsoft to something safer (like PDF), check the result of the conversion carefully!
Saturday, October 18, 2008
Tuesday, October 07, 2008
Sensitivity, specificity, false positive and false negative rates
I recently had an interesting discussion with a few colleagues in Korea regarding the definition of false positive and false negative rates and their relation to sensitivity and specificity. Apparently there is real confusion out there, and if you search the web you'll find conflicting information. So let's sort this out:
Let's assume we have a dataset of bankrupt and solvent firms. We now want to evaluate the performance of a certain model for predicting bankruptcy. Clearly here, the important class is "bankrupt", as the consequences of misclassifying bankrupt firms as solvent are heavier than misclassifying solvent firms as bankrupt. We organize the data in a confusion matrix (aka classification matrix) that crosses actual firm status with predicted status (generated by the model). Say this is the matrix:

 | Predicted "bankrupt" | Predicted "solvent"
Actual "bankrupt" | 201 | 85
Actual "solvent" | 25 | 2689
In our textbook Data Mining for Business Intelligence we treat the four metrics as two pairs, {sensitivity, specificity} and {false positive rate, false negative rate}, each pair measuring a different aspect. Sensitivity and specificity measure the ability of the model to correctly detect the important class (=sensitivity) and to correctly rule out the unimportant class (=specificity). This definition is apparently not controversial. In the example, the sensitivity is 201/(201+85) = 0.70, the proportion of bankrupt firms that the model correctly detects. The model's specificity is 2689/(2689+25) = 0.99, the proportion of solvent firms that the model correctly "rules out".
Now to the controversy: We define the false positive rate as the proportion of cases predicted as important ("positive") that actually belong to the unimportant class, i.e., false positives among all positive predictions. In the example the false positive rate is 25/(201+25) = 0.11. Similarly, the false negative rate is the proportion of cases predicted as non-important ("negative") that actually belong to the important class, i.e., false negatives among all negative predictions (= 85/(85+2689) = 0.03). My colleagues, however, disagreed with this definition. According to their definition, false positive rate = 1 - specificity, and false negative rate = 1 - sensitivity.
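To make the two competing definitions concrete, here is a minimal sketch (in Python, using the counts from the confusion matrix above) that computes both pairs of metrics and shows that the two versions of the false positive/negative rates give quite different numbers:

```python
# Counts from the confusion matrix above (important class = "bankrupt")
TP = 201   # actual bankrupt, predicted bankrupt
FN = 85    # actual bankrupt, predicted solvent
FP = 25    # actual solvent,  predicted bankrupt
TN = 2689  # actual solvent,  predicted solvent

sensitivity = TP / (TP + FN)    # 201/286   ~ 0.70
specificity = TN / (TN + FP)    # 2689/2714 ~ 0.99

# Our (textbook) definitions: conditioned on the PREDICTED class
fpr_ours = FP / (TP + FP)       # 25/226    ~ 0.11
fnr_ours = FN / (FN + TN)       # 85/2774   ~ 0.03

# My colleagues' definitions: conditioned on the ACTUAL class
fpr_theirs = 1 - specificity    # ~ 0.01
fnr_theirs = 1 - sensitivity    # ~ 0.30

print(f"sensitivity = {sensitivity:.3f}, specificity = {specificity:.3f}")
print(f"false positive rate: ours = {fpr_ours:.3f}, theirs = {fpr_theirs:.3f}")
print(f"false negative rate: ours = {fnr_ours:.3f}, theirs = {fnr_theirs:.3f}")
```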
And indeed, if you search the web you will find conflicting definitions of false positive and negative rates. However, I claim that our definitions are the correct ones. A nice explanation of the difference between the two pairs of metrics is given on p.37 of Chatterjee et al.'s textbook A Casebook for a First Course in Statistics and Data Analysis (a very neat book for beginners, with all ancillaries on Jeff Simonoff's page):
Consider... HIV testing. The standard test is the Wellcome Elisa test. For any diagnostic test...
(1) sensitivity = P(positive test result | person is actually HIV positive)
(2) specificity = P(negative test result | person is actually not HIV positive)
... the sensitivity of the Elisa test is approximately .993 (so only 0.7% of people who are truly HIV-positive would have a negative test result), while the specificity is approximately .9999 (so only .01% of the people who are truly HIV-negative would have a positive test result).
That sounds pretty good. However, these are not the only numbers to consider when evaluating the appropriateness of random testing. A person who tests positive is interested in a different conditional probability: P(person is actually HIV-positive | a positive test result). That is, what proportion of people who test positive actually are HIV positive? If the incidence of the disease is low, most positive results could be false positives.
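To put numbers on that last sentence: using the sensitivity and specificity quoted above, plus an assumed (purely illustrative) prevalence of 1 in 10,000, Bayes' rule says that only about half of the people who test positive are actually HIV-positive:

```python
# Illustrative sketch; the prevalence below is an assumption, not a quoted figure
prevalence = 0.0001    # assumed: 1 in 10,000 people is HIV-positive
sensitivity = 0.993    # P(positive test | HIV-positive), from the quote above
specificity = 0.9999   # P(negative test | HIV-negative), from the quote above

# Bayes' rule: P(HIV-positive | positive test)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
ppv = sensitivity * prevalence / p_positive

print(f"P(HIV-positive | positive test) = {ppv:.2f}")  # ~0.50
```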
Convinced?

My colleague Lele at UMD also pointed out that this confusion has caused some havoc in the field of Education as well. Here is a paper that proposes going as far as creating two separate confusion matrices and using lower- and upper-case notation to avoid the confusion!
Monday, September 22, 2008
Dr. Doom and data mining
Last month The New York Times featured an article about Dr. Doom, an economics professor: "Roubini, a respected but formerly obscure academic, has become a major figure in the public debate about the economy: the seer who saw it coming."
This article caught my statistician eye due to the description of "data" and "models". While economists in the article portray Roubini as not using data and econometric models, a careful read shows that he actually does use data and models, but perhaps unusual data and unusual models!
Here are two interesting quotes:
“When I weigh evidence,” he told me, “I’m drawing on 20 years of accumulated experience using models” — but his approach is not the contemporary scholarly ideal in which an economist builds a model in order to constrain his subjective impressions and abide by a discrete set of data.

Later on, Roubini is quoted:
"After analyzing the markets that collapsed in the ’90s, Roubini set out to determine which country’s economy would be the next to succumb to the same pressures."This might not be data mining per-se, but note that Roubini's approach is at heart similar to the data mining approach: looking at unusual data (here, taking an international view rather than focus on national only) and finding patterns within them that predict economic downfalls. In a standard data mining framework we would of course include also all those markets that have not-collapsed, and then set up the problem as a "direct marketing" problem: who is most likely to fall?
A final note: As a strong believer in the difference between the goals of explaining and forecasting, I think that econometricians should stop limiting their modeling to explanatory, causality-based models. Good forecasters might not be revealing in terms of causality, but in many cases their forecasts will be far more accurate than those from explanatory models!
Wednesday, September 03, 2008
Data conversion and open-source software
Recently I was trying to open a data file that was created in the statistical software SPSS. SPSS is widely used in the social sciences (a competitor to SAS), and appears to have some ground here in Bhutan. With the slow and erratic internet connection in Bhutan, though, I have failed again and again to use the software through our school's portal. Finding the local SPSS representative seemed a bit surreal, so I set off to solve the problem another way.
First stop: Googling "convert .sav to .csv" led me nowhere. SPSS and SAS both have an annoying "feature" of keeping data in file formats that are very hard to convert. A few software packages now import data from SAS databases, but I was unable to find one that imports from SPSS. This led me to a surprising finding: PSPP. Yes, that's right: PSPP, previously known as FIASCO, is an open-source "free replacement for the proprietary program, SPSS." The latest version even boasts a graphical user interface. Another interesting feature is described as "Fast statistical procedures, even on very large data sets."
My problem hasn't been solved yet, because downloading PSPP and the required Cygwin software poses a challenge with my narrow bandwidth... Thus, I cannot report on the usefulness of PSPP. I'd be interested in hearing from others who have tested/used it!
Monday, August 25, 2008
Simpson's Paradox in Bhutan
This year I am on academic sabbatical, hence the lower rate of postings. Moreover, postings this year might have an interesting twist, since I am in Bhutan volunteering at an IT Institute. As part of the effort, I am conducting workshops on various topics at the interface of IT and data analysis. IT is in its infancy here in Bhutan, which makes me assess and use IT very differently than I am used to.
My first posting is about Simpson's paradox arising in a Bhutanese context (I will post separately on Simpson's Paradox in the future): The Bhutan Survey of Standards of Living, conducted by the Bhutan National Statistics Bureau, reports statistics on family size, gender of the head of family, and rural/urban location. Let's consider the question of whether family planning policies should be aimed separately at female- vs. male-headed families. I was able to assemble the following pivot table from their online report:

[Pivot table: average household size by gender of the head of household (female/male) and by area (urban/rural), with overall column averages of 4.9985 for female-headed and 5.027 for male-headed households]
Now, note the column marginals, where it appears that the average household size is practically identical for female-headed (4.9985) and male-headed (5.027) households. If you only sliced the data by the gender of the head of family, you might conclude that the same family planning policy should be used in both cases. Now examine the figures broken down by urban/rural: female-headed households are on average smaller than male-headed households in both urban and rural areas! Thus, family planning policies seem to need stronger (or at least different) targeting at male-headed households!
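To see how the numbers can even out (or even flip) in the aggregate, here is a small sketch with purely hypothetical figures (not the Bureau's actual numbers): rural households are larger than urban ones, and female-headed households are disproportionately rural, so female-headed households can come out larger overall even though they are smaller within each area:

```python
# Hypothetical averages and household counts, for illustration only
avg = {("urban", "female"): 4.0, ("urban", "male"): 4.5,
       ("rural", "female"): 5.2, ("rural", "male"): 5.6}   # average household size
n   = {("urban", "female"): 100, ("urban", "male"): 500,
       ("rural", "female"): 900, ("rural", "male"): 500}   # number of households

for head in ("female", "male"):
    people     = sum(avg[(area, head)] * n[(area, head)] for area in ("urban", "rural"))
    households = sum(n[(area, head)] for area in ("urban", "rural"))
    print(head, round(people / households, 2))
# female 5.08, male 5.05: the overall ordering flips, even though female-headed
# households are smaller than male-headed ones in BOTH urban and rural areas
```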
If you are not familiar with Simpson's Paradox you might be puzzled. I will write about the inner workings of this so-called paradox in the near future. Until then, check out Wikipedia...

Wednesday, June 11, 2008
Summaries or graphs?
Herb Edelstein from Two Crows Consulting introduced me to this neat example showing how graphs are much more revealing than summary statistics. This is an age-old example by Anscombe (1973). I will show a slightly updated version of Anscombe's example, by Basset et al. (1986):
We have four datasets, each containing 11 pairs of X and Y measurements. All four datasets have the same X variable and differ only in the Y values.
Here are the summary statistics for each of the four Y variables (A, B, C, D):
 | A | B | C | D
Average | 20.95 | 20.95 | 20.95 | 20.95
Std | 1.495794 | 1.495794 | 1.495794 | 1.495794
That's right: the means and standard deviations are all identical. Now let's go one step further and fit the four simple linear regression models Y = a + bX + noise. Remember, X is the same in all four datasets. Here is the output for the first dataset:
Regression Statistics
Multiple R | 0.620844098
R Square | 0.385447394
Adjusted R Square | 0.317163771
Standard Error | 1.236033081
Observations | 11

 | Coefficients | Standard Error | t Stat | P-value
Intercept | 18.43 | 1.12422813 | 16.39347 | 5.2E-08
Slope | 0.28 | 0.11785113 | 2.375879 | 0.041507
Guess what? The other three regression outputs are identical!
So are the four Y variables identical???
Well, here is the answer:

[Scatterplots of Y vs. X for the four datasets A, B, C, D]
To top it off, Basset et al. included one more dataset that has the exact same summary stats and regression estimates. Here is the scatterplot:

[Scatterplot of Y vs. X for the fifth dataset]
[You can find all the data for both Anscombe's and Basset et al.'s examples here]
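For readers who want to reproduce the effect themselves, here is a short sketch (in Python) using Anscombe's original 1973 quartet rather than Basset et al.'s data, which are in the link above; note that in Anscombe's quartet the fourth dataset also has a different X. The fitted lines and summary statistics come out essentially identical across the four datasets, while the scatterplots tell four very different stories:

```python
import numpy as np
import matplotlib.pyplot as plt

# Anscombe's (1973) quartet: datasets I-III share the same X, dataset IV does not
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8]*7 + [19] + [8]*3,
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
for ax, (name, (x, y)) in zip(axes.flat, datasets.items()):
    x, y = np.array(x, float), np.array(y, float)
    slope, intercept = np.polyfit(x, y, 1)            # simple linear regression
    print(f"{name}: mean(y) = {y.mean():.2f}, sd(y) = {y.std(ddof=1):.2f}, "
          f"fit: y = {intercept:.2f} + {slope:.2f} x")
    ax.scatter(x, y)                                   # the graph tells the real story
    xs = np.linspace(x.min(), x.max(), 2)
    ax.plot(xs, intercept + slope * xs)
    ax.set_title(f"Dataset {name}")
plt.tight_layout()
plt.show()
```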
Tuesday, June 10, 2008
Resources for instructors of data mining courses in b-schools
With the increasing popularity of data mining courses being offered in business schools (at the MBA and undergraduate levels), a growing number of faculty are becoming involved. Instructors come from diverse backgrounds: statistics, information systems, machine learning, management science, marketing, and more.
Since our textbook Data Mining for Business Intelligence came out, I've received requests from many instructors to share materials, information, and other resources. At last, I have launched BZST Teaching, a forum for instructors teaching data mining in b-schools. The forum is open only to instructors, and a host of materials and communication options are available. It is founded on the idea of sharing, with the expectation that members will contribute to the pool.
If you are interested in joining, please email me directly.