Saturday, October 18, 2008

Microsoft and the financial downfall

One of the misleading features of Microsoft Office software is that it gives the user the illusion that they are in control of what's visible and what's hidden to readers of the files. One example is copy-pasting from an Excel sheet into a Word or Power Point. If you now double click on the embedded piece you'll see... the Excel file! It is automatically embedded within the Word/Power Point file. A few years ago, after teaching this to MBAs, a student came the following week all excited, telling me how he just detected fraudulent reporting to his company by a contractor. He simply clicked on a pasted Excel chart within the contractor's report written in Word. The embedded Excel file told all the contractor's secrets.

A solution is to "paste special> as picture". But that's only if you know about this!

Another such feature is Excel's "hidden" fields. You can "hide" certain areas on your Excel spreadsheet, but don't be surprised if those areas are not really hidden: Turns out that Barclays Capital just fell in this trap in their proposal of buying the collapsed investment bank Lehman Brothers. This week's article Lehman Excel snafu could cost Barclays dear tells the story of how "a junior law associate at Cleary Gottlieb Steen & Hamilton LLP converted an Excel file into a PDF format document... Some of these details on various trading contracts were marked as hidden because they were not intended to form part of Barclays' proposed deal. However, this "hidden" distinction was ignored during the reformatting process so that Barclays ended up offering to take on an additional 179 contracts as part of its bankruptcy buyout deal".

The moral:
(1) if you have secrets, don't keep them in Microsoft Office.
(2) if you convert your secrets from Microsoft to something safer (like PDF), check the result of the conversion carefully!

Tuesday, October 07, 2008

Sensitivity, specificity, false positive and false negative rates

I recently had an interesting discussion with a few colleagues in Korea regarding the definition of false positive and false negative rates and their relation to sensitivity and specificity. Apparently there is real confusion out there, and if you search the web you'll find conflicting information. So let's sort this out:

Let's assume we have a dataset of bankrupt and solvent firms. We now want to evaluate the performance of a certain model for predicting bankruptcy. Clearly here, the important class is "bankrupt", as the consequences of misclassifying bankrupt firms as solvent are heavier than misclassifying solvent firms as bankrupt. We organize the data in a confusion matrix (aka classification matrix) that crosses actual firm status with predicted status (generated by the model). Say this is the matrix:

In our textbook Data Mining for Business Intelligence we treat the four metrics as two sets of pairs {sensitivity, specificity} and {false positive rate, false negative rate}, each pair measuring a different aspect. Sensitivity and specificity measure the ability of the model to correctly detect the important class (=sensitivity) and its ability to correctly rule out the unimportant class. This definition is apparently not controversial. In the example, the sensitivity would be 201/(201+85) = the proportion of bankrupt firms that the model accurately detects. The model's specificity here is 2689/(2689+25) = the proportion of solvent firms that the model accurately "rules out".

Now to the controversy: We define the false positive rate as the proportion of important class cases incorrectly classified as non-important among all cases predicted as important. In the example the false positive rate would be 25/(201+25). Similarly, the false negative rate is the % of non-important class cases incorrectly classified as important among all cases predicted as non-important (=85/(85+2689). My colleagues, however, disagreed with this definition. According to their definition false positive rate = 1-specificity, and false negative rate = 1-sensitivity.

And indeed, if you search the web you will find conflicting definitions of false positive and negative rates. However, I claim that our definitions are the correct ones. A nice explanation of the difference between the two pairs of metrics is given on p.37 of Chatterjee et al.'s textbook A Casebook for a First Course in Statistics and Data Analysis (a very neat book for beginners, with all ancillaries on Jeff Simonoff's page):

Consider... HIV testing. The standard test is the Wellcome Elisa test. For any diagnostic test...
(1) sensitivity = P(positive test result | person is actually HIV positive)
(2) specificity = P(negative test result | person is actually not HIV positive)

... the sensitivity fo the Elisa test is approximatly .993 (so only 7% of people who are truly HIV positive would have a negative test result), while the specificity is approximately .9999 (so only .01% of the people who are truly HIV-negative would have a positive test result).

That sounds pretty good. However, these are not the only numbers to consider when evaluating the appropriateness of random testing. A person who tests positive is interested in a different conditional probability: P(preson is actually HIV-positive | a positive test result). That is, what porportion of people who test positive actually are HIV positive? If the incidence of the disease is low, most positive results could be false positives.
My colleague Lele at UMD also pointed out that this confusion has caused some havoc in the field of Education as well. Here is a paper that proposes to go as far as creating two separate confusion matrices and using lower and upper case notations to avoid the confusion!