Let's assume we have a dataset of bankrupt and solvent firms. We now want to evaluate the performance of a certain model for predicting bankruptcy. Clearly, the important class here is "bankrupt", since the consequences of misclassifying bankrupt firms as solvent are heavier than those of misclassifying solvent firms as bankrupt. We organize the data in a confusion matrix (aka classification matrix) that crosses actual firm status with predicted status (generated by the model). Say this is the matrix:

                    Predicted bankrupt   Predicted solvent
Actual bankrupt            201                   85
Actual solvent              25                 2689
In our textbook Data Mining for Business Intelligence we treat the four metrics as two pairs, {sensitivity, specificity} and {false positive rate, false negative rate}, each pair measuring a different aspect. Sensitivity and specificity measure the ability of the model to correctly detect the important class (=sensitivity) and to correctly rule out the unimportant class (=specificity). These definitions are apparently not controversial. In the example, the sensitivity would be 201/(201+85) ≈ 0.70, the proportion of bankrupt firms that the model accurately detects. The model's specificity here is 2689/(2689+25) ≈ 0.99, the proportion of solvent firms that the model accurately "rules out".
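To make the arithmetic concrete, here is a minimal Python sketch of the two calculations; the variable names (tp, fn, fp, tn) are my shorthand, not notation from the textbook.

```python
# Counts from the confusion matrix above
tp = 201   # bankrupt firms correctly classified as bankrupt
fn = 85    # bankrupt firms misclassified as solvent
fp = 25    # solvent firms misclassified as bankrupt
tn = 2689  # solvent firms correctly classified as solvent

sensitivity = tp / (tp + fn)  # proportion of bankrupt firms the model detects
specificity = tn / (tn + fp)  # proportion of solvent firms the model rules out

print(f"sensitivity = {sensitivity:.3f}")  # ~0.703
print(f"specificity = {specificity:.3f}")  # ~0.991
```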
Now to the controversy: We define the false positive rate as the proportion of cases classified as important that actually belong to the non-important class. In the example, the false positive rate would be 25/(201+25) ≈ 0.11. Similarly, the false negative rate is the proportion of cases classified as non-important that actually belong to the important class (= 85/(85+2689) ≈ 0.03). My colleagues, however, disagreed with this definition. According to their definition, false positive rate = 1 - specificity, and false negative rate = 1 - sensitivity.
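The disagreement is easiest to see side by side. The short sketch below computes both versions from the same counts; the "prediction-based" and "actual-class-based" labels are mine, used only to tell the two conventions apart.

```python
# Counts from the confusion matrix above
tp, fn, fp, tn = 201, 85, 25, 2689

# Our definition: condition on the *predicted* class
fpr_prediction_based = fp / (fp + tp)  # 25/226  ~ 0.111
fnr_prediction_based = fn / (fn + tn)  # 85/2774 ~ 0.031

# My colleagues' definition: condition on the *actual* class
fpr_actual_based = fp / (fp + tn)      # = 1 - specificity ~ 0.009
fnr_actual_based = fn / (fn + tp)      # = 1 - sensitivity ~ 0.297

print(fpr_prediction_based, fpr_actual_based)
print(fnr_prediction_based, fnr_actual_based)
```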
And indeed, if you search the web you will find conflicting definitions of false positive and negative rates. However, I claim that our definitions are the correct ones. A nice explanation of the difference between the two pairs of metrics is given on p.37 of Chatterjee et al.'s textbook A Casebook for a First Course in Statistics and Data Analysis (a very neat book for beginners, with all ancillaries on Jeff Simonoff's page):
Consider... HIV testing. The standard test is the Wellcome Elisa test. For any diagnostic test...
(1) sensitivity = P(positive test result | person is actually HIV positive)
(2) specificity = P(negative test result | person is actually not HIV positive)
... the sensitivity of the Elisa test is approximately .993 (so only 0.7% of people who are truly HIV-positive would have a negative test result), while the specificity is approximately .9999 (so only .01% of the people who are truly HIV-negative would have a positive test result).
That sounds pretty good. However, these are not the only numbers to consider when evaluating the appropriateness of random testing. A person who tests positive is interested in a different conditional probability: P(person is actually HIV-positive | a positive test result). That is, what proportion of people who test positive actually are HIV positive? If the incidence of the disease is low, most positive results could be false positives.
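To see why, here is a quick Bayes' rule sketch that plugs in the sensitivity and specificity quoted above; the prevalence values it loops over are assumed, purely illustrative figures, not numbers from the book.

```python
# P(HIV+ | positive test) via Bayes' rule, using the Elisa figures quoted above
sensitivity = 0.993    # P(positive test | HIV+)
specificity = 0.9999   # P(negative test | HIV-)

# Assumed, illustrative prevalence values
for prevalence in (0.01, 0.001, 0.0001, 0.00001):
    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    ppv = sensitivity * prevalence / p_positive
    print(f"prevalence {prevalence:.5f}: P(HIV+ | positive test) = {ppv:.2f}")
# At a prevalence of 1 in 10,000, only about half of the positives are true positives,
# and at 1 in 100,000 most positives are false positives.
```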
Convinced?

My colleague Lele at UMD also pointed out that this confusion has caused some havoc in the field of Education as well. Here is a paper that proposes to go as far as creating two separate confusion matrices and using lower and upper case notations to avoid the confusion!