Thursday, November 21, 2013

The Scientific Value of Testing Predictive Performance

This week's NY Times article "Risk Calculator for Cholesterol Appears Flawed" and CNN article "Does calculator overstate heart attack risk?" illustrate the power of evaluating a model's predictive performance as a way of validating the underlying theory.

The NYT article describes findings by two Harvard Medical School professors, Ridker and Cook, about extreme over-estimation of the 10-year risk of a heart attack or stroke when using a calculator released by the American Heart Association and the American College of Cardiology.
"According to the new guidelines, if a person's risk is above 7.5%, he or she should be put on a statin." (CNN article)
Over-estimation in this case is likely to lead to over-prescription of therapies such as cholesterol-lowering statin drugs, not to mention the psychological toll of being classified as high risk for a heart attack or stroke.

How was this over-prediction discovered? 
"Dr. Ridker and Dr. Cook evaluated [the calculator] using three large studies that involved thousands of people and continued for at least a decade. They knew the subjects’ characteristics at the start — their ages, whether they smoked, their cholesterol levels, their blood pressures. Then they asked how many had heart attacks or strokes in the next 10 years and how many would the risk calculator predict."
In other words, the "model" (the calculator) was applied to a large labeled dataset, and the actual and predicted rates of heart attacks were compared. This is the classic "holdout set" approach. The results are nicely shown in the article's chart, which overlays the actual and predicted values in histograms of risk (a sketch of this bin-by-bin comparison appears after the chart):

Chart from NY Times article 
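To make the holdout-set comparison concrete, here is a minimal Python sketch of the kind of bin-by-bin check shown in the chart. The data frame and column names (cohort, pred_risk, event) are hypothetical, assuming we have each subject's baseline predicted risk from the calculator and their observed 10-year outcome.

import pandas as pd

def calibration_by_risk_bin(cohort, n_bins=5):
    # cohort: DataFrame with hypothetical columns 'pred_risk' (the calculator's
    # predicted 10-year risk, on a 0-1 scale) and 'event' (1 if a heart attack
    # or stroke occurred within 10 years, else 0)
    df = cohort.copy()
    # group subjects into bins of predicted risk, as in the article's chart
    df["risk_bin"] = pd.qcut(df["pred_risk"], q=n_bins, duplicates="drop")
    summary = df.groupby("risk_bin", observed=True).agg(
        n=("event", "size"),
        predicted_rate=("pred_risk", "mean"),  # average predicted risk in the bin
        actual_rate=("event", "mean"),         # observed 10-year event rate
    )
    # a ratio well above 1 means the calculator over-predicts risk in that bin
    summary["over_prediction_ratio"] = summary["predicted_rate"] / summary["actual_rate"]
    return summary

Running calibration_by_risk_bin on a holdout cohort would show, bin by bin, whether the predicted rates track the rates actually observed, which is exactly the comparison Ridker and Cook report.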

Beyond the practical usefulness of detecting the flaw in the calculator, evaluating predictive performance tells us something about the underlying model. A natural next question is "why?", that is, how was the calculator/model built?

The NYT article quotes Dr. Smith, a professor of medicine at the University of North Carolina and a past president of the American Heart Association:
“a lot of people put a lot of thought into how can we identify people who can benefit from therapy... What we have come forward with represents the best efforts of people who have been working for five years.”
Although this statement seems to imply that the guidelines are based on an informal, qualitative integration of domain knowledge and experience, I am guessing (and hoping) that there is a sound data-based model behind the scenes. The fact that the calculator uses very few, coarse predictors makes me suspect that the model was not designed or optimized for "personalized medicine".

One reason mentioned for the model's extreme over-prediction on the data from the three studies is the difference between the population used to "train the calculator" (that is, to generate the guidelines) and the population in the evaluation studies, in terms of the relationship between heart attacks/strokes and the risk factors:
"The problem might have stemmed from the fact that the calculator uses as reference points data collected more than a decade ago, when more people smoked and had strokes and heart attacks earlier in life. For example, the guideline makers used data from studies in the 1990s to determine how various risk factors like cholesterol levels and blood pressure led to actual heart attacks and strokes over a decade of observation.
But people have changed in the past few decades, Dr. Blaha said. Among other things, there is no longer such a big gap between women’s risks and those of men at a given age. And people get heart attacks and strokes at older ages."
In predictive analytics, we know that the biggest and sneakiest danger to predictive power arises when the training data and conditions differ from the data and conditions at the time of model deployment. While there is no magic bullet, a few principles and strategies can help: first, awareness of this weakness; second, monitoring and evaluating predictive power across different scenarios (robustness/sensitivity analysis) and over time; third, re-training models over time as new data arrive.
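As a rough illustration of the monitoring idea (not the AHA/ACC procedure), here is a small Python sketch that tracks a deployed risk model's calibration and discrimination by period. The scored data frame and its columns (year, pred_risk, event) are assumed for the example.

import pandas as pd
from sklearn.metrics import roc_auc_score

def monitor_by_period(scored):
    # scored: DataFrame with hypothetical columns 'year' (deployment period),
    # 'pred_risk' (predicted probability) and 'event' (observed outcome, 0/1)
    rows = []
    for year, grp in scored.groupby("year"):
        rows.append({
            "year": year,
            "n": len(grp),
            "predicted_rate": grp["pred_risk"].mean(),  # calibration-in-the-large
            "actual_rate": grp["event"].mean(),
            "auc": roc_auc_score(grp["event"], grp["pred_risk"]),  # discrimination
        })
    return pd.DataFrame(rows)

A widening gap between predicted_rate and actual_rate, or a drifting AUC, is a signal that the population has changed and the model is due for re-training.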

Evaluating predictive performance is a very powerful tool. We learn not only how well a model actually predicts, but also gain clues about the strengths and weaknesses of the underlying model.
