Friday, April 20, 2007

Statistics are not always to blame!

My current MBA student Brenda Martineau showed me a March 15, 2007 article in the Wall Street Journal entitled "Stupid Cancer Statistics". It almost makes you think that, once again, someone is abusing statistics -- but wait! A closer look reveals that the real culprit is not the "mathematical models", but rather the variable that is being measured and analyzed!

According to the article, the main fault lies in measuring (and modeling) the mortality rate in order to determine the usefulness of early lung cancer screening. Patients who get diagnosed early (before the cancer escapes the lung) do not necessarily live longer than those who do not get diagnosed. But their quality of life is much improved. Therefore, the author explains, the real measure should be quality of life. If I understand this correctly, it really has nothing to do with "faulty statistics", but rather with the choice of measurement to analyze!

In short, popular as the habit may be, you can't blame the statistical models all the time...


Scott said...

I found this post particularly interesting because I spent part of my summer internship evaluating the clinical endpoints of major oncology clinical trials. The data I looked at was far less concerned with mortality, siding in favor of endpoints like Progression Free Survival (PFS), Disease Free Survival (DFS), Time to Progression (TTP), Overall Survival, and Objective Response Rate (ORR), to name a few. The nearly countless nuances that can skew each of these endpoints, rendering them "not statistically validated as surrogates for survival in all settings" according to the FDA, can arise from prior treatments, differences in the quality of the diagnostic device/kit (the reproducibility, reliability, and interpretation of diagnostic data is a whole other bag of worms in itself), or improvements in supportive care.

Even more than picking the right variables, the process is complicated by the extremely difficult nature of running and analyzing oncology survival studies that include long-term follow-up periods. One must also keep in mind the toxicity profile of a particular therapeutic, as well as its cost (something that has skyrocketed with the advent of molecular targeted therapies like Avastin, Rituxan, and Herceptin, each of which can cost a patient upwards of $100,000 per year). Efficacy and cost are two extremely important factors when evaluating a treatment on the basis of QALY (quality-adjusted life year), a measure used in the UK to capture treatment effectiveness and overall cost-effectiveness.
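To make the QALY idea concrete, here is a minimal sketch of how a quality-adjusted life year and a cost-per-QALY figure are computed. All numbers below are hypothetical, chosen only for illustration; they do not come from any actual trial or from the article.

```python
# A QALY weights years of life by a quality-of-life factor between
# 0 (death) and 1 (perfect health).

def qalys(years, quality_weight):
    """Quality-adjusted life years for a period lived at a given quality weight."""
    return years * quality_weight

# Hypothetical scenario: a new therapy extends survival from 2.0 to 2.5
# years and improves the quality weight from 0.6 to 0.7.
qaly_new = qalys(2.5, 0.7)   # 1.75 QALYs
qaly_std = qalys(2.0, 0.6)   # 1.20 QALYs

# Incremental cost-effectiveness ratio (ICER): extra cost per extra QALY
# gained. Costs are again hypothetical.
cost_new, cost_std = 250_000, 50_000
icer = (cost_new - cost_std) / (qaly_new - qaly_std)
print(round(icer))  # cost in dollars per QALY gained
```

Bodies like the UK's NICE compare such cost-per-QALY figures against a threshold when judging whether a treatment's effectiveness justifies its cost.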

On a side note, I did stumble upon a 2008 on-line article titled "Data mining of cancer vaccine trials: a bird's-eye view". To quote their conclusion: "We have developed a data mining approach that enables rapid extraction of complex data from the major clinical trial repository. Summarization and visualization of these data represents a cost-effective means of making informed decisions about future cancer vaccine clinical trials." This is of particular interest given some of the recent news on the cancer vaccine front with regard to Gardasil (HPV) and, more recently, Dendreon's Provenge (prostate cancer).

Galit Shmueli said...

Thanks for your comment, Scott. Deciding what to measure is such an important step in medical studies, as in other fields (e.g., economics). Measuring "quality of life" has so many aspects to it, as the concept of Gross National Happiness implies. For instance, it was found that happiness is correlated with perceived health but uncorrelated with actual health! (See former Harvard president Derek Bok interviewed by Charlie Rose.)

The article that you mentioned highlights an interesting use of the term "data mining": It focuses solely on data extraction and visualization, without any supervised or unsupervised algorithms/methods being used. Although I like the study (nice use of Spotfire!), it really falls more under "business intelligence" than "data mining".