Monday, October 17, 2011

Early detection of what?

The interest in using pre-diagnostic data for the early detection of disease outbreaks has evolved in interesting ways over the last 10 years. In the early 2000s, I was involved in an effort to explore the potential of non-traditional data sources, such as over-the-counter pharmacy sales and web searches on medical websites, to give earlier signs of a disease outbreak than confirmed diagnostic data (lab tests, doctor diagnoses, etc.). The pre-diagnostic data sources that we looked at were not only expected to show an outbreak's footprint earlier than traditional diagnostic data; they were also collected at higher frequency (typically daily) than the weekly or even less frequent diagnostic data, and were made available with much less lag time. The general conclusion was that there was indeed potential to improve detection time using such data (and to that end we investigated and developed adequate data analytic methods). Evaluation was based on simulating outbreak footprints, which is a challenge in itself (what does a flu outbreak look like in pharmacy sales?), and on examining past data with known outbreaks (where there is often no consensus on the outbreak start date) -- for papers on these issues see here.

A few years ago, Google came out with Google Flu Trends, which monitors web searches for terms related to flu, under the assumption that people experiencing flu-like symptoms (or their relatives, friends, etc.) would search for related terms on the web. Google compared its performance to the weekly diagnostic data from the Centers for Disease Control and Prevention (CDC). In a joint paper by Google and CDC researchers, they claimed:
we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day. (also published in Nature)
[Figure: Blue = Google flu estimate; Orange = CDC data]
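The underlying model is, roughly, a linear fit on the log-odds scale, relating the fraction of flu-related search queries to the CDC's ILI (influenza-like illness) visit rate. Here is a minimal sketch of that idea, using synthetic data and illustrative variable names -- not Google's actual pipeline or query lists:

```python
import numpy as np

def logit(p):
    """Log-odds transform, applied to both sides of the model."""
    return np.log(p / (1 - p))

def fit_logit_linear(query_fraction, ili_rate):
    """Fit logit(ILI rate) = b0 + b1 * logit(query fraction) by least squares."""
    x = logit(np.asarray(query_fraction))
    y = logit(np.asarray(ili_rate))
    b1, b0 = np.polyfit(x, y, 1)  # slope first, then intercept
    return b0, b1

def estimate_ili(query_fraction, b0, b1):
    """Invert the logit to turn a new query fraction into an ILI estimate."""
    z = b0 + b1 * logit(np.asarray(query_fraction))
    return 1 / (1 + np.exp(-z))

# Synthetic illustration: weekly query volume loosely tracks the true ILI rate.
rng = np.random.default_rng(0)
ili = np.clip(0.02 + 0.015 * np.sin(np.linspace(0, 3, 52))
              + rng.normal(0, 0.002, 52), 1e-4, 0.99)
queries = np.clip(1.5 * ili + rng.normal(0, 0.003, 52), 1e-4, 0.99)

b0, b1 = fit_logit_linear(queries, ili)
est = estimate_ili(queries, b0, b1)
```

The appeal of such a model is exactly what the quote above claims: once fitted, the estimate needs only query data, which arrives with about a one-day lag rather than the CDC's weekly reporting cycle.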

What can you do if you have an early alert of a disease outbreak? The information can be used for stockpiling medicines, vaccination plans, raising public awareness, preparing hospitals, and more. Now comes the interesting part: recently, there has been criticism of the Google Flu Trends claims, saying that "while Google Flu Trends is highly correlated with rates of [Influenza-like illness], it has a lower correlation with actual influenza tests positive". In other words, Google detects not a flu outbreak, but rather a perception of flu. Does this mean that Google Flu Trends is useless? Absolutely not. It just means that the goal and the analysis results must be aligned more carefully. As the Popular Mechanics blog writes:
Google Flu Trends might, however, provide some unique advantages precisely because it is broad and behavior-based. It could help keep track of public fears over an epidemic 
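The criticism can be made concrete numerically: the very same estimate can correlate strongly with one target series (perception-driven ILI visits) and weakly with another (lab-confirmed influenza). A toy sketch with entirely synthetic series -- all numbers invented for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
weeks = np.arange(52)

# Synthetic "perceived flu" signal that drives both searches and doctor visits.
perception = np.sin(weeks / 8.0) + rng.normal(0, 0.1, 52)
# Lab-confirmed flu only partially follows perception
# (media scares can inflate perception without any real outbreak).
confirmed = 0.4 * perception + rng.normal(0, 0.5, 52)

ili_visits = perception + rng.normal(0, 0.1, 52)       # stand-in for CDC ILI rates
search_estimate = perception + rng.normal(0, 0.1, 52)  # stand-in for Flu Trends

r_ili = np.corrcoef(search_estimate, ili_visits)[0, 1]
r_lab = np.corrcoef(search_estimate, confirmed)[0, 1]
# r_ili comes out much higher than r_lab: the estimate tracks
# perception-driven visits far better than lab-confirmed influenza.
```

The two correlations answer two different questions of interest -- which is precisely the alignment issue discussed next.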
Aligning the question of interest with the data (and analysis method) is related to what Ron Kenett and I call "Information Quality", or "the potential of a dataset to answer a question of interest using a given data analysis method". In the early disease detection problem, the lesson is that diagnostic and pre-diagnostic data should not just be considered two different data sets (monitored perhaps with different statistical methods), but they also differ fundamentally in terms of the questions they can answer.

Saturday, October 01, 2011

Language and psychological state: explain or predict?

Quite a few of my social science colleagues think that predictive modeling is not a kosher tool for theory building. In our 2011 MISQ paper "Predictive Analytics in Information Systems Research" we argue that predictive modeling has a critical role to play not only in theory testing but also in theory building. How does it work? Here's an interesting example:

The new book The Secret Life of Pronouns by the cognitive psychologist Pennebaker is a fascinating read in many ways. The book describes how analysis of written language can be predictive of psychological state. In particular, the author describes an interesting text mining approach that analyzes text written by a person and creates a psychological profile of the writer. In the author's context, the approach is used to study the effect of writing on recovery from psychological trauma. You can get a taste of word analysis on the website, run by the author and his colleagues, which analyzes the personality of a tweeter.

In the book, Pennebaker describes how the automated analysis of language has shed light on the probability that people who underwent psychological trauma will recuperate. For instance, people who used a moderate amount of negative language were more likely to improve than those who used too little or too much negative language. Or, people who tended to change perspectives in their writing over time (from "I" to "they" or "we") were more likely to improve.
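Analyses of this kind rest on simple word-category counts: the rate of words from a category dictionary per words written. A minimal sketch of the perspective-shift measurement described above, with tiny illustrative word lists (the real LIWC-style dictionaries are far larger, and the book's analyses are more elaborate):

```python
import re

# Tiny illustrative word lists; actual category dictionaries are much larger.
FIRST_SINGULAR = {"i", "me", "my", "mine", "myself"}
OTHER_PERSPECTIVE = {"we", "us", "our", "they", "them", "their"}

def pronoun_rates(text):
    """Return (first-person-singular rate, we/they rate) per 100 words."""
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words) or 1  # avoid division by zero on empty text
    fps = sum(w in FIRST_SINGULAR for w in words)
    other = sum(w in OTHER_PERSPECTIVE for w in words)
    return 100.0 * fps / n, 100.0 * other / n

early = "I could not sleep. My thoughts kept returning to what happened to me."
late = "We talked about it, and they helped us see it differently."
# A shift from "I" toward "we"/"they" across writing sessions is the kind of
# perspective change the book links to better recovery.
```

Comparing the rates for the two invented samples above shows the first-person-singular rate dropping and the we/they rate rising -- the signature of a perspective shift.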

Now comes a key question. In the words of the author (p.14): "Do words reflect a psychological state or do they cause it?". The statistical text mining application is obviously a predictive tool built on correlations/associations. Yet, by examining when it predicts accurately and studying the reasons for the accurate (or inaccurate) predictions, the predictive tool can shed light on possible explanations, linking results to existing psychological theories and suggesting new ones. Then comes "closing the circle", where the predictive modeling is combined with explanatory modeling. For testing the explanatory power of words on psychological state, the way to go is experiments. And indeed, the book describes several such experiments investigating the causal effect of words on psychological state, which seem to indicate that there is no causal relationship.

[Thanks to my text-mining-expert colleague Nitin Indurkhya for introducing me to the book!]