A few years ago, Google came out with Google Flu Trends, which monitors web searches for flu-related terms, on the underlying assumption that people experiencing flu-like symptoms (or their relatives, friends, etc.) would search the web for related terms. Google compared its estimates to the weekly diagnostic data from the Centers for Disease Control and Prevention (CDC). In a joint paper by Google and CDC researchers, they claimed:
we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day. (also published in Nature)
Blue = Google flu estimate; Orange = CDC data. From google.org/flutrends/about/how.html
What can you do if you have an early alert of a disease outbreak? The information can be used for stockpiling medicines, planning vaccinations, raising public awareness, preparing hospitals, and more.
Now comes the interesting part: recently, the Google Flu Trends claims have been criticized, with researchers saying that "while Google Flu Trends is highly correlated with rates of [Influenza-like illness], it has a lower correlation with actual influenza tests positive". In other words, Google detects not a flu outbreak, but rather a perception of flu. Does this mean that Google Flu Trends is useless? Absolutely not. It just means that the goal and the analysis results must be aligned more carefully. As the Popular Mechanics blog writes:
Google Flu Trends might, however, provide some unique advantages precisely because it is broad and behavior-based. It could help keep track of public fears over an epidemic.
Aligning the question of interest with the data (and analysis method) is related to what Ron Kenett and I call "Information Quality", or "the potential of a dataset to answer a question of interest using a given data analysis method". In the early disease detection problem, the lesson is that diagnostic and pre-diagnostic data should not be treated merely as two different data sets (monitored perhaps with different statistical methods); they also differ fundamentally in the questions they can answer.