
Monday, December 25, 2017

Election polls: description vs. prediction

My papers To Explain or To Predict and Predictive Analytics in Information Systems Research contrast the process and uses of predictive modeling and causal-explanatory modeling. I briefly mentioned there a third type of modeling: descriptive. However, I haven't expanded on how descriptive modeling differs from the other two types (causal explanation and prediction). Descriptive and predictive modeling both rely on correlations, whereas explanatory modeling relies on causality; yet the former two are in fact different. Descriptive modeling aims to give a parsimonious statistical representation of a distribution or relationship, whereas predictive modeling aims to generate values for new/future observations.

The recent paper Election Polls—A Survey, A Critique, and Proposals by Kenett, Pfeffermann & Steinberg gives a fantastic illustration of the difference between description and prediction: the authors contrast the goals of election surveys (such as those conducted by Gallup) with those of survey-based predictive models, such as Nate Silver's FiveThirtyEight:
"There is a subtle, but important, difference between reflecting current public sentiment and predicting the results of an election. Surveys [election polls] have focused largely on the former—in other words, on providing a current snapshot of voting preferences, even when asking about voting preference as if elections were carried out on the day of the survey. In that regard, high information quality (InfoQ) surveys are accurately describing current opinions of the electorate. However, the public perception is often focused on projecting the survey results forward in time to election day, which is eventually used to evaluate the performance of election surveys. Moreover, the public often focuses solely on whether the polls got the winner right and not on whether the predicted vote shares were close to the true results."
In other words, whereas the goal of election surveys is to capture public sentiment at different points in time prior to the election, they are often judged by the public as failures because of low predictive power on election day. The authors continue:
"Providing an accurate current picture and predicting the ultimate winner are not contradictory goals. As the election approaches, survey results are expected to increasingly point toward the eventual election outcome, and it is natural that the success or failure of the survey methodology and execution is judged by comparing the final polls and trends with the actual election results."
Descriptive models differ from predictive models in another sense that can lead to vastly different results: in a descriptive model for an event of interest we can use the past and the future relative to that event time. For example, to describe spikes in pre-Xmas shopping volume we can use data on pre- and post-Xmas days. In contrast, to predict pre-Xmas shopping volume we can only use information available prior to the pre-Xmas shopping period of interest.
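To make that difference concrete, here is a minimal sketch (my own toy illustration with simulated data, not taken from any of the papers above): a descriptive smoother of daily shopping volume is free to use post-Xmas days, while a predictive model for the pre-Xmas period can only use data observed before that period.

```python
# Toy illustration (simulated data): descriptive vs. predictive information sets.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
days = pd.date_range("2017-11-01", "2018-01-15", freq="D")
base = 100 + 5 * np.sin(np.arange(len(days)) / 7)                          # baseline volume
spike = np.where((days >= "2017-12-15") & (days <= "2017-12-24"), 60, 0)   # pre-Xmas spike
volume = pd.Series(base + spike + rng.normal(0, 5, len(days)), index=days)

# Descriptive: a centered 7-day moving average -- each point also uses 3 *future* days,
# which is fine when the goal is to describe the spike after the fact.
descriptive_curve = volume.rolling(7, center=True).mean()

# Predictive: forecast the pre-Xmas window using ONLY data observed up to Dec 14.
train = volume[:"2017-12-14"]
naive_forecast = train.tail(14).mean()            # toy "model": recent-level forecast
actual = volume["2017-12-15":"2017-12-24"].mean()
print(f"forecast: {naive_forecast:.1f}, actual pre-Xmas average: {actual:.1f}")
```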

As Kenett et al. (2017) write, description and prediction are not contradictory. They are different, yet the results of descriptive models can provide leads for strong predictors, and potentially for explanatory variables (which require further investigation using explanatory modeling).

The interesting new book Everybody Lies by Stephens-Davidowitz is a great example of descriptive modeling that uncovers correlations that might be used for prediction (or even for future explanatory work). The author uncovers behavioral search patterns on Google by examining keyword search volumes using Google Trends and AdWords. For the recent US elections, the author identifies a specific keyword search term that separates areas of high performance for Clinton vs. Trump:
"Silver noticed that the areas where Trump performed best made for an odd map. Trump performed well in parts of the Northeast and industrial Midwest, as well as the South. He performed notably worse out West. Silver looked for variables to try to explain this map... Silver found that the single factor that best correlated with Donald Trump's support in the Republican primaries was... [areas] that made the most Google searches for ______."
[I am intentionally leaving the actual keyword blank because it is offensive.]
While finding correlations is a dangerous game that can lead to many false discoveries (two measurements can be correlated because both are affected by something else, such as weather), careful descriptive modeling, tightly coupled with domain expertise, can be useful for exploratory research, whose findings should later be tested using explanatory modeling.
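For readers who want to poke at region-level search interest themselves, here is a rough sketch assuming the unofficial pytrends package is available (the book does not publish code, and the keyword and outcome numbers below are placeholders, not the term Silver found):

```python
# Rough sketch, assuming pytrends (pip install pytrends) is available and works as below;
# the keyword and the outcome values are placeholders for illustration only.
import pandas as pd
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=360)
pytrends.build_payload(kw_list=["some search term"], geo="US",
                       timeframe="2015-01-01 2016-11-01")

# Relative search interest (0-100) by US state for the placeholder keyword
by_state = pytrends.interest_by_region(resolution="REGION")

# Hypothetical outcome to correlate against, e.g. a vote share by state (placeholder values)
outcome = pd.Series({"Alabama": 60.0, "Ohio": 50.0, "Pennsylvania": 48.0,
                     "New York": 37.0, "California": 32.0}, name="vote_share")

merged = by_state.join(outcome, how="inner")
print(merged.corr())   # an association across areas, not evidence of causation
```

Even a strong correlation here is only a description of how the two series move together across areas; explaining why requires separate explanatory modeling.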

While I love Stephens-Davidowitz's idea of using Google Trends to uncover behaviors/thoughts/feelings that are otherwise hidden (because what people say in surveys often diverges from what they really do/think/feel), a key question is whom these search results represent (sampling bias). But that's a different topic altogether.

Thursday, October 16, 2014

What's in a name? "Data" in Mandarin Chinese

The term "data", now popularly used in many languages, is not as innocent as it seems. The biggest controversy that I've been aware of is whether the English term "data" is singular or plural. The tone of an entire article would be different based on the author's decision.

In Hebrew, the word is in plural (Netunim, with the final "im" signifying plural), so no question arises.

Today I discovered another "data" duality, this time in Mandarin Chinese. In Taiwan, the term used is 資料 (Zīliào), while in Mainland China it is 數據 (Shùjù). Which one to use? What is the difference? I did a little research and tried a few popularity tests:

  1. Google Translate from Chinese to English translates both terms to "data". But English-to-Chinese translates "data" to 數據 (Shùjù), with the other term appearing as a secondary option. Here we also learn that 資料 (Zīliào) means "material".
  2. Chinese Wikipedia's main data article (embarrassingly poor) is for 數據 (Shùjù), and the article for 資料 (Zīliào) redirects to it.
  3. A Google search of each term leads to surprising results on number of hits:

[Screenshots: Google search result counts for 資料 (Zīliào) and for 數據 (Shùjù)]
I asked a few colleagues from different Chinese-speaking countries and learned further that 資料 (Zīliào) translates to "information"; a Google Images search for it indeed brings up images of "information". This might also explain its roughly double hit count. A duality between data and information is especially interesting given the relationship between the two (and my related work on Information Quality with Ron Kenett).


So what about Big Data?  Here too there appear to be different possible terms, yet the most popular seems to be 大数据 (Dà shùjù), which also has a reasonably respectable Wikipedia article.

Thanks to my learned colleagues Chun-houh Chen (Academia Sinica), Khim Yong Goh (National University of Singapore), and Mingfeng Lin (University of Arizona) for their inputs.

Monday, October 17, 2011

Early detection of what?

The interest in using pre-diagnostic data for the early detection of disease outbreaks has evolved in interesting ways over the last 10 years. In the early 2000s, I was involved in an effort to explore the potential of non-traditional data sources, such as over-the-counter pharmacy sales and web searches on medical websites, which might give earlier signs of a disease outbreak than confirmed diagnostic data (lab tests, doctor diagnoses, etc.). The pre-diagnostic data sources that we looked at were not only expected to show an earlier footprint of an outbreak than traditional diagnostic data; they were also collected at higher frequency (typically daily) than the weekly or less frequent diagnostic data, and were made available with much less lag time. The general conclusion was that there was indeed potential for improving detection time using such data (and for that we investigated and developed adequate data analytic methods). Evaluation was based on simulating outbreak footprints, which is a challenge in itself (what does a flu outbreak look like in pharmacy sales?), and on examining past data with known outbreaks (where there is often no consensus on the outbreak start date) -- for papers on these issues see here.
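As a rough illustration of that kind of evaluation (a toy sketch of my own, not the actual methods or data from those papers), one can inject a stylized outbreak footprint into simulated daily pharmacy-sales counts and check how quickly a simple monitoring statistic, here a one-sided CUSUM, raises an alarm:

```python
# Toy sketch: inject a stylized outbreak footprint into simulated daily
# pharmacy-sales counts and monitor with a simple one-sided CUSUM.
import numpy as np

rng = np.random.default_rng(7)
n_days, baseline = 120, 200.0
counts = rng.poisson(baseline, n_days).astype(float)

# Stylized outbreak footprint: a gradual ramp-up starting on day 90
outbreak_start = 90
counts[outbreak_start:] += np.linspace(5, 60, n_days - outbreak_start)

mu, sigma = baseline, np.sqrt(baseline)   # in-control mean and sd (Poisson counts)
k, h = 0.5, 6.0                           # reference value and alarm threshold (standardized units)
cusum, alarm_day = 0.0, None
for day, y in enumerate(counts):
    z = (y - mu) / sigma
    cusum = max(0.0, cusum + z - k)
    if cusum > h:
        alarm_day = day
        break

print(f"outbreak injected on day {outbreak_start}, first alarm on day {alarm_day}")
```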

A few years ago, Google came out with Google Flu Trends, which monitors web searches for terms related to flu, with the underlying assumption that people (or their relatives, friends, etc.) who are experiencing flu-like symptoms would search for related terms on the web. Google compared its performance to the weekly diagnostic data from the Centers for Disease Control and Prevention (CDC). In a joint paper by Google and CDC researchers (also published in Nature), they claimed:
"we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day."
[Figure: Blue = Google Flu Trends estimate; Orange = CDC data. From google.org/flutrends/about/how.html]

What can you do if you have an early alert of a disease outbreak? The information can be used for stockpiling medicines, vaccination plans, raising public awareness, preparing hospitals, and more. Now comes the interesting part: recently there has been criticism of the Google Flu Trends claims, saying that "while Google Flu Trends is highly correlated with rates of [Influenza-like illness], it has a lower correlation with actual influenza tests positive". In other words, Google detects not a flu outbreak, but rather a perception of flu. Does this mean that Google Flu Trends is useless? Absolutely not. It just means that the goal and the analysis results must be aligned more carefully. As the Popular Mechanics blog writes:
Google Flu Trends might, however, provide some unique advantages precisely because it is broad and behavior-based. It could help keep track of public fears over an epidemic 
Aligning the question of interest with the data (and analysis method) is related to what Ron Kenett and I call "Information Quality", or "the potential of a dataset to answer a question of interest using a given data analysis method". In the early disease detection problem, the lesson is that diagnostic and pre-diagnostic data are not just two different data sets (monitored, perhaps, with different statistical methods); they also differ fundamentally in the questions they can answer.
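To see how the same series can answer one question well and another poorly, here is a toy sketch with made-up weekly series (illustrative only, not actual Google Flu Trends or CDC data): a search-volume signal that tracks influenza-like-illness (ILI) visits correlates highly with ILI, yet more weakly with lab-confirmed influenza.

```python
# Made-up weekly series illustrating the ILI-vs-confirmed-flu distinction.
import numpy as np

rng = np.random.default_rng(3)
weeks = 52
confirmed_flu = np.clip(np.sin(np.linspace(0, 2 * np.pi, weeks)), 0, None) * 10
media_scare = rng.normal(0, 2, weeks).cumsum() * 0.3             # non-flu driver of ILI visits and searches
ili = confirmed_flu + 3 + media_scare + rng.normal(0, 1, weeks)  # ILI visits = flu + other causes
searches = 0.9 * ili + rng.normal(0, 1, weeks)                   # search volume tracks ILI

print("corr(searches, ILI)           =", round(np.corrcoef(searches, ili)[0, 1], 2))
print("corr(searches, confirmed flu) =", round(np.corrcoef(searches, confirmed_flu)[0, 1], 2))
```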