BzST | Business Analytics, Statistics, Teaching

Wednesday, December 07, 2011

Polleverywhere.com -- how it worked out

Following up on my earlier post about the use of polleverywhere.com for polling in class, here is a summary of my experience using it in a data mining elective course @ ISB (38 students, after four sessions):

Creating polls: After a few tries and with a few very helpful tips from a PE representative, I was able to create polls and embed them into my Power Point slides. This is relatively easy and user-friendly. One feature that is currently missing in PE, which I use a lot, is the inclusion of a figure on the poll slide (for example, a snippet of some software output). Although you can paste the image on the PPT, it takes a bit of testing to place it so that it does not overlap on the poll. Also, if you need to use the poll in a browser instead of the PPT (see below), the image won't be there...
Operation in class: PE requires good Internet connection for the instructor and for all the users with laptops or using the wireless with a different device. Although wireless is generally operational in the classroom that I used, I did encounter a few times when it was flaky, which is very disruptive (the poll does not load; students cannot respond). Secondly, I found that voting takes much longer with mobile/laptops than with clickers. What would have taken 30 seconds with clickers can take several minutes with PE voting.
Student adoption: During the first session students were curious and quickly figured out how to vote. Students could either vote using a browser (I created the page pollev.com/profgalit where live polls would show up) or those lacking Internet access used their mobiles to tweet via SMS (Airtel free SMS to 53000; other carriers SMS to Bangalore number 09243000111 via smstweet.in). As the sessions progressed, the number of voters started dropping drastically. I suspected that this might be a result of my changing the settings to allow only registered users to vote. So I switched back to "anyone can vote", yet the voting percentage remained very low.

I have never graded voting, and rather use it as a fun active learning tool. With clickers response rate was typically around 80-90%, while with PE it is currently lower than 50%. Given our occasional Internet challenge, the longer voting time, and especially the low response rate I will be going back to clickers for now.

I foresee that PE would work nicely in a setting such as a one-time talk at a large conference, or a one-day workshop for execs. I will also mention the excellent and timely support by PE. And, of course, the low price!

Monday, October 17, 2011

Early detection of what?

The interest in using pre-diagnostic data for the early detection of disease outbreaks, has evolved in interesting ways in the last 10 years. In the early 2000s, I was involved in an effort to explore the potential of non-traditional data sources, such as over-the-counter pharmacy sales and web searches on medical websites, which might give earlier signs of a disease outbreak than confirmed diagnostic data (lab tests, doctor diagnoses, etc.). The pre-diagnostic data sources that we looked at were not only expected to have an earlier footprint of the outbreak compared to traditional diagnostic data, but they were also collected at higher frequency (typically daily) compared to the weekly or even less frequent diagnostic data, and were made available with much less lag time. The general conclusion was that there indeed was potential in improving the detection time using such data (and for that we investigated and developed adequate data analytic methods). Evaluation was based on simulating outbreak footprints, which is a challenge in itself (what does a flu outbreak look like in pharmacy sales?), and on examining past data with known outbreaks (where there is often no consensus on the outbreak start date) -- for papers on these issues see here.

A few years ago, Google came out with Google Flu Trends, which monitors web searches for terms that are related to flu, with the underlying assumption that people (or their relatives/friends, etc.) who are experiencing flu-like symptoms would be searching for related terms on the web. Google compared their performance to the weekly diagnostic data by the Centers for Disease Control and Prevention (CDC). In a joint paper between a Google and CDC researchers, they claimed:

we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day. (also published in Nature)

Blue= Google flu estimate; Orange=CDC data. From google.org/flutrends/about/how.html

What can you do if you have an early alert of a disease outbreak? the information can be used for stockpiling medicines, vaccination plans, providing public awareness, preparing hospitals, and more. Now comes the interesting part: recently, there has been criticism of the Google Flu Trends claims, saying that "while Google Flu Trends is highly correlated with rates of [Influenza-like illness], it has a lower correlation with actual influenza tests positive". In other words, Google detects not a flu outbreak, but rather a perception of flu. Does this means that Google Flu Trends is useless? Absolutely not. It just means that the goal and the analysis results must be aligned more carefully. As the Popular Mechanics blog writes:

Google Flu Trends might, however, provide some unique advantages precisely because it is broad and behavior-based. It could help keep track of public fears over an epidemic

Aligning the question of interest with the data (and analysis method) is related to what Ron Kenett and I call "Information Quality", or "the potential of a dataset to answer a question of interest using a given data analysis method". In the early disease detection problem, the lesson is that diagnostic and pre-diagnostic data should not just be considered two different data sets (monitored perhaps with different statistical methods), but they also differ fundamentally in terms of the questions they can answer.

Saturday, October 01, 2011

Language and psychological state: explain or predict?

Quite a few of my social science colleagues think that predictive modeling is not a kosher tool for theory building. In our 2011 MISQ paper "Predictive Analytics in Information Systems Research" we argue that predictive modeling has a critical role to play not only in theory testing but also in theory building. How does it work? Here's an interesting example:

The new book The Secret Life of Pronouns by the cognitive psychologist Pennebaker is a fascinating read in many ways. The book describes how analysis of written language can be predictive of psychological state. In particular, the author describes an interesting text mining approach that analyzes text written by a person and creates a psychological profile of the writer. In the author's context, the approach is used to study the effect of writing on recovery from psychological trauma. You can get a taste of word analysis on the AnalyzeWords.com website, run by the author and his colleagues, which analyzes the personality of a tweeter.

In the book, Pennebaker describes how the automated analysis of language has shed light on the probability that people who underwent psychological trauma will recuperate. For instance, people who used a moderate amount of negative language were more likely to improve than those who used too little or too much negative language. Or, people who tended to change perspectives in their writing over time (from "I" to "they" or "we") were more likely to improve.

Now comes a key question. In the words of the author (p.14): "Do words reflect a psychological state or do they cause it?". The statistical/data-mining text mining application is obviously a predictive tool that is build on correlations/associations. Yet, by examining when it predicts accurately and studying the reasons for the accurate (or inaccurate) predictions, the predictive tool can shed insightful light on possible explanations, linking results to existing psychological theories and giving ideas for new ones. Then comes the "close the circle", where the predictive modeling is combined with explanatory modeling. For testing the explanatory power of words on psychological state, the way to go is experiments. And indeed, the book describes several such experiments investigating the causal effect of words on psychological state, which seem to indicate that there is no causal relationship.

[Thanks to my text-mining-expert colleague Nitin Indurkhya for introducing me to the book!]

Monday, September 19, 2011

Statistical considerations and psychological effects in clinical trials

I find it illuminating to read statistics "bibles" in various fields, which not only open my eyes to different domains, but also present the statistical approach and methods somewhat differently and considering unique domain-specific issues that cause "hmmmm" moments.

The 4th edition of Fundamentals of Clinical Trials, whose authors combine extensive practical experience at NIH and in academia, is full of hmmm moments. In one, the authors mention an important issue related to sampling that I have not encountered in other fields. In clinical trials, the gold standard is to allocate participants to either an intervention or a non-intervention (baseline) group randomly, with equal probabilities. In other words, half the participants receive the intervention and the other half does not (the non-intervention can be a placebo, the traditional treatment, etc.) The authors advocate a 50:50 ratio, because "equal allocation is the most powerful design". While there are reasons to change the ratio in favor of the intervention or baseline groups, equal allocation appears to have an important additional psychological advantage over unequal allocation in clinical trials:

Unequal allocation may indicate to the participants and to their personal physicians that one intervention is preferred over the other (pp. 98-99)

Knowledge of the sample design by the participants and/or the physicians also affects how randomization is carried out. It becomes a game between the designers and the participants and staff, where the two sides have opposing interests: to blur vs. to uncover the group assignments before they are made. This gaming requires devising special randomization methods (which, in turn, require data analysis that takes the randomization mechanism into account).

For example, to assure an equal number of participants in each of the two groups, given that participants enter sequentially, "block randomization" can be used. For instance, to assign 4 people to one of two groups A or B, consider all the possible arrangements AABB, AABA, etc., then choose one sequence at random, and assign participants accordingly. The catch is that if the staff have knowledge that the block size is 4 and know the first three allocations, they automatically know the fourth allocation and can introduce bias by using this knowledge to select every fourth participant.

Where else does such a psychological effect play a role in determining sampling ratios? In applications where participants and other stakeholders have no knowledge of the sampling scheme this is obviously a non-issue. For example, when Amazon or Yahoo! present different information to different users, the users have no idea about the sample design, and maybe not even that they are in an experiment. But how is the randomization achieved? Unless the randomization process is fully automated and not susceptible to reverse engineering, someone in the technical department might decide to favor friends by allocating them to the "better" group...

Thursday, September 15, 2011

Mining health-related data: How to benefit scientific research

Image from KDnuggets.com

While debates over privacy issues related to electronic health records are still ongoing, predictive analytics are beginning to being used with administrative health data (available to health insurance companies, aka, "health provider networks"). One such venue are large data mining contests. Let me describe a few and then get to my point about their contribution to pubic health, medicine and to data mining research.

The latest and grandest is the ongoing $3 million prize contest by Hereitage Provider Network, which opened in 2010 and lasts 2 years. The contest's stated goal is to create "an algorithm that predicts how many days a patient will spend in a hospital in the next year". Participants get a dataset of de-identified medical records of 100,000 individuals, on which they can train their algorithms. The article in KDNuggets.com suggests that this competition's goal is "to spur development of new approaches in the analysis of health data and create new predictive algorithms."

The 2010 SAS Data Mining Shootout contest was also health-related. Unfortunately, the contest webpage is no longer available (the problem description and data were previously available here), and I couldn't find any information on the winning strategies. From an article in KDNuggets:

"analyzing the medical, demographic, and behavioral data of 50,788 individuals, some of whom had diabetes. The task was to determine the economic benefit of reducing the Body Mass Indices (BMIs) of a selected number of individuals by 10% and to determine the cost savings that would accrue to the Federal Government's Medicare and Medicaid programs, as well as to the economy as a whole"

In 2009, the INFORMS data mining contest was co-organized by IBM Research and Health Care Intelligence, focused on "health care quality". Strangely enough, this contest website is also gone. A brief description by the organizer (Claudia Perlich) is given on KDNuggets.com, stating the two goals :

modeling of a patient transfer guideline for patients with a severe medical condition from a community hospital setting to tertiary hospital provider and
assessment of the severity/risk of death of a patient's condition.

What about presentations/reports from the winners? I had a hard time finding any (here is a deck of slides by a group competing in the 2011 SAS Shootout, also health-related). But photos holding awards and checks abound.

If these health-related data mining competitions are to promote research and solutions in these fields, the contest webpages with problem description, data, as well as presentations/reports by the winners should continue to be publicly available (as for the annual KDD Cup competitions by the ACM). Posting only names and photos of the winners makes data mining competitions look more like a consulting job where the data provider is interested in solving one particular problem for its own (financial or other) benefit. There is definitely scope for a data mining group/organization to collect all this info while it is live and post it in one central website.

Wednesday, September 07, 2011

Multiple testing with large samples

Multiple testing (or multiple comparisons) arises when multiple hypotheses are tested using the same dataset via statistical inference. If each test has false alert level α, then the combined false alert rate of testing k hypotheses (also called the "overall type I error rate") can be as large as 1-(1-α)^k (exponential in the number of hypotheses k). This is a serious problem and ignoring it can lead to false discoveries. See an earlier post with links to examples.

There are various proposed corrections for multiple testing, the most basic principle being reducing the individual α's. However, the various corrections suffer in this way or the other from reducing statistical power (the probability of detecting a real effect). One important approach is to limit the number of hypotheses to be tested. All this is not new to statisticians and also to some circles of researchers in other areas (a 2008 technical report by the US department of education nicely summarizes the issue and proposes solutions for education research).

"Large-Scale" = many measurements

The multiple testing challenge has become especially prominent in analyzing micro-array genomic data, where datasets have measurements on many genes (k) for a few people (n). In this new area, inference is used more in an exploratory fashion, rather than confirmatory. The literature on "large-k-small-n" problems has also grown considerably since, including a recent book Large-Scale Inference by Bradley Efron.

And now I get to my (hopefully novel) point: empirical research in the social sciences is now moving to the era of "large n and same old k" datasets. This is what I call "large samples". With large datasets becoming more easily available, researchers test few hypotheses using tens and hundreds of thousands of observations (such as lots of online auctions on eBay or many books on Amazon). Yet, the focus has remained on confirmatory inference, where a set of hypotheses that are derived from a theoretical model are tested using data. What happens to multiple testing issues in this environment? My claim is that they are gone! Decrease α to your liking, and you will still have more statistical power than you can handle.

But wait, it's not so simple: With very large samples, the p-value challenge kicks in, such that we cannot use statistical significance to infer practically significant effects. Even if we decrease α to a tiny number, we'll still likely get lots of statistically-significant-but-practically-meaningless results.

The bottom line is that with large samples (large-n-same-old-k), the approach to analyzing data is totally different: no need to worry about multiple testing, which is so crucial in small samples. This is only one among many other differences between small-sample and large-sample data analysis.

Tuesday, September 06, 2011

"Predict" or "Forecast"?

What is the difference between "prediction" and "forecasting"? I heard this being asked quite a few times lately. The Predictive Analytics World conference website has a Predictive Analytics Guide page with the following Q&A:

How is predictive analytics different from forecasting?

Predictive analytics is something else entirely, going beyond standard forecasting by producing a predictive score for each customer or other organizational element. In contrast, forecasting provides overall aggregate estimates, such as the total number of purchases next quarter. For example, forecasting might estimate the total number of ice cream cones to be purchased in a certain region, while predictive analytics tells you which individual customers are likely to buy an ice cream cone.

In a recent interview on "Data Analytics", Prof Ram Gopal asked me a similar question. I have a slightly different view of the difference: the term "forecasting" is used when it is a time series and we are predicting the series into the future. Hence "business forecasts" and "weather forecasts". In contrast, "prediction" is the act of predicting in a cross-sectional setting, where the data are a snapshot in time (say, a one-time sample from a customer database). Here you use information on a sample of records to predict the value of other records (which can be a value that will be observed in the future). That's my personal distinction.

While forecasting has traditionally focused on providing "overall aggregate estimates", that has long changed, and methods of forecasting are commonly used to provide individual estimates. Think again of weather forecasts -- you can get forecasts for very specific areas. Moreover, daily (and even minute-by-minute) weather forecasts are generated for many different geographical areas. Another example is SKU-level forecasting for inventory management purposes. Stores and large companies often use forecasting to predict every product they carry. These are not aggregate values, but individual-product forecasts.

"Old fashioned" forecasting has indeed been around for a long time, and has been taught in statistics and operations research programs and courses. While some forecasting models require a lot of statistical expertise (such as ARIMA, GARCH and other acronyms), there is a terrific and powerful set of data-driven, computationally fast, automated methods that can be used for forecasting even at the individual product/service level. Forecasting, in my eyes, is definitely part of predictive analytics.