Monday, December 25, 2017

Election polls: description vs. prediction

My papers To Explain or To Predict and Predictive Analytics in Information Systems Research contrast the process and uses of predictive modeling and causal-explanatory modeling. I briefly mentioned there a third type of modeling: descriptive. However, I haven't expanded on how descriptive modeling differs from the other two types (causal explanation and prediction). While descriptive and predictive modeling both share the reliance on correlations, whereas explanatory modeling relies on causality, the former two are in fact different. Descriptive modeling aims to give a parsimonious statistical representation of a distribution or relationship, whereas predictive modeling aims at generating values for new/future observations.

The recent paper Election Polls—A Survey, A Critique, and Proposals by Kenett, Pfeffermann & Steinberg gives a fantastic illustration of the difference between description and prediction: they explain the different goals of election surveys (such as those conducted by Gallup) as compared to survey-based predictive models such as those by Nate Silver's FiveThirtyEight:
"There is a subtle, but important, difference between reflecting current public sentiment and predicting the results of an election. Surveys [election polls] have focused largely on the former—in other words, on providing a current snapshot of voting preferences, even when asking about voting preference as if elections were carried out on the day of the survey. In that regard, high information quality (InfoQ) surveys are accurately describing current opinions of the electorate. However, the public perception is often focused on projecting the survey results forward in time to election day, which is eventually used to evaluate the performance of election surveys. Moreover, the public often focuses solely on whether the polls got the winner right and not on whether the predicted vote shares were close to the true results."
In other words, whereas the goal of election surveys is to capture the public perception at different points in time prior to the election, they are often judged by the public as failed because of low predictive power on elections day. The authors continue to say:
"Providing an accurate current picture and predicting the ultimate winner are not contradictory goals. As the election approaches, survey results are expected to increasingly point toward the eventual election outcome, and it is natural that the success or failure of the survey methodology and execution is judged by comparing the final polls and trends with the actual election results."
Descriptive models differ from predictive models in another sense that can lead to vastly different results: in a descriptive model for an event of interest we can use the past and the future relative to that event time. For example, to describe spikes in pre-Xmas shopping volume we can use data on pre- and post-Xmas days. In contrast, to predict pre-Xmas shopping volume we can only use information available prior to the pre-Xmas shopping period of interest.

As Kenett et al. (2017) write, description and prediction are not contradictory. They are different, yet the results of descriptive models can provide leads for strong predictors, and potentially for explanatory variables (which require further investigation using explanatory modeling).

The new interesting book Everybody Lies by Stephens-Davidowitz is a great example of descriptive modeling that uncovers correlations that might be used for prediction (or even for future explanatory work). The author uncovers behavioral search patterns on Google by examining keywords search volumes using Google Trends and AdWords. For the recent US elections, the author identifies a specific keyword search term that separates between areas of high performance for Clinton vs. Trump:
"Silver noticed that the areas where Trump performed best made for an odd map. Trump performed well in parts of the Northeast and industrial Midwest, as well as the South. He performed notably worse out West. Silver looked for variables to try to explain this map... Silver found that the single factor that best correlated with Donald Trump's support in the Republican primaries was... [areas] that made the most Google searches for ______."
[I am intentionally leaving the actual keyword blank because it is offensive.]
While finding correlations is a dangerous game that can lead to many false discoveries (two measurements can be correlated because they are both affected by something else, such as weather), careful descriptive modeling tightly coupled with domain expertise can be useful for exploratory research, which later should be tested using explanatory modeling.

While I love Stephens-Davidowitz' idea of using Google Trends to uncover behaviors/thoughts/feelings that are hidden otherwise (because what people say in surveys often diverges from what they really do/think/feel), a main question is who do these search results represent? (sampling bias). But that's a different topic altogether.

Monday, November 06, 2017

Statistical test for "no difference"

To most researchers and practitioners using statistical inference, the popular hypothesis testing universe consists of two hypotheses:
H0 is the null hypothesis of "zero effect"
H1 is the alternative hypothesis of "a non-zero effect"

The alternative hypothesis (H1) is typically what the researcher is trying to find: a different outcome for a treatment and control group in an experiment, a regression coefficient that is non-zero, etc. Recently, several independent colleagues have asked me if there's a statistical way to show that an effect is zero, or, that there's no difference between groups. Can we simply use the above setup? The answer is no. Can we simply reverse the hypotheses? Uh-uh, because the "equal" must be in H0.

Minitab has equivalence testing (from
Here's why: In the classic setup the hypotheses are stated about the population of interest, and we take a sample from that population to test them. In this non-symmetrical setup, H0 is assumed to be true unless otherwise proven by the result in the sample. Hypothesis testing has its roots in Karl Popper's falsifiability principle, where any claim about the existence of an effect cannot be made unless it is first shown that a situation of no effect is untenable. This is similar to a democratic justice system, where the defendant is assumed not-guilty unless proven so. The burden of proof lies on the researcher/data. That's why we reject H0 with sufficient evidence or not reject H0, if we don't have sufficient evidence against it. This setup is not designed to arrive at a conclusion that H0 is true.

In a 2013 Letter to the Editor of Journal of Sports Sciences, titled Testing the null hypothesis: the forgotten legacy of Karl Popper? Mick Wilkinson suggests that this setup is the opposite of what a researcher should be doing according to the scientific method, and in fact "Our work should remain driven by conjecture and attempted falsification such that it is always the null hypothesis that is tested. The write up of our studies should make it clear that we are indeed testing the null hypothesis and conforming to the established and accepted philosophical conventions of the scientific method." He therefore suggests the following sequence:

  1. null hypothesis tests are carried out to first establish that a population effect is in fact unlikely to be zero
  2. a confidence-interval based approach estimates what the magnitude of effect might plausibly be
  3. a probability associated with the likelihood of the population effect exceeding an apriori smallest meaningful effect is calculated

While this provides a relevant criticism of the hypothesis testing paradigm, it does not directly provide a test of equivalence! The good news is that equivalence testing is a popular such scenario in pharmacokinetics, arising, for example, when a pharmaceutical wants to show that their developed generic drug is equivalent to a brand drug. This is termed bioequivalence. In other words, H1 is "the drugs are equivalent". The approach used there is the following:

  1. set up an equivalence bound that determines the smallest clinically-meaningful effect size of interest
  2. calculate a confidence interval around the observed effect size (say, difference between the mean outcomes of the generic and brand drugs)
  3. if the confidence interval includes the equivalence bound, then the groups are equivalent; otherwise they are not equivalent
The Wikipedia article on Equivalence Test points out two additional interesting uses of equivalence testing:
  • Avoiding misinterpretation of large p-values in ordinary testing as evidence for H0: "Equivalence tests can be performed in addition to null-hypothesis significance tests. This might prevent common misinterpretations of p-values larger than the alpha level as support for the absence of a true effect."
  • The confidence interval used in equivalence testing can help distinguish between statistical significance and practical/clinical significance: if it includes/excludes the value 0, that indicates statistical insignificance/significance between the groups (or an effect), while if it includes/excludes the equivalence bound that indicates practical insignificance/significance. The four options are shown in the figure.
Statistical vs. practical significance (from
How will sample size affect equivalence testing? We know that in ordinary hypothesis testing a sufficiently large sample will lead to detecting practically insignificant effects by generating a very small p-value, which is bad news for those relying on classic hypothesis testing! My colleague Foster Provost from NYU once challenged me how I could trust a statistical method that breaks down with large samples - a poignant thought that eventually led to my co-authored paper Too Big To Fail: Large Samples and the p-value Problem  (Lin et al., ISR 2013). What about equivalence tests? In equivalence testing, a very large sample will behave properly: with more data we'll get narrower confidence intervals (more certainty). A practically insignificant difference will therefore generate a narrow confidence interval that excludes the equivalence boundary (=equivalence). In contrast, a practically significant difference will generate a narrow confidence interval that completely exceeds the equivalence boundary (non-equivalence).

Tuesday, September 05, 2017

My videos for “Business Analytics using Data Mining” now publicly available!

Five years ago, in 2012, I decided to experiment in improving my teaching by creating a flipped classroom (and semi-MOOC) for my course “Business Analytics Using Data Mining” (BADM) at the Indian School of Business. I initially designed the course at University of Maryland’s Smith School of Business in 2005 and taught it until 2010. When I joined ISB in 2011 I started teaching multiple sections of BADM (which was started by Ravi Bapna in 2006), and the course was fast growing in popularity. Repeating the same lectures in multiple course sections made me realize it was time for scale! I therefore created 30+ videos, covering various supervised methods (k-NN, linear and logistic regression, trees, naive Bayes, etc.) and unsupervised methods (principal components analysis, clustering, association rules), as well as important principles such as performance evaluation, the notion of a holdout set, and more.

I created the videos to support teaching with our textbook “Data Mining for Business Analytics” (the 3rd edition and a SAS JMP edition came out last year; R edition coming out this month!). The videos highlight the key points in different chapters, (hopefully) motivating the watcher to read more in the textbook, which also offers more examples. The videos’ order follows my course teaching, but the topics are mostly independent.

The videos were a big hit in the ISB courses. Since moving to Taiwan, I've created and offered a similar flipped BADM course at National Tsing Hua University, and the videos are also part of the Predictive Analytics series. I’ve since added a few more topics (e.g., neural nets and discriminant analysis).

The audience for the videos (and my courses and textbooks) is non-technical folks who need to understand the logic and uses of data mining, at the managerial level. The videos are therefore about problem solving, and hence the "Business Analytics" in the title. They are different from the many excellent machine learning videos and MOOCs in focus and in technical level -- a basic statistics course that covers linear regression and some business experience should be sufficient for understanding the videos.
For 5 years, and until last week, the videos were only available to past and current students. However, the word spread and many colleagues, instructors, and students have asked me for access. After 5 years, and in celebration of the first R edition of our textbook Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, I decided to make it happen. All 30+ videos are now publicly available on my BADM YouTube playlist.

Currently the videos cater only to those who understand English. I opened the option for community-contributed captions, in the hope that folks will contribute captions in different languages to help make the knowledge propagate further.

This new playlist complements a similar set of videos, on "Business Analytics Using Forecasting" (for time series), that I created at NTHU and made public last year, as part of a MOOC offered on FutureLearn with the next round opening in October.

Finally, I’ll share that I shot these videos while I was living in Bhutan. They are all homemade -- I tried to filter out barking noises and to time the recording when ceremonies were not held close to our home. If you’re interested in how I made the materials and what lessons I learned for flipping my first course, check out my 2012 post.

Tuesday, March 14, 2017

Data mining algorithms: how many dummies?

There's lots of posts on "k-NN for Dummies". This one is about "Dummies for k-NN"

Categorical predictor variables are very common. Those who've taken a Statistics course covering linear (or logistic) regression, know the procedure to include a categorical predictor into a regression model requires the following steps:

  1. Convert the categorical variable that has m categories, into m binary dummy variables
  2. Include only m-1 of the dummy variables as predictors in the regression model (the dropped out category is called the reference category)
For example, if we have X={red, yellow, green}, in step 1 we create three dummies:
D_red = 1 if the value is 'red' and 0 otherwise
D_yellow = 1 if the value is 'yellow' and 0 otherwise
D_green = 1 if the value is 'green' and 0 otherwise

In the regression model we might have: Y = b0 + b1 D_red + b2 D_yellow + error
[Note: mathematically, it does not matter which dummy you drop out: the regression coefficients b1, b2
now compare against the left-out category].

When you move to data mining algorithms such as k-NN or trees, the procedure is different: we include all m dummies as predictors when m>2, but in the case m=2, we use a single dummy. Dropping a dummy (when m>2) will distort the distance measure, leading to incorrect distances.
Here's an example, based on X = {red, yellow, green}:

Case 1: m=3 (use 3 dummies)

Here are 3 records, their category (color), and their dummy values on (D_red, D_yellow, D_green):

The distance between each pair of records (in terms of color) should be identical, since all three records are different from each other. Suppose we use Euclidean distance. The distance between each pair of records will be equal to 2. For example:

Distance(#1, #2) = (1-0)^2 + (0-1)^2 + (0-0)^2 = 2.

If we drop one dummy, then the three distances will no longer be identical! For example, if we drop D_green:
Distance(#1, #2) = 1 + 1 = 2
Distance(#1, #3) = 1
Distance(#2, #3) = 1

Case 2: m=2 (use single dummy)

The above problem doesn't happen with m=2. Suppose we have only {red, green}, and use a single dummy. The distance between a pair of records will be 0 if the records are the same color, or 1 if they are different.
Why not use 2 dummies? If we use two dummies, we are doubling the weight of this variable but not adding any information. For example, comparing the red and green records using D_red and D_green would give Distance(#1, #3) = 1 + 1 = 2.

So we end up with distances of 0 or 2 instead of weights of 0 or 1.

Bottom line 

In data mining methods other than regression models (e.g., k-NN, trees, k-means clustering), we use m dummies for a categorical variable with m categories - this is called one-hot encoding. But if m=2 we use a single dummy.