Monday, November 06, 2017

Statistical test for "no difference"

To most researchers and practitioners using statistical inference, the popular hypothesis testing universe consists of two hypotheses:
H0 is the null hypothesis of "zero effect"
H1 is the alternative hypothesis of "a non-zero effect"

The alternative hypothesis (H1) is typically what the researcher is trying to find: a different outcome for a treatment and control group in an experiment, a regression coefficient that is non-zero, etc. Recently, several independent colleagues have asked me if there's a statistical way to show that an effect is zero, or that there's no difference between groups. Can we simply use the above setup? The answer is no. Can we simply reverse the hypotheses? Uh-uh, because the "equal" must be in H0.

[Figure: Minitab has equivalence testing]

Here's why: In the classic setup the hypotheses are stated about the population of interest, and we take a sample from that population to test them. In this non-symmetrical setup, H0 is assumed to be true unless the sample provides sufficient evidence against it. Hypothesis testing has its roots in Karl Popper's falsifiability principle, where any claim about the existence of an effect cannot be made unless it is first shown that a situation of no effect is untenable. This is similar to a democratic justice system, where the defendant is presumed not guilty unless proven guilty. The burden of proof lies on the researcher/data. That's why we either reject H0, given sufficient evidence, or do not reject H0, if we don't have sufficient evidence against it. This setup is not designed to arrive at a conclusion that H0 is true.

In a 2013 Letter to the Editor of Journal of Sports Sciences, titled "Testing the null hypothesis: the forgotten legacy of Karl Popper?", Mick Wilkinson suggests that this setup is the opposite of what a researcher should be doing according to the scientific method, and in fact "Our work should remain driven by conjecture and attempted falsification such that it is always the null hypothesis that is tested. The write up of our studies should make it clear that we are indeed testing the null hypothesis and conforming to the established and accepted philosophical conventions of the scientific method." He therefore suggests the following sequence:

  1. null hypothesis tests are carried out to first establish that a population effect is in fact unlikely to be zero
  2. a confidence-interval based approach estimates what the magnitude of effect might plausibly be
  3. a probability associated with the likelihood of the population effect exceeding an a priori smallest meaningful effect is calculated

While this provides a relevant criticism of the hypothesis testing paradigm, it does not directly provide a test of equivalence! The good news is that equivalence testing is well established in pharmacokinetics, where it arises, for example, when a pharmaceutical company wants to show that its generic drug is equivalent to a brand-name drug. This is termed bioequivalence. In other words, H1 is "the drugs are equivalent". The approach used there is the following:

  1. set up an equivalence bound that determines the smallest clinically-meaningful effect size of interest
  2. calculate a confidence interval around the observed effect size (say, difference between the mean outcomes of the generic and brand drugs)
  3. if the confidence interval falls entirely within the equivalence bounds (that is, it excludes the bound on either side), then the groups are declared equivalent; otherwise equivalence cannot be concluded
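
The three steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure of any particular package: the function name `equivalence_by_ci`, the made-up `generic` and `brand` measurements, and the bounds are all hypothetical. It uses a large-sample normal approximation for the interval, and it follows the common convention of a 90% interval when alpha = 0.05 (matching two one-sided tests); a t-based interval would be more appropriate for small samples.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def equivalence_by_ci(x, y, bound, alpha=0.05):
    """Check equivalence of two group means with a confidence interval.

    Assumes large samples (normal approximation) and a symmetric
    equivalence region from -bound to +bound.
    """
    diff = mean(x) - mean(y)                       # observed effect size
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    z = NormalDist().inv_cdf(1 - alpha)            # 90% CI when alpha = 0.05
    lower, upper = diff - z * se, diff + z * se
    # Equivalent only if the whole interval sits inside the bounds.
    equivalent = (-bound < lower) and (upper < bound)
    return lower, upper, equivalent


# Hypothetical outcome measurements for a generic and a brand drug.
generic = [100 + (i % 7) - 3 for i in range(200)]
brand = [100.2 + (i % 5) - 2 for i in range(200)]

# Step 1: choose the smallest meaningful difference (an assumption here).
wide = equivalence_by_ci(generic, brand, bound=1.0)
narrow = equivalence_by_ci(generic, brand, bound=0.3)

print("CI:", wide[:2])
print("equivalent with bound 1.0:", wide[2])
print("equivalent with bound 0.3:", narrow[2])
```

With the same data, the conclusion depends on the chosen bound, which is why the bound should be set from subject-matter knowledge before looking at the data.
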
The Wikipedia article on Equivalence Test points out two additional interesting uses of equivalence testing:
  • Avoiding misinterpretation of large p-values in ordinary testing as evidence for H0: "Equivalence tests can be performed in addition to null-hypothesis significance tests. This might prevent common misinterpretations of p-values larger than the alpha level as support for the absence of a true effect."
  • The confidence interval used in equivalence testing can help distinguish between statistical significance and practical/clinical significance: if it includes/excludes the value 0, that indicates statistical insignificance/significance of the difference between the groups (or of an effect), while if it lies entirely inside/entirely outside the equivalence bounds, that indicates practical insignificance/significance. The four options are shown in the figure.
[Figure: Statistical vs. practical significance]

How will sample size affect equivalence testing? We know that in ordinary hypothesis testing a sufficiently large sample will lead to detecting practically insignificant effects by generating a very small p-value, which is bad news for those relying on classic hypothesis testing! My colleague Foster Provost from NYU once challenged me on how I could trust a statistical method that breaks down with large samples - a poignant thought that eventually led to my co-authored paper Too Big To Fail: Large Samples and the p-value Problem (Lin et al., ISR 2013). What about equivalence tests? In equivalence testing, a very large sample will behave properly: with more data we'll get narrower confidence intervals (more certainty). A practically insignificant difference will therefore generate a narrow confidence interval that sits inside the equivalence bounds and excludes the equivalence boundary (=equivalence). In contrast, a practically significant difference will generate a narrow confidence interval that lies completely beyond the equivalence boundary (non-equivalence).
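
The effect of sample size can be illustrated with a small Python sketch. Everything numeric here is an assumption chosen for illustration: two groups of equal size n with a common standard deviation of 1, an equivalence bound of 0.2, and a normal approximation. The helper `summarize` is hypothetical and returns both the ordinary two-sided p-value and the equivalence conclusion.

```python
from math import sqrt
from statistics import NormalDist


def summarize(diff, sd, n, bound, alpha=0.05):
    """Return (p_value, (lower, upper), equivalent) for two groups of size n."""
    se = sd * sqrt(2 / n)                          # standard error of the difference
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    z = NormalDist().inv_cdf(1 - alpha)            # 90% CI when alpha = 0.05
    lower, upper = diff - z * se, diff + z * se
    equivalent = (-bound < lower) and (upper < bound)
    return p_value, (lower, upper), equivalent


bound = 0.2  # assumed smallest meaningful difference

# A tiny difference (0.05) with a small and a very large sample.
small_n = summarize(diff=0.05, sd=1.0, n=50, bound=bound)
large_n = summarize(diff=0.05, sd=1.0, n=100_000, bound=bound)

# A practically meaningful difference (0.5) with a very large sample.
large_effect = summarize(diff=0.5, sd=1.0, n=100_000, bound=bound)

for label, result in [("small n", small_n),
                      ("large n", large_n),
                      ("large n, large effect", large_effect)]:
    p, ci, eq = result
    print(f"{label}: p={p:.4f}, CI=({ci[0]:.3f}, {ci[1]:.3f}), equivalent={eq}")
```

In this sketch the large sample makes the tiny difference "statistically significant" in the ordinary test, while the narrow interval stays inside the bounds, so the equivalence conclusion is the sensible one. With the large effect, the narrow interval lies entirely beyond the bound.
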

Tuesday, September 05, 2017

My videos for “Business Analytics using Data Mining” now publicly available!

Five years ago, in 2012, I decided to experiment in improving my teaching by creating a flipped classroom (and semi-MOOC) for my course “Business Analytics Using Data Mining” (BADM) at the Indian School of Business. I initially designed the course at University of Maryland’s Smith School of Business in 2005 and taught it until 2010. When I joined ISB in 2011 I started teaching multiple sections of BADM (which was started by Ravi Bapna in 2006), and the course was fast growing in popularity. Repeating the same lectures in multiple course sections made me realize it was time for scale! I therefore created 30+ videos, covering various supervised methods (k-NN, linear and logistic regression, trees, naive Bayes, etc.) and unsupervised methods (principal components analysis, clustering, association rules), as well as important principles such as performance evaluation, the notion of a holdout set, and more.

I created the videos to support teaching with our textbook “Data Mining for Business Analytics” (the 3rd edition and a SAS JMP edition came out last year; R edition coming out this month!). The videos highlight the key points in different chapters, (hopefully) motivating the watcher to read more in the textbook, which also offers more examples. The videos’ order follows my course teaching, but the topics are mostly independent.

The videos were a big hit in the ISB courses. Since moving to Taiwan, I've created and offered a similar flipped BADM course at National Tsing Hua University, and the videos are also part of the Predictive Analytics series. I’ve since added a few more topics (e.g., neural nets and discriminant analysis).

The audience for the videos (and my courses and textbooks) is non-technical folks who need to understand the logic and uses of data mining, at the managerial level. The videos are therefore about problem solving, and hence the "Business Analytics" in the title. They are different from the many excellent machine learning videos and MOOCs in focus and in technical level -- a basic statistics course that covers linear regression and some business experience should be sufficient for understanding the videos.
For 5 years, and until last week, the videos were only available to past and current students. However, the word spread and many colleagues, instructors, and students have asked me for access. After 5 years, and in celebration of the first R edition of our textbook Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, I decided to make it happen. All 30+ videos are now publicly available on my BADM YouTube playlist.

Currently the videos cater only to those who understand English. I opened the option for community-contributed captions, in the hope that folks will contribute captions in different languages to help make the knowledge propagate further.

This new playlist complements a similar set of videos, on "Business Analytics Using Forecasting" (for time series), that I created at NTHU and made public last year, as part of a MOOC offered on FutureLearn with the next round opening in October.

Finally, I’ll share that I shot these videos while I was living in Bhutan. They are all homemade -- I tried to filter out barking noises and to time the recording when ceremonies were not held close to our home. If you’re interested in how I made the materials and what lessons I learned for flipping my first course, check out my 2012 post.

Tuesday, March 14, 2017

Data mining algorithms: how many dummies?

There are lots of posts on "k-NN for Dummies". This one is about "Dummies for k-NN".

Categorical predictor variables are very common. Those who've taken a statistics course covering linear (or logistic) regression know that including a categorical predictor in a regression model requires the following steps:

  1. Convert the categorical variable that has m categories, into m binary dummy variables
  2. Include only m-1 of the dummy variables as predictors in the regression model (the dropped out category is called the reference category)
For example, if we have X={red, yellow, green}, in step 1 we create three dummies:
D_red = 1 if the value is 'red' and 0 otherwise
D_yellow = 1 if the value is 'yellow' and 0 otherwise
D_green = 1 if the value is 'green' and 0 otherwise

In the regression model we might have: Y = b0 + b1 D_red + b2 D_yellow + error
[Note: mathematically, it does not matter which dummy you drop out: the regression coefficients b1, b2 now compare against the left-out category].
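
The two regression steps can be sketched in plain Python. The helper `make_dummies` is hypothetical (it is not a library function), and the three sample values are made up. The last check illustrates the usual reason for dropping one dummy in regression: the m dummies always sum to 1, which duplicates the intercept column.

```python
def make_dummies(values, categories, drop=None):
    """Create one 0/1 dummy per category, optionally dropping a reference category."""
    kept = [c for c in categories if c != drop]
    return [{f"D_{c}": int(v == c) for c in kept} for v in values]


colors = ["red", "yellow", "green"]      # the m = 3 categories
sample = ["red", "yellow", "green"]      # hypothetical data values

# Step 1: m dummies.
full = make_dummies(sample, colors)

# Step 2: keep m-1 dummies; 'green' becomes the reference category.
for_regression = make_dummies(sample, colors, drop="green")

for value, row in zip(sample, for_regression):
    print(value, row)

# With all m dummies, every row sums to 1 (same as the intercept column).
print(all(sum(row.values()) == 1 for row in full))
```

A 'green' record has zeros on both remaining dummies, so its predicted value is b0, and b1 and b2 measure the differences of red and yellow from green.
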

When you move to data mining algorithms such as k-NN or trees, the procedure is different: we include all m dummies as predictors when m>2, but in the case m=2, we use a single dummy. Dropping a dummy (when m>2) will distort the distance measure, leading to incorrect distances.
Here's an example, based on X = {red, yellow, green}:

Case 1: m=3 (use 3 dummies)

Here are 3 records, their category (color), and their dummy values on (D_red, D_yellow, D_green):

Record   Color    D_red   D_yellow   D_green
#1       red      1       0          0
#2       yellow   0       1          0
#3       green    0       0          1

The distance between each pair of records (in terms of color) should be identical, since all three records are different from each other. Suppose we use squared Euclidean distance (it ranks neighbors in the same order as Euclidean distance). The distance between each pair of records will be equal to 2. For example:

Distance(#1, #2) = (1-0)^2 + (0-1)^2 + (0-0)^2 = 2.

If we drop one dummy, then the three distances will no longer be identical! For example, if we drop D_green:
Distance(#1, #2) = 1 + 1 = 2
Distance(#1, #3) = 1
Distance(#2, #3) = 1
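
The distances in Case 1 can be reproduced with a short Python sketch. It uses squared Euclidean distance, matching the arithmetic above, and assumes record #1 is red, #2 is yellow, and #3 is green. The helper names `encode`, `squared_distance`, and `pairwise` are hypothetical.

```python
from itertools import combinations

# The three records and their colors (assumed from the example).
records = {1: "red", 2: "yellow", 3: "green"}


def encode(color, categories):
    """Return the 0/1 dummy vector of a color over the given categories."""
    return [int(color == c) for c in categories]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise(categories):
    """Squared distance for every pair of records under a given dummy set."""
    return {
        (i, j): squared_distance(encode(records[i], categories),
                                 encode(records[j], categories))
        for i, j in combinations(records, 2)
    }


all_three = pairwise(["red", "yellow", "green"])   # keep all m = 3 dummies
dropped_green = pairwise(["red", "yellow"])        # drop D_green

print(all_three)       # {(1, 2): 2, (1, 3): 2, (2, 3): 2}
print(dropped_green)   # {(1, 2): 2, (1, 3): 1, (2, 3): 1}
```

With all three dummies every pair is equally far apart; after dropping D_green, the green record looks closer to the other two than they are to each other.
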

Case 2: m=2 (use single dummy)

The above problem doesn't happen with m=2. Suppose we have only {red, green}, and use a single dummy. The distance between a pair of records will be 0 if the records are the same color, or 1 if they are different.
Why not use 2 dummies? If we use two dummies, we are doubling the weight of this variable but not adding any information. For example, comparing the red and green records using D_red and D_green would give Distance(#1, #3) = 1 + 1 = 2.

So we end up with distances of 0 or 2 instead of distances of 0 or 1.
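
The m=2 case can be checked the same way. This is a small sketch with hypothetical names (`single`, `double`, `squared_distance`), again using squared Euclidean distance as in the arithmetic above.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# One dummy (D_red) versus two dummies (D_red, D_green) for {red, green}.
single = {"red": [1], "green": [0]}
double = {"red": [1, 0], "green": [0, 1]}

same_single = squared_distance(single["red"], single["red"])
diff_single = squared_distance(single["red"], single["green"])
same_double = squared_distance(double["red"], double["red"])
diff_double = squared_distance(double["red"], double["green"])

print(same_single, diff_single)   # 0 1
print(same_double, diff_double)   # 0 2
```

Both encodings separate the same pairs of records, so the second dummy adds no information; it only doubles the contribution of color relative to the other predictors in the distance.
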

Bottom line 

In data mining methods other than regression models (e.g., k-NN, trees, k-means clustering), we use m dummies for a categorical variable with m categories - this is called one-hot encoding. But if m=2 we use a single dummy.