Thursday, December 22, 2016

Key challenges in online experiments: where are the statisticians?

Randomized experiments (or randomized controlled trials, RCT) are a powerful tool for testing causal relationships. Their main principle is random assignment, where subjects or items are assigned randomly to one of the experimental conditions. A classic example is a clinical trial with one or more treatment groups and a no-treatment (control) group, where individuals are assigned at random to one of these groups.
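To make the random-assignment step concrete, here is a minimal sketch in Python (my own illustration, with hypothetical subject IDs, not part of any particular study): subjects are shuffled and then split between a treatment group and a control group.

```python
import random

def assign_groups(subjects, seed=2016):
    """Completely randomized assignment: shuffle the subjects, then split in half."""
    rng = random.Random(seed)   # fixed seed so the assignment is reproducible
    shuffled = list(subjects)   # copy so the original list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Hypothetical subject IDs
print(assign_groups(["s1", "s2", "s3", "s4", "s5", "s6"]))
```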

Story 1: (Internet) experiments in industry 

Internet experiments have now become a major activity in giant companies such as Amazon, Google, and Microsoft, in smaller web-based companies, and among academic researchers in management and the social sciences. The buzzword "A/B Testing" refers to the simplest and most common design, which includes two groups (A and B): subjects -- typically users -- are assigned at random to group A or B, and an effect of interest is measured. A/B tests are used for testing anything from the effect of a new website feature on engagement to the effect of a new language translation algorithm on user satisfaction. Companies run lots of experiments all the time. With a large and active user base, you can run an internet experiment very quickly and quite cheaply. Academic researchers are now also beginning to use large-scale randomized experiments to test scientific hypotheses about social and human behavior (as we did in One-Way Mirrors in Online Dating: A Randomized Field Experiment).
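As a rough sketch of the measurement step (my own illustration with made-up numbers, not any company's actual pipeline), the simplest analysis of a two-group A/B test with a binary outcome is a two-proportion z-test on the conversion counts:

```python
from math import erf, sqrt

def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test comparing conversion rates in groups A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_b - p_a) / se                                    # effect of B relative to A
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided p-value
    return p_b - p_a, z, p_value

# Hypothetical counts: 1,000 users per arm, 110 vs. 135 conversions
lift, z, p = ab_test(conv_a=110, n_a=1000, conv_b=135, n_b=1000)
print(f"lift={lift:.3f}, z={z:.2f}, p-value={p:.4f}")
```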

Based on our experience in this domain and on what I learned from colleagues and former students working in such environments, there are multiple critical issues that challenge the ability to draw valid conclusions from internet experiments. Here are three:
  1. Contaminated data: Companies constantly conduct online experiments that introduce interventions of different types (such as running various promotions, changing website features, and switching underlying technologies). The result is that we never have "clean data" on which to run an experiment: the data are always somewhat contaminated by other experiments running in parallel, and in many cases we do not even know which experiments have taken place, or when.
  2. Spill-over effects: In a randomized experiment we assume that each observation/user experiences only one treatment (or the control). However, in experiments that involve an intervention such as knowledge sharing (e.g., the treatment group receives information about a new service while the control group does not), the treatment might "spill over" to control group members through social networks, online forums, and other information-sharing platforms that are now common. For example, many researchers use Amazon Mechanical Turk to conduct experiments, where, as DynamoWiki describes, "workers" (the experiment subjects) share information, establish norms, and build community through platforms like CloudMeBaby, MTurk Crowd, mTurk Forum, mTurk Grind, Reddit's /r/mturk and /r/HITsWorthTurkingFor, Turker Nation, and Turkopticon. This means that the control group can be "contaminated" by the treatment effect (a toy simulation after this list illustrates the resulting bias).
  3. Gift effect: Treatments that benefit the treated subjects in some way (such as a special promotion or an advanced feature) can confound the effect of the treatment itself with the effect of receiving special treatment. In other words, the difference in outcomes between the treatment and control groups might not be due to the treatment per se, but rather to the "special attention" the treatment group received from the company or researcher.
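Returning to the spill-over issue (item 2 above), here is a toy simulation, with entirely made-up numbers, showing how treatment information leaking to a fraction of the control group biases the naive difference-in-rates estimate toward zero:

```python
import random

def spillover_bias(n=10_000, base=0.10, true_effect=0.05, spill_rate=0.3, seed=1):
    """Naive treatment-effect estimate when part of the control group is also exposed."""
    rng = random.Random(seed)

    def rate(p):
        # Simulated conversion rate of n users who each convert with probability p
        return sum(rng.random() < p for _ in range(n)) / n

    treatment_rate = rate(base + true_effect)
    # A fraction `spill_rate` of control users effectively receive the treatment anyway
    control_rate = spill_rate * rate(base + true_effect) + (1 - spill_rate) * rate(base)
    return treatment_rate - control_rate  # attenuated relative to true_effect

print(f"true effect: 0.050, naive estimate: {spillover_bias():.3f}")
```

In this toy model, with 30% of the control group exposed, the naive estimate recovers only about 70% of the true effect.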

Story 2: Statistical discipline of Experimental Design 

Design of Experiments (DOE or DOX) is a subfield of statistics focused on creating the most efficient designs for an experiment, along with the most appropriate analysis. "Efficient" here refers to contexts where each run is very expensive or resource-intensive in some way; hence, the goal of DOE methodology is to answer the causal questions of interest with the smallest number of runs (observations). The statistical methodological development of DOE was motivated by agricultural applications in the early 20th century, led by the famous Ronald Fisher. DOE methodology gained further momentum in the context of industrial experiments (today it is typically considered part of "industrial statistics"). Currently, the most active research area within DOE is "computer experiments," which focuses on constructing simulations to emulate a physical system in cases where experimentation is impossible, impractical, or terribly expensive (e.g., experimenting on climate).
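To illustrate what "efficient" means here, consider a standard two-level, three-factor experiment (the code below is my own sketch): the full factorial requires eight runs, while a half-fraction built with the defining relation C = AB can estimate the main effects with only four.

```python
from itertools import product

# Full 2^3 factorial: 8 runs over factors A, B, C at coded levels -1/+1
full = [dict(zip("ABC", levels)) for levels in product((-1, 1), repeat=3)]

# Half-fraction 2^(3-1) using the defining relation C = A*B: only 4 runs
half = [{"A": a, "B": b, "C": a * b} for a, b in product((-1, 1), repeat=2)]

print(len(full), "runs in the full factorial")
print(len(half), "runs in the half fraction:", half)
```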

Do the two stories converge? 

With the current heavy use of online experiments by companies, one would have thought the DOE discipline would flourish: new research problems, plenty of demand from industry for collaboration, troves of new students. Yet I hear that the number of DOE researchers in US universities is shrinking. Most "business analytics" or "data science" programs do not have a dedicated course on experimental design (with a focus on internet experiments). Recent DOE papers in top industrial statistics journals (e.g., Technometrics) and at DOE conferences indicate that the burning topics from Story 1 are missing. Academic DOE research by statisticians seems to continue focusing on the scarce-data context and on experiments on "things" rather than on human subjects. The Wikipedia page on DOE tells a similar story. I tried to make these points and others in my recent paper Analyzing Behavioral Big Data: Methodological, practical, ethical, and moral issues. Hopefully the paper and this post will encourage DOE researchers to address such burning issues and take the driver's seat in creating designs and analyses for researchers and companies conducting "big experiments".