
Thursday, December 22, 2016

Key challenges in online experiments: where are the statisticians?

Randomized experiments (or randomized controlled trials, RCT) are a powerful tool for testing causal relationships. Their main principle is random assignment, where subjects or items are assigned randomly to one of the experimental conditions. A classic example is a clinical trial with one or more treatment groups and a no-treatment (control) group, where individuals are assigned at random to one of these groups.
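As a minimal illustration of random assignment (a sketch added here for concreteness; the subject count and the 50:50 split are arbitrary assumptions), subjects might be split into groups like this:

```python
# Minimal sketch of random assignment (illustrative only): shuffle the
# subject IDs and split them into a treatment group and a control group.
import numpy as np

rng = np.random.default_rng(seed=42)
subject_ids = np.arange(1_000)        # hypothetical subject identifiers
shuffled = rng.permutation(subject_ids)
treatment_group = shuffled[:500]      # first half -> treatment
control_group = shuffled[500:]        # second half -> control
```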

Story 1: (Internet) experiments in industry 

Internet experiments have now become a major activity in giant companies such as Amazon, Google, and Microsoft, in smaller web-based companies, and among academic researchers in management and the social sciences. The buzzword "A/B Testing" refers to the most common and simplest design which includes two groups (A and B), where subjects -- typically users -- are assigned at random to group A or B, and an effect of interest is measured. A/B tests are used for testing anything from the effect of a new website feature on engagement to the effect of a new language translation algorithm on user satisfaction. Companies run lots of experiments all the time. With a large and active user-base, you can run an internet experiment very quickly and quite cheaply. Academic researchers are now also beginning to use large scale randomized experiments to test scientific hypotheses about social and human behavior (as we did in One-Way Mirrors in Online Dating: A Randomized Field Experiment).
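For concreteness, here is a minimal sketch of how such a test might be analyzed (the group sizes and conversion rates are made-up numbers, and the two-proportion z-test is just one common choice, not necessarily what any particular company uses):

```python
# A minimal A/B-test sketch (illustrative numbers): users are split at random
# into groups A and B and a binary engagement outcome is compared with a
# two-proportion z-test.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n_a, n_b = 50_000, 50_000                  # hypothetical group sizes
conv_a = rng.binomial(n_a, 0.100)          # conversions under A (baseline)
conv_b = rng.binomial(n_b, 0.102)          # conversions under B (new feature)

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled rate under H0: no effect
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))              # two-sided p-value
print(f"lift = {p_b - p_a:.4f}, z = {z:.2f}, p = {p_value:.4f}")
```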

Based on our experience in this domain and on what I learned from colleagues and former students working in such environments, several critical issues challenge the ability to draw valid conclusions from internet experiments. Here are three:
  1. Contaminated data: Companies constantly run online experiments that introduce interventions of different types (such as various promotions, changes to website features, and switches of underlying technologies). The result is that we never have "clean data" for an experiment: the data are always somewhat contaminated by other experiments running in parallel, and in many cases we do not even know which experiments have taken place, or when.
  2. Spill-over effects: in a randomized experiment we assume that each observation/user experiences only one treatment (or control). However, in experiments that involve an intervention such as knowledge sharing (e.g., the treatment group receives information about a new service while the control group does not), the treatment might "spill over" to control group members through social networks, online forums, and other information-sharing platforms that are now common. For example, many researchers use Amazon Mechanical Turk to conduct experiments, where, as DynamoWiki describes, "workers" (the experiment subjects) share information, establish norms, and build community through platforms like CloudMeBaby, MTurk Crowd, mTurk Forum, mTurk Grind, Reddit's /r/mturk and /r/HITsWorthTurkingFor, Turker Nation, and Turkopticon. This means that the control group can be "contaminated" by the treatment effect; the small simulation after this list illustrates how such spill-over attenuates the estimated effect.
  3. Gift effect: Treatments that benefit the treated subjects in some way (such as a special promotion or an advanced feature) can confound the effect of the treatment with the effect of receiving special attention. In other words, the difference in outcomes between the treatment and control groups might be due not to the treatment per se but rather to the "special attention" the treatment group received from the company or researcher.
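To make the spill-over point concrete, here is a small simulation (my own illustration; the conversion rates and the 30% spill-over share are assumptions, not numbers from any real experiment) showing how contamination of the control group shrinks the naive estimate of the treatment effect:

```python
# Simulate a treatment that raises conversion from 0.10 to 0.12, but where
# a share of "control" users effectively receive the treatment anyway
# through information sharing. The naive difference in means is attenuated.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 0.02          # assumed lift: 0.10 -> 0.12
spill_rate = 0.30           # assumed share of contaminated control users

treated_rate = rng.binomial(n, 0.10 + true_effect) / n
spilled = rng.random(n) < spill_rate                  # contaminated controls
control_p = np.where(spilled, 0.10 + true_effect, 0.10)
control_rate = rng.binomial(1, control_p).mean()

print(f"naive estimate: {treated_rate - control_rate:.4f} "
      f"vs true effect {true_effect}")
# With 30% spill-over, the naive difference recovers only ~70% of the effect.
```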

Story 2: Statistical discipline of Experimental Design 

Design of Experiments (DOE or DOX) is a subfield of statistics focused on creating the most efficient designs for an experiment, and the most appropriate analysis. Efficiency matters in contexts where each run is very expensive or resource-consuming in some way; hence, the goal of DOE methodology is to answer the causal questions of interest with the smallest number of runs (observations). The statistical methodology of DOE was motivated by agricultural applications in the early 20th century, led by the famous Ronald Fisher. DOE methodology gained further momentum in the context of industrial experiments (today it is typically considered part of "industrial statistics"). Currently, the most active research area within DOE is "computer experiments," which focuses on constructing simulations to emulate a physical system in cases where experimentation is impossible, impractical, or terribly expensive (e.g., experimenting on climate).

Do the two stories converge? 

With the current heavy use of online experiments by companies, one would have thought the DOE discipline would flourish: new research problems, plenty of demand from industry for collaboration, troves of new students. Yet, I hear that the number of DOE researchers in US universities is shrinking. Most "business analytics" or "data science" programs do not have a dedicated course on experimental design (with a focus on internet experiments). Recent DOE papers in top industrial statistics journals (e.g., Technometrics) and DOE conferences indicate that the burning topics from Story 1 are missing. Academic DOE research by statisticians seems to continue focusing on the scarce-data context and on experiments on "things" rather than on human subjects. The Wikipedia page on DOE tells a similar story. I tried to make these points and others in my recent paper Analyzing Behavioral Big Data: Methodological, practical, ethical, and moral issues. Hopefully the paper and this post will encourage DOE researchers to address such burning issues and take the driver's seat in creating designs and analyses for researchers and companies conducting "big experiments".

Friday, August 09, 2013

Predictive relationships and A/B testing

I recently watched an interesting webinar on Seeking the Magic Optimization Metric: When Complex Relationships Between Predictors Lead You Astray by Kelly Uphoff, manager of experimental analytics at Netflix. The presenter mentioned that Netflix is a heavy user of A/B testing for experimentation, and in this talk focused on the goal of optimizing retention.

In ideal A/B testing, the company would test the effect of an intervention of choice (such as displaying a promotion on its website) on retention, by assigning it to a random sample of users and then comparing retention in the intervention group to that in a control group that was not subject to the intervention. This experimental setup can help infer a causal effect of the treatment on retention. The problem is that retention can take a long time to measure: if retention is defined as "customer paid for the next 6 months", you have to wait 6 months before you can determine the outcome.
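One natural response, sketched below purely as an illustration (the "first-week viewing hours" proxy and all numbers are my assumptions, not anything stated in the talk), is to check on historical data how strongly an early, quickly measurable metric relates to the delayed retention outcome before treating it as a stand-in:

```python
# Hedged sketch: generate synthetic historical data and measure how well an
# early proxy (first-week hours) tracks the delayed outcome (6-month retention).
import numpy as np

rng = np.random.default_rng(1)
hours_week1 = rng.gamma(shape=2.0, scale=3.0, size=10_000)    # synthetic usage
logit = -2.0 + 0.35 * hours_week1                             # assumed link
retained_6mo = rng.random(10_000) < 1 / (1 + np.exp(-logit))  # synthetic labels

# Point-biserial correlation between the proxy and the delayed outcome
r = np.corrcoef(hours_week1, retained_6mo.astype(float))[0, 1]
print(f"correlation of first-week hours with 6-month retention: {r:.2f}")
```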

Monday, September 19, 2011

Statistical considerations and psychological effects in clinical trials

I find it illuminating to read statistics "bibles" in various fields; they not only open my eyes to different domains, but also present the statistical approach and methods somewhat differently, considering unique domain-specific issues that produce "hmmm" moments.

The 4th edition of Fundamentals of Clinical Trials, whose authors combine extensive practical experience at NIH and in academia, is full of "hmmm" moments. In one, the authors mention an important issue related to sampling that I have not encountered in other fields. In clinical trials, the gold standard is to allocate participants at random, with equal probabilities, to either an intervention group or a non-intervention (baseline) group. In other words, half the participants receive the intervention and the other half do not (the non-intervention can be a placebo, the traditional treatment, etc.). The authors advocate a 50:50 ratio, because "equal allocation is the most powerful design". While there are reasons to change the ratio in favor of the intervention or baseline group, equal allocation appears to have an important additional psychological advantage over unequal allocation in clinical trials:
Unequal allocation may indicate to the participants and to their personal physicians that one intervention is preferred over the other (pp. 98-99)
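To see why equal allocation is described as the most powerful design, consider the variance of the difference in group means for a fixed total sample size n (assuming, as a simplification, equal outcome variances in the two groups):

```latex
\operatorname{Var}(\bar{Y}_T - \bar{Y}_C) = \sigma^2\left(\frac{1}{n_T} + \frac{1}{n_C}\right),
\qquad n_T + n_C = n .
```

This quantity is minimized at n_T = n_C = n/2, so the 50:50 split yields the smallest standard error and hence the greatest power to detect a given effect.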
Knowledge of the sample design by the participants and/or the physicians also affects how randomization is carried out. It becomes a game between the designers and the participants and staff, where the two sides have opposing interests: to blur vs. to uncover the group assignments before they are made. This gaming requires devising special randomization methods (which, in turn, require data analysis that takes the randomization mechanism into account).

For example, to assure an equal number of participants in each of the two groups when participants enter sequentially, "block randomization" can be used. For instance, to assign 4 people to one of two groups A or B, consider all the possible arrangements of two A's and two B's (AABB, ABAB, ABBA, BABA, BAAB, BBAA), choose one sequence at random, and assign participants accordingly. The catch is that if the staff know that the block size is 4 and know the first three allocations, they automatically know the fourth allocation and can introduce bias by using this knowledge to select every fourth participant.
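Here is a minimal sketch of permuted-block randomization with block size 4 (my own illustration, not code from the book):

```python
# Permuted-block randomization, block size 4: each block contains exactly
# two A's and two B's in random order, so group sizes stay balanced as
# participants enroll sequentially.
import itertools
import random

# The 6 distinct orderings of two A's and two B's
BLOCK_ARRANGEMENTS = sorted(set(itertools.permutations("AABB")))

def block_randomize(n_participants, seed=None):
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_participants:
        assignments.extend(rng.choice(BLOCK_ARRANGEMENTS))
    return assignments[:n_participants]

print(block_randomize(8, seed=3))
# Note the weakness described above: once the first three letters of a block
# are known, the fourth is determined, which is what enables selection bias.
```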

Where else does such a psychological effect play a role in determining sampling ratios? In applications where participants and other stakeholders have no knowledge of the sampling scheme this is obviously a non-issue. For example, when Amazon or Yahoo! presents different information to different users, the users have no idea about the sample design, and may not even know that they are in an experiment. But how is the randomization achieved? Unless the randomization process is fully automated and not susceptible to reverse engineering, someone in the technical department might decide to favor friends by allocating them to the "better" group...

Thursday, July 14, 2011

Designing an experiment on a spatial network: To Explain or To Predict?

[Image from http://www.slews.de]
Spatial data are inherently important in environmental applications. An example is collecting data from air- or water-quality sensors. Such data collection mechanisms introduce dependence in the collected data due to spatial proximity/distance. This dependence must be taken into account not only in the data analysis stage (and there is a good statistical literature on spatial data analysis methods), but also in the design-of-experiments stage. Examples of design questions are where to locate the sensors and how many sensors are needed.
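As one simplified illustration of how spatial dependence can enter the design stage, the sketch below scores a candidate sensor layout by its average simple-kriging prediction variance over a grid, assuming an exponential covariance function with known parameters; everything here, including the covariance parameters and the layout, is an assumption for illustration only.

```python
# Score a candidate set of sensor locations by the average simple-kriging
# prediction variance over a prediction grid, under an assumed (known)
# exponential covariance function.
import numpy as np

def exp_cov(d, sigma2=1.0, range_=0.3):
    """Exponential covariance as a function of distance d (assumed known)."""
    return sigma2 * np.exp(-d / range_)

def avg_kriging_variance(sensors, grid, sigma2=1.0, range_=0.3):
    """Average simple-kriging prediction variance over `grid` given `sensors`."""
    d_ss = np.linalg.norm(sensors[:, None, :] - sensors[None, :, :], axis=-1)
    d_gs = np.linalg.norm(grid[:, None, :] - sensors[None, :, :], axis=-1)
    C = exp_cov(d_ss, sigma2, range_) + 1e-9 * np.eye(len(sensors))  # jitter
    c0 = exp_cov(d_gs, sigma2, range_)                   # grid x sensors
    # prediction variance: sigma^2 - c0' C^{-1} c0 at each grid point
    quad = np.einsum("ij,ij->i", c0, np.linalg.solve(C, c0.T).T)
    return (sigma2 - quad).mean()

rng = np.random.default_rng(0)
grid = np.stack(np.meshgrid(np.linspace(0, 1, 20),
                            np.linspace(0, 1, 20)), -1).reshape(-1, 2)
candidate = rng.random((8, 2))                           # a random 8-sensor layout
print(f"average prediction variance: {avg_kriging_variance(candidate, grid):.3f}")
```

In practice one would compare such a score across many candidate layouts (or optimize over them); this prediction-oriented criterion is one of the two design objectives contrasted in the article discussed next.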

Where does explain vs. predict come into the picture? An interesting 2006 article by Dale Zimmerman called "Optimal network design for spatial prediction, covariance parameter estimation, and empirical prediction" tells the following story:
"...criteria for network design that emphasize the utility of the network for prediction (kriging) of unobserved responses assuming known spatial covariance parameters are contrasted with criteria that emphasize the estimation of the covariance parameters themselves. It is shown, via a series of related examples, that these two main design objectives are largely antithetical and thus lead to quite different “optimal” designs" 
(Here is the freely available technical report).