Friday, August 09, 2013

Predictive relationships and A/B testing

I recently watched an interesting webinar, "Seeking the Magic Optimization Metric: When Complex Relationships Between Predictors Lead You Astray", by Kelly Uphoff, manager of experimental analytics at Netflix. The presenter mentioned that Netflix is a heavy user of A/B testing for experimentation, and the talk focused on the goal of optimizing retention.

In ideal A/B testing, the company would test the effect of an intervention of choice (such as displaying a promotion on its website) on retention by assigning the intervention to a random sample of users, and then comparing the retention of this intervention group to that of a control group that was not subject to the intervention. This experimental setup can help infer a causal effect of the treatment on retention. The problem is that retention can take a long time to measure: if retention is defined as "customer paid for the next 6 months", you have to wait 6 months before you can determine the outcome.

The majority of the talk was therefore devoted to the task of finding a "secondary metric", or a proxy measurement, for customer retention, which is the "primary metric". This secondary metric is then to be used in the A/B testing in place of actual retention. Ms. Uphoff described the search for a good secondary metric as a search for a measurement that is predictive of retention. The slides describe the use of predictive measures, such as ROC curves, and predictive algorithms, such as random forests, for finding a good secondary metric.

The focus of the talk, however, is on the challenges that arise when using a particular secondary metric, "Fraction of Content Viewed" (FCV), in A/B testing. This metric showed high predictive power for retention. Yet the main challenge in using it for A/B testing is the non-monotone* relationship between FCV and retention. Intuitively, retention should be positively correlated with FCV, because satisfied customers would have high FCV due to "being served content that they like". Hence, treatments that maximize FCV would be expected to maximize retention. However, it turns out that while for most customers higher retention rates are associated with higher FCV levels, this relationship is reversed for a subset of customers who watch a single movie from start to end but then leave. Thus, maximizing FCV does not necessarily maximize retention.
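To make the non-monotone pattern concrete, here is a minimal sketch of how one might check it on data: bin customers by FCV and compare retention rates across bins. The records, the bin count, and the function names are made up for illustration; none of them come from the talk.

```python
def retention_by_fcv_bin(records, n_bins=5):
    """Retention rate per equal-width FCV bin (None for an empty bin).

    records: iterable of (fcv, retained) pairs, with fcv in [0, 1].
    """
    totals = [0] * n_bins
    retained = [0] * n_bins
    for fcv, kept in records:
        idx = min(int(fcv * n_bins), n_bins - 1)  # fcv == 1.0 goes in the last bin
        totals[idx] += 1
        retained[idx] += int(kept)
    return [r / t if t else None for r, t in zip(retained, totals)]


def is_monotone_increasing(rates):
    """True if the non-empty bins never decrease as FCV goes up."""
    observed = [r for r in rates if r is not None]
    return all(a <= b for a, b in zip(observed, observed[1:]))


# Hypothetical customers: retention rises with FCV, except for a
# high-FCV group that watched one title to the end and then left.
records = (
    [(0.10, True)] * 2 + [(0.10, False)] * 8
    + [(0.50, True)] * 6 + [(0.50, False)] * 4
    + [(0.70, True)] * 8 + [(0.70, False)] * 2
    + [(0.95, True)] * 3 + [(0.95, False)] * 7
)

rates = retention_by_fcv_bin(records)
print(rates)
print("monotone:", is_monotone_increasing(rates))
```

In this toy data the retention rate climbs across the lower bins and then drops in the top bin, which is the kind of reversal described in the talk.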

Thus far, I have summarized the gist of the talk (and hopefully did it justice). But now, I want to raise a critical question that was not discussed: the distinction between inferring causality and achieving predictive accuracy. While A/B testing is aimed at detecting the average effect of an intervention on an outcome, predictive analytics is aimed at predicting the outcome for an individual customer. If we measure a secondary metric (FCV) that is predictive of the primary metric (retention) in the intervention and control groups, what does that tell us about the intervention's effect on the primary metric?

My initial thoughts are the following: Inferring a causal effect of an intervention on a secondary metric (FCV) does not guarantee that the same effect applies to the desired primary metric (retention). Suppose the promotional ad increases the fraction of content viewed (FCV) by an average of 1%, so that the intervention group has an FCV average 1% higher than the control group. And suppose that the sample size is sufficiently large that this difference is statistically significant. What does this guarantee in terms of retention rate for the intervention vs. control? Why would the causal relationship between intervention and the secondary metric imply a causal relationship between intervention and the primary metric? The problem arises due to the methodology used to link the primary and secondary metrics, which is based only on predictive power. The uncertainty associated with the A/B testing has to do with estimating the average intervention effect (we compare the average FCV of the intervention and control populations). In contrast, the uncertainty associated with the predictive association between FCV and retention is observation-level uncertainty.
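A toy calculation can illustrate why an effect on FCV need not carry over to retention. The sketch below assumes a purely hypothetical retention model and a hypothetical promotion that turns some regular viewers into single-title viewers; all numbers and names are invented for illustration and are not from the talk or from Netflix data.

```python
def retention_probability(fcv, single_title):
    """Hypothetical model: retention rises with FCV for regular viewers,
    but is low for customers who watch a single title and then leave."""
    if single_title:
        return 0.2
    return 0.5 + 0.4 * fcv


def make_group(n_regular, n_single_title):
    """Regular viewers get FCV spread evenly over [0.3, 0.7];
    single-title viewers have FCV 1.0 (watched start to end)."""
    regular = [(0.3 + 0.4 * i / (n_regular - 1), False) for i in range(n_regular)]
    single = [(1.0, True)] * n_single_title
    return regular + single


def summarize(group):
    """Return (mean FCV, expected retention rate) for a group."""
    mean_fcv = sum(fcv for fcv, _ in group) / len(group)
    expected_retention = sum(
        retention_probability(fcv, single) for fcv, single in group
    ) / len(group)
    return mean_fcv, expected_retention


# Assumed effect of the promotion: 10 more customers out of 100
# become single-title viewers in the intervention group.
control = make_group(n_regular=90, n_single_title=10)
treatment = make_group(n_regular=80, n_single_title=20)

control_fcv, control_retention = summarize(control)
treatment_fcv, treatment_retention = summarize(treatment)

print(f"control:   FCV={control_fcv:.2f}  retention={control_retention:.2f}")
print(f"treatment: FCV={treatment_fcv:.2f}  retention={treatment_retention:.2f}")
```

Under these assumptions the intervention group has the higher average FCV and the lower expected retention, so a "win" on the secondary metric coincides with a loss on the primary one. Whether anything like this happens in practice depends on how the intervention actually moves customers, which the predictive link between FCV and retention does not tell us.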

At this point, the key question is: what is the goal of the A/B testing? Do we care about the average effect of an intervention on the primary metric or will we be using it to predict the primary metric for individual customers? Is it a retrospective task or a prospective one? Clarifying the type of goal (causal-explanatory vs. predictive) is the first step. Given the type of goal, we can determine the type of relationship needed between the primary and secondary metrics.

*Although Ms. Uphoff used the terms linear and non-linear to describe the relationship between FCV and retention, I believe she meant monotone and non-monotone.
