Thursday, December 22, 2016

Key challenges in online experiments: where are the statisticians?

Randomized experiments (or randomized controlled trials, RCTs) are a powerful tool for testing causal relationships. Their main principle is random assignment, where subjects or items are assigned randomly to one of the experimental conditions. A classic example is a clinical trial with one or more treatment groups and a no-treatment (control) group, where individuals are assigned at random to one of these groups.

Story 1: (Internet) experiments in industry 

Internet experiments have now become a major activity in giant companies such as Amazon, Google, and Microsoft, in smaller web-based companies, and among academic researchers in management and the social sciences. The buzzword "A/B Testing" refers to the most common and simplest design which includes two groups (A and B), where subjects -- typically users -- are assigned at random to group A or B, and an effect of interest is measured. A/B tests are used for testing anything from the effect of a new website feature on engagement to the effect of a new language translation algorithm on user satisfaction. Companies run lots of experiments all the time. With a large and active user-base, you can run an internet experiment very quickly and quite cheaply. Academic researchers are now also beginning to use large scale randomized experiments to test scientific hypotheses about social and human behavior (as we did in One-Way Mirrors in Online Dating: A Randomized Field Experiment).
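
To make the A/B design concrete, here is a minimal Python sketch of random assignment followed by a difference-in-means estimate. The function names, the simulated engagement outcome, and the true lift of 0.5 are all hypothetical and only serve as illustration; a real experimentation platform would handle assignment and measurement very differently.

```python
import random
from statistics import mean

def assign_groups(user_ids, seed=0):
    """Randomly assign each user to group 'A' or 'B' with equal probability."""
    rng = random.Random(seed)
    return {uid: rng.choice("AB") for uid in user_ids}

def difference_in_means(outcomes, assignment):
    """Estimate the effect as mean(outcome in B) - mean(outcome in A)."""
    a = [y for uid, y in outcomes.items() if assignment[uid] == "A"]
    b = [y for uid, y in outcomes.items() if assignment[uid] == "B"]
    return mean(b) - mean(a)

# Hypothetical data: 1,000 users with simulated engagement scores,
# where group B gets an assumed true lift of 0.5
users = range(1000)
assignment = assign_groups(users)
rng = random.Random(1)
outcomes = {
    u: rng.gauss(10, 2) + (0.5 if assignment[u] == "B" else 0.0)
    for u in users
}
estimate = difference_in_means(outcomes, assignment)
print(round(estimate, 2))
```

Because assignment is random, the two groups are comparable on average, so the difference in means can be read as an estimate of the effect of B relative to A.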

Based on our experience in this domain and on what I have learned from colleagues and former students working in such environments, multiple critical issues challenge the ability to draw valid conclusions from internet experiments. Here are three:
  1. Contaminated data: Companies constantly conduct online experiments introducing interventions of different types (such as running various promotions, changing website features, and switching underlying technologies). The result is that we never have "clean data" for running an experiment, and we do not know in what way the data are dirty. The data are always somewhat contaminated by other experiments that are taking place in parallel, and in many cases we do not even know which experiments have taken place or when.
  2. Spill-over effects: in a randomized experiment we assume that each observation/user experiences only one treatment (or control). However, in experiments that involve an intervention such as knowledge sharing (eg, the treatment group receives information about a new service while the control group does not), the treatment might "spill over" to control group members through social networks, online forums, and other information-sharing platforms that are now common. For example, many researchers use Amazon Mechanical Turk to conduct experiments, where, as DynamoWiki describes, "workers" (the experiment subjects) share information, establish norms, and build community through platforms like CloudMeBaby, MTurk Crowd, mTurk Forum, mTurk Grind, Reddit's /r/mturk and /r/HITsWorthTurkingFor, Turker Nation, and Turkopticon. This means that the control group can be "contaminated" by the treatment effect.
  3. Gift effect: Treatments that benefit the treated subjects in some way (such as a special promotion or advanced feature) can confound the effect of the treatment with the effect of receiving special treatment. In other words, the difference between the outcomes for the treatment and control groups might not be due to the treatment per se but rather due to the "special attention" the treatment group received from the company or researcher.
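
The spill-over issue (item 2) can be illustrated with a small back-of-the-envelope sketch. It assumes, purely for illustration, that a fraction of control users is exposed to the treatment and receives its full effect; the function name and numbers are hypothetical.

```python
def naive_estimate(true_effect, spill_fraction):
    """Expected treatment-minus-control difference when a fraction of
    control users is exposed to the treatment and gets its full effect
    (a simplifying assumption)."""
    treated_mean_shift = true_effect
    control_mean_shift = spill_fraction * true_effect
    return treated_mean_shift - control_mean_shift

# With a true effect of 2.0, the naive comparison shrinks as spill-over grows
for s in (0.0, 0.25, 0.5):
    print(s, naive_estimate(2.0, s))
```

Under this assumption the naive comparison underestimates the true effect by the spill-over fraction, so an experiment can understate an intervention that actually works.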

Story 2: Statistical discipline of Experimental Design 

Design of Experiments (DOE or DOX) is a subfield of statistics focused on creating the most efficient designs for an experiment and the most appropriate analysis. Efficient here refers to a context where each run is very expensive or resource-consuming in some way. Hence, the goal in DOE methodology is to answer the causal questions of interest with the smallest number of runs (observations). The statistical methodological development of DOE was motivated by agricultural applications in the early 20th century, led by the famous Ronald Fisher. DOE methodology gained further momentum in the context of industrial experiments (today it is typically considered part of "industrial statistics"). Currently, the most active research area within DOE is "computer experiments", which focuses on constructing simulations to emulate a physical system for cases where experimentation is impossible, impractical, or terribly expensive (e.g., experimenting on climate).
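
As one textbook illustration of what "fewest runs" can mean, the sketch below compares a full factorial design for three two-level factors with a half-fraction design. The factor names A, B, and C are placeholders, and the choice C = A*B is just one standard way to build the fraction.

```python
from itertools import product

# Full 2^3 factorial: every combination of three two-level factors (coded -1/+1)
full = list(product((-1, 1), repeat=3))

# Half fraction 2^(3-1): choose A and B freely, then set C = A*B
half = [(a, b, a * b) for a, b in product((-1, 1), repeat=2)]

print(len(full), len(half))
for run in half:
    print(run)
```

The half fraction needs 4 runs instead of 8, at the price that the main effect of C can no longer be separated from the A-by-B interaction. That trade-off between runs and information is the kind of question classic DOE methodology addresses.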

Do the two stories converge? 

With the current heavy use of online experiments by companies, one would have thought the DOE discipline would flourish: new research problems, plenty of demand from industry for collaboration, droves of new students. Yet, I hear that the number of DOE researchers in US universities is shrinking. Most "business analytics" or "data science" programs do not have a dedicated course on experimental design (with a focus on internet experiments). Recent DOE papers in top industrial statistics journals (e.g., Technometrics) and DOE conferences indicate that the burning topics from Story 1 are missing. Academic DOE research by statisticians seems to continue focusing on the scarce-data context and on experiments on "things" rather than human subjects. The Wikipedia page on DOE also tells a similar story. I tried to make these points and others in my recent paper Analyzing Behavioral Big Data: Methodological, practical, ethical, and moral issues. Hopefully the paper and this post will encourage DOE researchers to address such burning issues and take the driver's seat in creating designs and analyses for researchers and companies conducting "big experiments".

Monday, October 24, 2016

Experimenting with quantified self: two months hooked up to a fitness band

It's one thing to collect and analyze behavioral big data (BBD) and another to understand what it means to be the subject of that data. To really understand. Yes, we're all aware that our social network accounts and IoT devices share our private information with large and small companies and other organizations. And although we complain about our privacy, we are forgiving about sharing it, most likely because we really appreciate the benefits.

So, I decided to check out my data sharing in a way that I cannot ignore: I started wearing a fitness band. I bought one of the simplest bands available, the Mi Band Pulse from Xiaomi. It is not a smart watch but "simply" a fitness band that counts steps and measures heart rate and sleep. Its low price means that it is likely to spread to many users. Hence, BBD at speed on a diverse population.

I had two questions in mind:
  1. Can I limit the usage so that I only get the benefits I really need/want while avoiding generating and/or sharing data I want to keep private?
  2. What do I not know about the data generated and shared?
I tried to be as "private" as possible, never turning on location, turning on bluetooth for synching my data with my phone only once a day for a few minutes, turning on notifications only for incoming phone calls, not turning on notifications for anything else (no SMS, no third party apps notifications), only using the Mi Fit app (not linking to other apps like Google Fit), and only "befriending" one family member who bought the same product.

While my experiment is not as daring as a physician testing his/her developed drugs on themselves, I did learn a few important lessons. Here is what I discovered:
  • Despite intentionally turning off my phone bluetooth right after synching the band's data every day (which means the device should not be able to connect to my phone, or the cloud), the device and my phone did continue having secret "conversations". I discovered this when my band vibrated on incoming phone calls when bluetooth was turned off. This happened on multiple occasions. I am not sure what part of the technology chain to blame - my Samsung phone? the Android operating system? the Mi Band? It doesn't really matter. The point is: I thought the band was disconnected from my phone, but it was not.
  • Every time I opened the Mi Fit app on my phone, it requested access to location, which I had to decline every time. Quite irritating, and I am guessing many users just turn it on to avoid the irritation.
  • My family member who was my "friend" through the Mi Fit app could see all my data: number of steps, heart rate measurements (whenever I asked the band to measure heart rate), weight (which I entered manually), and even... my sleep hours. Wait: my family member now knew exactly what time I went to sleep and what time I got up every morning. That's something I didn't consider when we decided to become "friends". Plus, Xiaomi has the information about our "relationship" including "pokes" that we could send each other (causing a vibration).
  • The sleep counter claims to measure "deep sleep". It was showing that even if I slept 8 hours and felt wonderfully refreshed, my deep sleep was only 1 hour. At first I was alarmed. With a bit of searching I learned that it doesn't mean much. But the alarming effect made me realize that even this simple device is not for everyone. Definitely not for over-worried people.
After 2 months of wearing the band 24/7, what have I learned about myself that I didn't already know? That my "deep sleep" is only one hour per night (whatever that means). And that the vibration of the band when I conquer the 8000 steps/day goal feels good. Addictively good, despite being meaningless. All the other data was really no news (I walk a lot on days when I teach, my weight is quite stable, my heart rate is reasonable, and I go to sleep too late).

To answer my two questions:
  1. Can I limit the usage to only gain my required benefits? Not really. Some is beyond my control, and in some cases I am not aware of what I am sharing.
  2. What do I not know? Quite a bit. I don't know what the device is really measuring (e.g., "deep sleep"), how it is sharing it (when my bluetooth is off), and what the company does with it.
Is my "behavioral" data useful? To me probably not. To Xiaomi probably yes. One small grain of millet (小米 = XiaoMi) is not much, but a whole sack is valuable.

So what now? I move to Mi Band 2. Its main new feature is a nudge every hour when you're sitting for too long.

Tuesday, April 26, 2016

Statistical software should remove *** notation for statistical significance

Now that the emotional storm following the American Statistical Association's statement on p-values is slowing down (is it? was there even a storm outside of the statistics area?), let's think about a practical issue. One that greatly influences data analysis in most fields: statistical software. Statistical software influences which methods are used and how they are reported. Software companies thus affect entire disciplines and how they progress and communicate.

Star notation for p-value thresholds in statistical software

No matter whether your field uses SAS, SPSS (now IBM), Stata, or another statistical software package, you're likely to have seen the star notation (this isn't about hotel ratings). One star (*) means p-value < 0.05, two stars (**) mean p-value < 0.01, and three stars (***) mean p-value < 0.001.
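
In code, the star convention boils down to a few threshold checks. The following Python sketch uses the thresholds listed above; the function name is made up for illustration and is not taken from any particular package.

```python
def stars(p):
    """Map a p-value to the conventional star notation."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

# Quite different p-values end up with the same label
for p in (0.049, 0.0168, 0.011):
    print(f"p = {p}: {stars(p)}")
```

All three values in the loop receive a single star, so the label hides differences that the numeric p-value would show.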

According to the ASA statement, p-values are not the source of the problem, but rather their discretization. The ASA recommends:

"P-values, when used, would be reported as values, rather than inequalities (p = .0168, rather than p < 0.05). Indeed, we envision there being better recognition that measurement of the strength of evidence really is continuous, rather than discrete."

This statement is a strong signal to the statistical software companies: continuing to use the star notation, even if your users are addicted to it, is in violation of the ASA recommendation. Will we be seeing any change soon?

Thursday, March 24, 2016

A non-traditional definition of Big Data: Big is Relative

I've noticed that in almost every talk or discussion that involves the term Big Data, one of the first slides by the presenter or the first questions to be asked by the audience is "what is Big Data?" The typical answer has to do with some digits, many V's, terms that end with "bytes", or statements about software or hardware capacity.

I beg to differ.

"Big" is relative. It is relative to a certain field, and specifically to the practices in the field. We therefore must consider the benchmark of a specific field to determine if today's data are "Big". My definition of Big Data is therefore data that require a field to change its practices of data processing and analysis.

On the one extreme, consider weather forecasting, where data collection, huge computing power, and algorithms for analyzing huge amounts of data have been around for a long time. So is today's climatology data "Big" for the field of weather forecasting? Probably not, unless you start considering new types of data that the "old" methods cannot process or analyze.

Another example is the field of genetics, where researchers have been working with and analyzing large-scale datasets (notably from the Human Genome Project) for some time. The "Big Data" in this field is about linking different databases and integrating domain knowledge with the patterns found in the data ("As big-data researchers churn through large tumour databases looking for patterns of mutations, they are adding new categories of breast cancer.").

On the other extreme, consider studies in the social sciences, in fields such as political science or psychology that have traditionally relied on 3-digit sample sizes (if you were lucky). In these fields, a sample of 100,000 people is Big Data, because it challenges the methodologies used by researchers in the field.  Here are some of the challenges that arise:

  • Old methods break down: the common method of statistical significance tests for testing theory no longer works, as p-values will tend to be tiny irrespective of practical significance (one more reason to carefully consider the recent statement by the American Statistical Association about the danger of using the "p-value < 0.05" rule).
  • Technology challenge: the statistical software and hardware used by many social science researchers might not be able to handle these new data sizes. Simple operations such as visualizing 100,000 observations in a scatter plot require new practices and software (such as state-of-the-art interactive software packages). 
  • Social science researchers need to learn how to ask more nuanced questions, now that richer data are available to them. 
  • Social scientists are not trained in data mining, yet the new sizes of datasets can allow them to discover patterns that are not hypothesized by theory.
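
The first challenge, tiny p-values regardless of practical significance, can be seen in a small calculation. The sketch below assumes a two-sided z-test comparing two equal-sized groups with known variance, and an observed difference of exactly 0.02 standard deviations (a practically negligible effect); the function name and numbers are illustrative only.

```python
import math

def two_sample_p_value(effect_in_sd, n_per_group):
    """Two-sided p-value of a z-test for a difference in means between two
    equal-sized groups, with the difference expressed in standard deviations."""
    z = effect_in_sd / math.sqrt(2 / n_per_group)
    return math.erfc(abs(z) / math.sqrt(2))

small = two_sample_p_value(0.02, 100)
large = two_sample_p_value(0.02, 100_000)
print(f"n = 100 per group:     p = {small:.3f}")
print(f"n = 100,000 per group: p = {large:.1e}")
```

The same negligible difference is nowhere near "significant" with 100 people per group, yet falls far below 0.001 with 100,000 per group, which is why the threshold rule stops being informative at this scale.
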
In terms of "variety" of data types, Big Data is again area-dependent. While text and network data might be new to engineering fields, social scientists have long had experience with text data (qualitative researchers have analyzed interviews, videos, etc. for a long time) and with social network data (the origins of many of the metrics used today are in sociology).

In short, what is Big for one field can be considered small for another field. Big Data is field-dependent, and should be based on the "delta" (the difference) between previous data analysis practices and ones called for by today's data.