Tuesday, March 14, 2017

Data mining algorithms: how many dummies?

There are lots of posts on "k-NN for Dummies". This one is about "Dummies for k-NN".

Categorical predictor variables are very common. Those who've taken a statistics course covering linear (or logistic) regression know that including a categorical predictor in a regression model requires the following steps:

  1. Convert the categorical variable that has m categories, into m binary dummy variables
  2. Include only m-1 of the dummy variables as predictors in the regression model (the dropped out category is called the reference category)
For example, if we have X={red, yellow, green}, in step 1 we create three dummies:
D_red = 1 if the value is 'red' and 0 otherwise
D_yellow = 1 if the value is 'yellow' and 0 otherwise
D_green = 1 if the value is 'green' and 0 otherwise
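
Here is a minimal Python sketch of step 1. The helper name make_dummies and the sample records are made up for illustration; only the three color categories come from the example above.

```python
# Hypothetical helper: build one 0/1 dummy per category (step 1 above).
def make_dummies(values, categories):
    """Return one dict per record, mapping 'D_<category>' to 0 or 1."""
    return [
        {f"D_{c}": int(v == c) for c in categories}
        for v in values
    ]

colors = ["red", "yellow", "green"]          # the m = 3 categories
records = ["red", "green", "yellow", "red"]  # made-up sample records
dummies = make_dummies(records, colors)

for value, row in zip(records, dummies):
    print(value, row)
```

Each record has exactly one dummy equal to 1. For a regression model you would then keep only m-1 of these columns (step 2).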

In the regression model we might have: Y = b0 + b1 D_red + b2 D_yellow + error
[Note: mathematically, it does not matter which dummy you drop out: the regression coefficients b1, b2 now compare against the left-out category].

When you move to data mining algorithms such as k-NN or trees, the procedure is different: we include all m dummies as predictors when m>2, but in the case m=2, we use a single dummy. Dropping a dummy (when m>2) will distort the distance measure, leading to incorrect distances.
Here's an example, based on X = {red, yellow, green}:

Case 1: m=3 (use 3 dummies)

Here are 3 records, their category (color), and their dummy values on (D_red, D_yellow, D_green):

#1: red, (1, 0, 0)
#2: yellow, (0, 1, 0)
#3: green, (0, 0, 1)

The distance between each pair of records (in terms of color) should be identical, since all three records are different from each other. Suppose we use squared Euclidean distance. The distance between each pair of records will be equal to 2. For example:

Distance(#1, #2) = (1-0)^2 + (0-1)^2 + (0-0)^2 = 2.

If we drop one dummy, then the three distances will no longer be identical! For example, if we drop D_green:
Distance(#1, #2) = 1 + 1 = 2
Distance(#1, #3) = 1
Distance(#2, #3) = 1
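
The arithmetic above can be checked with a short Python sketch. It assumes records #1, #2, #3 are red, yellow, and green (as the distance calculations imply) and uses the sum of squared differences as the distance; the function and variable names are mine.

```python
def squared_distance(a, b):
    """Sum of squared coordinate differences between two dummy vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Dummy values on (D_red, D_yellow, D_green) for records #1, #2, #3.
full = {1: (1, 0, 0), 2: (0, 1, 0), 3: (0, 0, 1)}
# Same records after dropping D_green.
reduced = {k: v[:2] for k, v in full.items()}

pairs = [(1, 2), (1, 3), (2, 3)]
full_d = {p: squared_distance(full[p[0]], full[p[1]]) for p in pairs}
reduced_d = {p: squared_distance(reduced[p[0]], reduced[p[1]]) for p in pairs}

print("3 dummies:", full_d)
print("2 dummies:", reduced_d)
```

With all three dummies every pair is equally far apart. After dropping D_green, the green record sits closer to the other two than they are to each other, although nothing about the colors justifies that.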

Case 2: m=2 (use single dummy)

The above problem doesn't happen with m=2. Suppose we have only {red, green}, and use a single dummy. The distance between a pair of records will be 0 if the records are the same color, or 1 if they are different.
Why not use 2 dummies? If we use two dummies, we are doubling the weight of this variable but not adding any information. For example, comparing the red and green records using D_red and D_green would give Distance(#1, #3) = 1 + 1 = 2.

So we end up with distances of 0 or 2 instead of distances of 0 or 1.
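
The same check for m=2, as a rough sketch. The encodings are written out by hand and the names are mine; the distance is again the sum of squared differences.

```python
def squared_distance(a, b):
    """Sum of squared coordinate differences between two dummy vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# One dummy (D_red) versus two dummies (D_red, D_green) for {red, green}.
single = {"red": (1,), "green": (0,)}
double = {"red": (1, 0), "green": (0, 1)}

d_single = squared_distance(single["red"], single["green"])
d_double = squared_distance(double["red"], double["green"])
d_same = squared_distance(double["red"], double["red"])

print("one dummy, red vs green:", d_single)
print("two dummies, red vs green:", d_double)
print("two dummies, red vs red:", d_same)
```

The second dummy only rescales the color distance. It does not separate any records that the single dummy left together, so when color is combined with other predictors it simply counts twice.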

Bottom line 

In data mining methods other than regression models (e.g., k-NN, trees, k-means clustering), we use m dummies for a categorical variable with m categories - this is called one-hot encoding. But if m=2 we use a single dummy.

Thursday, December 22, 2016

Key challenges in online experiments: where are the statisticians?

Randomized experiments (or randomized controlled trials, RCT) are a powerful tool for testing causal relationships. Their main principle is random assignment, where subjects or items are assigned randomly to one of the experimental conditions. A classic example is a clinical trial with one or more treatment groups and a no-treatment (control) group, where individuals are assigned at random to one of these groups.

Story 1: (Internet) experiments in industry 

Internet experiments have now become a major activity in giant companies such as Amazon, Google, and Microsoft, in smaller web-based companies, and among academic researchers in management and the social sciences. The buzzword "A/B Testing" refers to the most common and simplest design which includes two groups (A and B), where subjects -- typically users -- are assigned at random to group A or B, and an effect of interest is measured. A/B tests are used for testing anything from the effect of a new website feature on engagement to the effect of a new language translation algorithm on user satisfaction. Companies run lots of experiments all the time. With a large and active user-base, you can run an internet experiment very quickly and quite cheaply. Academic researchers are now also beginning to use large scale randomized experiments to test scientific hypotheses about social and human behavior (as we did in One-Way Mirrors in Online Dating: A Randomized Field Experiment).
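
To make the random assignment step concrete, here is a minimal Python sketch of an A/B split. The function name, the user ids, and the fixed seed are assumptions for illustration; production systems typically use more elaborate schemes.

```python
import random
from collections import Counter

def assign_groups(user_ids, seed=0):
    """Assign each user to group 'A' or 'B' with equal probability."""
    rng = random.Random(seed)  # seeded so the assignment can be reproduced
    return {uid: rng.choice("AB") for uid in user_ids}

# Hypothetical user ids 0..999.
groups = assign_groups(range(1000), seed=42)
print(Counter(groups.values()))
```

Because assignment does not depend on anything about the user, the two groups should be comparable on average, which is what lets a difference in the measured outcome be attributed to the treatment.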

Based on our experience in this domain and on what I learned from colleagues and past students working in such environments, there are multiple critical issues challenging the ability to draw valid conclusions from internet experiments. Here are three:
  1. Contaminated data: Companies constantly conduct online experiments introducing interventions of different types (such as running various promotions, changing website features, and switching underlying technologies). The result is that we never have "clean data" to run an experiment, and we don't know how the data are dirty. The data are always somewhat contaminated by other experiments that are taking place in parallel, and in many cases we do not even know which or when such experiments have taken place.
  2. Spill-over effects: in a randomized experiment we assume that each observation/user experiences only one treatment (or control). However, in experiments that involve an intervention such as knowledge sharing (eg, the treatment group receives information about a new service while the control group does not), the treatment might "spill over" to control group members through social networks, online forums, and other information-sharing platforms that are now common. For example, many researchers use Amazon Mechanical Turk to conduct experiments, where, as DynamoWiki describes, "workers" (the experiment subjects) share information, establish norms, and build community through platforms like CloudMeBaby, MTurk Crowd, mTurk Forum, mTurk Grind, Reddit's /r/mturk and /r/HITsWorthTurkingFor, Turker Nation, and Turkopticon. This means that the control group can be "contaminated" by the treatment effect.
  3. Gift effect: Treatments that benefit the treated subjects in some way (such as a special promotion or advanced feature) can confuse the effect of the treatment with the effect of receiving a special treatment. In other words, the difference between the outcome for the treatment and control groups might not be due to the treatment per se but rather due to the "special attention" the treatment group received from the company or researcher.

Story 2: Statistical discipline of Experimental Design 

Design of Experiments (DOE or DOX) is a subfield of statistics that is focused on creating the most efficient designs for an experiment, and the most appropriate analysis. Efficient here refers to a context where each run is very expensive or resource consuming in some way. Hence, the goal in DOE methodology is to answer the causal questions of interest with the smallest number of runs (observations). The statistical methodological development of DOE was motivated by agricultural applications in the early 20th century, led by the famous Ronald Fisher. DOE methodology gained further momentum in the context of industrial experiments (today it is typically considered part of "industrial statistics"). Currently, the most active research area within DOE is "computer experiments", which focuses on constructing simulations to emulate a physical system for cases where experimentation is impossible, impractical, or terribly expensive (e.g., experimenting on climate).

Do the two stories converge? 

With the current heavy use of online experiments by companies, one would have thought the DOE discipline would flourish: new research problems, plenty of demand from industry for collaboration, troves of new students. Yet, I hear that the number of DOE researchers in US universities is shrinking. Most "business analytics" or "data science" programs do not have a dedicated course on experimental design (with focus on internet experiments). Recent DOE papers in top industrial statistics journals (e.g., Technometrics) and DOE conferences indicate the burning topics from Story 1 are missing. Academic DOE research by statisticians seems to continue focusing on the scarce data context and on experiments on "things" rather than human subjects. The Wikipedia page on DOE also tells a similar story. I tried to make these points and others in my recent paper Analyzing Behavioral Big Data: Methodological, practical, ethical, and moral issues. Hopefully the paper and this post will encourage DOE researchers to address such burning issues and take the driver's seat in creating designs and analyses for researchers and companies conducting "big experiments".

Monday, October 24, 2016

Experimenting with quantified self: two months hooked up to a fitness band

It's one thing to collect and analyze behavioral big data (BBD) and another to understand what it means to be the subject of that data. To really understand. Yes, we're all aware that our social network accounts and IoT devices share our private information with large and small companies and other organizations. And although we complain about our privacy, we are forgiving about sharing it, most likely because we really appreciate the benefits.

So, I decided to check out my data sharing in a way that I cannot ignore: I started wearing a fitness band. I bought one of the simplest bands available - the Mi Band Pulse from Xiaomi - this is not a smart watch but "simply" a fitness band that counts steps, measures heart rate and sleep. Its low price means that this is likely to spread to many users. Hence, BBD at speed on a diverse population.

I had two questions in mind:
  1. Can I limit the usage so that I only get the benefits I really need/want while avoiding generating and/or sharing data I want to keep private?
  2. What do I not know about the data generated and shared?
I tried to be as "private" as possible, never turning on location, turning on bluetooth for synching my data with my phone only once a day for a few minutes, turning on notifications only for incoming phone calls, not turning on notifications for anything else (no SMS, no third party apps notifications), only using the Mi Fit app (not linking to other apps like Google Fit), and only "befriending" one family member who bought the same product.

While my experiment is not as daring as a physician testing his/her developed drugs on themselves, I did learn a few important lessons. Here is what I discovered:
  • Despite intentionally turning off my phone bluetooth right after synching the band's data every day (which means the device should not be able to connect to my phone, or the cloud), the device and my phone did continue having secret "conversations". I discovered this when my band vibrated on incoming phone calls when bluetooth was turned off. This happened on multiple occasions. I am not sure what part of the technology chain to blame - my Samsung phone? the Android operating system? the Mi Band? It doesn't really matter. The point is: I thought the band was disconnected from my phone, but it was not.
  • Every time I opened the Mi Fit app on my phone, it requested access to location, which I had to decline every time. Quite irritating, and I am guessing many users just turn it on to avoid the irritation.
  • My family member who was my "friend" through the Mi Fit app could see all my data: number of steps, heart rate measurements (whenever I asked the band to measure heart rate), weight (which I entered manually), and even... my sleep hours. Wait: my family member now knew exactly what time I went to sleep and what time I got up every morning. That's something I didn't consider when we decided to become "friends". Plus, Xiaomi has the information about our "relationship" including "pokes" that we could send each other (causing a vibration).
  • The sleep counter claims to measure "deep sleep". It was showing that even if I slept 8 hours and felt wonderfully refreshed, my deep sleep was only 1 hour. At first I was alarmed. With a bit of search I learned that it doesn't mean much. But the alarming effect made me realize that even this simple device is not for everyone. Definitely not for over-worried people.
After 2 months of wearing the band 24/7, what have I learned about myself that I didn't already know? That my "deep sleep" is only one hour per night (whatever that means). And that the feeling of the vibrating band when conquering the 8000 steps/day limit feels good. Addictively good, despite being meaningless. All the other data was really no news (I walk a lot on days when I teach, my weight is quite stable, my heart rate is reasonable, and I go to sleep too late).

To answer my two questions:
  1. Can I limit the usage to only gain my required benefits? Not really. Some is beyond my control, and in some cases I am not aware of what I am sharing.
  2. What do I not know? Quite a bit. I don't know what the device is really measuring (e.g., "deep sleep"), how it is sharing it (when my bluetooth is off), and what the company does with it.
Is my "behavioral" data useful? To me probably not. To Xiaomi probably yes. One small grain of millet (小米 = XiaoMi) is not much, but a whole sack is valuable.

So what now? I move to Mi Band 2. Its main new feature is a nudge every hour when you're sitting for too long.

Tuesday, April 26, 2016

Statistical software should remove *** notation for statistical significance

Now that the emotional storm following the American Statistical Association's statement on p-values is slowing down (is it? was there even a storm outside of the statistics area?), let's think about a practical issue. One that greatly influences data analysis in most fields: statistical software. Statistical software influences which methods are used and how they are reported. Software companies thus affect entire disciplines and how they progress and communicate.
Star notation for p-value thresholds in statistical software

No matter whether your field uses SAS, SPSS (now IBM), STATA, or another statistical software package, you're likely to have seen the star notation (this isn't about hotel ratings). One star (*) means p-value<0.05, two stars (**) mean p-value<0.01, and three stars (***) mean p-value<0.001.

According to the ASA statement, p-values are not the source of the problem, but rather their discretization. The ASA recommends:

"P-values, when used, would be reported as values, rather than inequalities (p = .0168, rather than p < 0.05). Indeed, we envision there being better recognition that measurement of the strength of evidence really is continuous, rather than discrete."
This statement is a strong signal to the statistical software companies: continuing to use the star notation, even if your users are addicted to it, is in violation of the ASA recommendation. Will we be seeing any change soon?
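
To see what the discretization does, here is a small Python sketch of the star rule described above. The function name and the three example p-values are mine (0.0168 is the value used in the ASA quote); this is not any package's actual code.

```python
def stars(p):
    """Map a p-value to the usual star notation."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

for p in (0.0168, 0.049, 0.051):
    print(f"p = {p}: '{stars(p)}'")
```

The values 0.0168 and 0.049 end up with the same single star, while 0.049 and 0.051, which are nearly identical, land on opposite sides of the threshold. Reporting the value itself avoids both distortions.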

Thursday, March 24, 2016

A non-traditional definition of Big Data: Big is Relative

I've noticed that in almost every talk or discussion that involves the term Big Data, one of the first slides by the presenter or the first questions to be asked by the audience is "what is Big Data?" The typical answer has to do with some digits, many V's, terms that end with "bytes", or statements about software or hardware capacity.

I beg to differ.

"Big" is relative. It is relative to a certain field, and specifically to the practices in the field. We therefore must consider the benchmark of a specific field to determine if today's data are "Big". My definition of Big Data is therefore data that require a field to change its practices of data processing and analysis.

On the one extreme, consider weather forecasting, where data collection, huge computing power, and algorithms for analyzing huge amounts of data have been around for a long time. So is today's climatology data "Big" for the field of weather forecasting? Probably not, unless you start considering new types of data that the "old" methods cannot process or analyze.

Another example is the field of genetics, where researchers have been working with and analyzing large-scale datasets (notably from the Human Genome Project) for some time. The "Big Data" in this field is about linking different databases and integrating domain knowledge with the patterns found in the data ("As big-data researchers churn through large tumour databases looking for patterns of mutations, they are adding new categories of breast cancer.")

On the other extreme, consider studies in the social sciences, in fields such as political science or psychology that have traditionally relied on 3-digit sample sizes (if you were lucky). In these fields, a sample of 100,000 people is Big Data, because it challenges the methodologies used by researchers in the field.  Here are some of the challenges that arise:

  • Old methods break down: the common method of statistical significance tests for testing theory no longer works, as p-values will tend to be tiny irrespective of practical significance (one more reason to carefully consider the recent statement by the American Statistical Association about the danger of using the "p-value < 0.05" rule).
  • Technology challenge: the statistical software and hardware used by many social science researchers might not be able to handle these new data sizes. Simple operations such as visualizing 100,000 observations in a scatter plot require new practices and software (such as state-of-the-art interactive software packages).
  • Social science researchers need to learn how to ask more nuanced questions, now that richer data are available to them.
  • Social scientists are not trained in data mining, yet the new sizes of datasets can allow them to discover patterns that are not hypothesized by theory.
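
The first challenge in the list above is easy to see numerically. The sketch below computes a two-sided p-value from a two-sample z-test for the same tiny difference in means at two sample sizes. The effect size (0.02 standard deviations), the known standard deviation, and the group sizes are all made-up assumptions for illustration.

```python
import math

def two_sided_p(diff, sd, n_per_group):
    """Two-sided p-value of a two-sample z-test with known, equal sd."""
    se = sd * math.sqrt(2.0 / n_per_group)  # standard error of the difference
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # P(|Z| > |z|) for standard normal

tiny_effect = 0.02  # hypothetical difference in means, in sd units
for n in (100, 100_000):
    print(f"n per group = {n}: p = {two_sided_p(tiny_effect, 1.0, n):.2g}")
```

The effect is identical in both runs and practically negligible, yet with 100,000 observations per group it comes out as highly "significant". Only the sample size changed.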
In terms of "variety" of data types, Big Data is again area-dependent. While text and network data might be new to engineering fields, social scientists have long had experience with text data (qualitative researchers have analyzed interviews, videos, etc. for a long time) and with social network data (the origins of many of the metrics used today are in sociology).

In short, what is Big for one field can be considered small for another field. Big Data is field-dependent, and should be based on the "delta" (the difference) between previous data analysis practices and ones called for by today's data.

Monday, December 07, 2015

Predictive analytics in the long term

Ten years ago, micro-level prediction the way we know it today was nearly absent in companies. MBAs learned about data analysis mostly in a required statistics course, which covered mostly statistical inference and descriptive modeling. At the time, I myself was learning my way into the predictive world, and designed the first Data Mining course at University of Maryland's Smith School of Business (which is still running successfully today!). When I realized the gap, I started giving talks about the benefits of predictive analytics and its uses. And I've designed and taught a bunch of predictive analytics courses/programs around the world (USA, India, Taiwan) and online (Statistics.com). I should have been delighted at the sight of predictive analytics being so pervasively used in industry just ten years later. But the truth is: I am alarmed.

A recent Harvard Business Review article Don't Let Big Data Bury Your Brand touches on one aspect of predictive analytics usage to be alarmed about: companies do not realize that machine-learning-based predictive analytics can be excellent for short-term prediction, but poor in the long-term. The HBR article talks about the scenario of a CMO torn between the CEO's pressure to push prediction-based promotions (based on the IT department's data analysts), and his/her long-term brand-building efforts:
Advanced marketing analytics and big data make [balancing short-term revenue pursuit and long-term brand building] much harder today. If it was difficult before to defend branding investments with indefinite and distant payoffs, it is doubly so now that near-term sales can be so precisely engineered. Analytics allows a seeming omniscience about what promotional offers customers will find appealing. Big data allows impressive amounts of information to be obtained about the buying patterns and transaction histories of identifiable customers. Given marketing dollars and the discretion to invest them in either direction, the temptation to keep cash registers ringing is nearly irresistible. 
There are two reasons for the weakness of prediction in the long term: First, predictive analytics learn from the past to predict the future. In a dynamic setting where the future is very different from the past, predictions will obviously fail. Second, predictive analytics rely on correlations and associations between the inputs and the to-be-predicted output, not on causal relationships. While correlations can work well in the short term, they are much more sensitive in the long term.

Relying on correlations is not a bad thing, even though the typical statistician will give you the derogatory look of "correlation is not causation". Correlations are very useful for short-term prediction. They are a fast and useful proxy for assessing the similarity of things, when all we care about is whether they are similar or not. Predictive analytics tells us what to do. But it doesn't tell us why. And in the long term, we often need to know why in order to devise proper predictions, scenarios, and policies.

The danger is then using predictive analytics for long-term prediction or planning. It's a good tool, but it has its limits. Prediction becomes much more valuable when it is combined with explanation. The good news is that establishing causality is also possible with Big Data: you run experiments (the now-popular A/B testing is a simple experiment), or you rely on other causal expert knowledge. There are even methods that use Big Data to quantify causal relationships from observational data, but they are trickier and more commonly used in academia than in practice (that will come!).

Bottom line: we need a combination of causal modeling and predictive modeling in order to make use of data for short-term and long-term actions and planning.  The predictive toolkit can help discover correlations; we can then use experiments (or surveys) to figure out why. And then improve our long-term predictions. It's a cycle.

Wednesday, August 19, 2015

Categorical predictors: how many dummies to use in regression vs. k-nearest neighbors

Recently I've had discussions with several instructors of data mining courses about a fact that is often left out of many books, but is quite important: different treatment of dummy variables in different data mining methods.

Statistics courses that cover linear or logistic regression teach us to be careful when including a categorical predictor variable in our model. Suppose that we have a categorical variable with m categories (e.g., m countries). First, we must factor it into m binary variables called dummy variables, D1, D2,..., Dm (e.g., D1=1 if Country=Japan and 0 otherwise; D2=1 if Country=USA and 0 otherwise, etc.). Then, we include m-1 of the dummy variables in the regression model. The major point is to exclude one of the m dummy variables to avoid redundancy. The excluded dummy's category is called the "reference category". Mathematically, it does not matter which dummy you exclude, although the resulting coefficients will be interpreted relative to the reference category, so if interpretation is important it's useful to choose the reference category as the one we most want to compare against.

In linear and logistic regression models, including all m variables will lead to perfect multicollinearity, which will typically cause failure of the estimation algorithm. Smarter software will identify the problem and drop one of the dummies for you. That is why every statistics book or course on regression will emphasize the need to drop one of the dummy variables.
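
The multicollinearity can be seen directly in a toy design matrix. In the sketch below the three countries and five records are made up (the third country, India, is my assumption), and matrix_rank is a small helper written for this illustration, not a library function.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank via Gauss-Jordan elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below position `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

categories = ["Japan", "USA", "India"]                # hypothetical m = 3 countries
countries = ["Japan", "USA", "India", "USA", "Japan"]  # made-up records

# Columns: intercept, D_Japan, D_USA, D_India.
X_all = [[1] + [int(c == k) for k in categories] for c in countries]
# Same matrix with the last dummy dropped (India becomes the reference category).
X_drop = [row[:-1] for row in X_all]

print("all m dummies:", len(X_all[0]), "columns, rank", matrix_rank(X_all))
print("m-1 dummies:  ", len(X_drop[0]), "columns, rank", matrix_rank(X_drop))
```

With all m dummies the intercept column equals the sum of the dummy columns, so the matrix has 4 columns but rank 3 and the coefficients are not uniquely determined. Dropping one dummy leaves 3 columns of rank 3.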

Now comes the surprising part: when using categorical predictors in machine learning algorithms such as k-nearest neighbors (kNN) or classification and regression trees, we keep all m dummy variables. The reason is that in such algorithms we do not create linear combinations of all predictors. A tree, for instance, will choose a subset of the predictors. If we leave out one dummy, then if that category differs from the other categories in terms of the output of interest, the tree will not be able to detect it! Similarly, dropping a dummy in kNN would not incorporate the effect of belonging to that category into the distance used.

The only case where dummy variable inclusion is treated equally across methods is for a two-category predictor, such as Gender. In that case a single dummy variable will suffice in regression, kNN, CART, or any other data mining method.