Monday, December 07, 2015

Predictive analytics in the long term

Ten years ago, micro-level prediction the way we know it today, was nearly absent in companies. MBAs learned about data analysis mostly in a requires statistics course, which covered mostly statistical inference and descriptive modeling. At the time, I myself was learning my way into the predictive world, and designed the first Data Mining course at University of Maryland's Smith School of Business (which is running successfully until today!). When I realized the gap, I started giving talks about the benefits of predictive analytics and its uses. And I've designed and taught a bunch of predictive analytics courses/programs around the world (USA, India, Taiwan) and online ( I should have been delighted at the sight of predictive analytics being so pervasively used in industry just ten years later. But the truth is: I am alarmed.

A recent Harvard Business Review article Don't Let Big Data Bury Your Brand touches on one aspect of predictive analytics usage to be alarmed about: companies do not realize that machine-learning-based predictive analytics can be excellent for short-term prediction, but poor in the long-term. The HBR article talks about the scenario of a CMO torn between the CEO's pressure to push prediction-based promotions (based on the IT department's data analysts), and his/her long-term brand-building efforts:
Advanced marketing analytics and big data make [balancing short-term revenue pursuit and long-term brand building] much harder today. If it was difficult before to defend branding investments with indefinite and distant payoffs, it is doubly so now that near-term sales can be so precisely engineered. Analytics allows a seeming omniscience about what promotional offers customers will find appealing. Big data allows impressive amounts of information to be obtained about the buying patterns and transaction histories of identifiable customers. Given marketing dollars and the discretion to invest them in either direction, the temptation to keep cash registers ringing is nearly irresistible. 
There are two reasons for the weakness of prediction in the long term: First, predictive analytics learn from the past to predict the future. In a dynamic setting where the future is very different from the past, predictions will obviously fail. Second, predictive analytics rely on correlations and associations between the inputs and the to-be-predicted output, not on causal relationships. While correlations can work well in the short term, they are much more sensitive in the long term.

Relying on correlations is not a bad thing, even though the typical statistician will give you the derogative look of "correlation is not causation". Correlations a very useful for short term prediction. They are a fast and useful proxy for assessing the similarity of things, when all we care about is whether they are similar or not. Predictive analytics tells us what to do. But they don't tell us why. And in the long term, we often need to know why in order to devise proper predictions, scenarios, and policies.

The danger is then using predictive analytics for long-term prediction or planning. It's a good tool, but it has its limits. Prediction becomes much more valuable when it is combined with explanation. The good news is that establishing causality is also possible with Big Data: you run experiments (the now-popular A/B testing is a simple experiment), or you rely on other causal expert knowledge. There are even methods that use Big Data to quantify causal relationships from observational data, but they are trickier and more commonly used in academia than in practice (that will come!).

Bottom line: we need a combination of causal modeling and predictive modeling in order to make use of data for short-term and long-term actions and planning.  The predictive toolkit can help discover correlations; we can then use experiments (or surveys) to figure out why. And then improve our long-term predictions. It's a cycle.

Wednesday, August 19, 2015

Categorical predictors: how many dummies to use in regression vs. k-nearest neighbors

Recently I've had discussions with several instructors of data mining courses about a fact that is often left out of many books, but is quite important: different treatment of dummy variables in different data mining methods.

Statistics courses that cover linear or logistic regression teach us to be careful when including a categorical predictor variable in our model. Suppose that we have a categorical variable with m categories (e.g., m countries). First, we must factor it into m binary variables called dummy variables, D1, D2,..., Dm (e.g., D1=1 if Country=Japan and 0 otherwise; D2=1 if Country=USA and 0 otherwise, etc.) Then, we include m-1 of the dummy variables in the regression model. The major point is to exclude one of the m dummy variables to avoid redundancy. The excluded dummy's category is called the "reference category". Mathematically, it does not matter which dummy you exclude, although the resulting coefficients will be interpreted relative to the reference category, so if interpretation is important it's useful to choose the reference category as the one we most want to compare against.

In linear and logistic regression models, including all m variables will lead to perfect multicollinearity, which will typically cause failure of the estimation algorithm. Smarter software will identify the problem and drop one of the dummies for you. That is why every statistics book or course on regression will emphasize the need to drop one of the dummy variables.

Now comes the surprising part: when using categorical predictors in machine learning algorithms such as k-nearest neighbors (kNN) or classification and regression trees, we keep all m dummy variables. The reason is that in such algorithms we do not create linear combinations of all predictors. A tree, for instance, will choose a subset of the predictors. If we leave out one dummy, then if that category differs from the other categories in terms of the output of interest, the tree will not be able to detect it! Similarly, dropping a dummy in kNN would not incorporate the effect of belonging to that category into the distance used.

The only case where dummy variable inclusion is treated equally across methods is for a two-category predictor, such as Gender. In that case a single dummy variable will suffice in regression, kNN, CART, or any other data mining method.

Monday, March 02, 2015

Psychology journal bans statistical inference; knocks down server

In its recent editorial, the journal Basic and Applied Social Psychology announced that it will no longer accept papers that use classical statistical inference. No more p-values, t-tests, or even... confidence intervals! 
"prior to publication, authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about ‘‘significant’’ differences or lack thereof, and so on)... confidence intervals also are banned from BASP"
Many statisticians would agree that it is high time to move on from p-values and statistical inference to practical significance, estimation, more elaborate non-parametric modeling, and resampling for avoiding assumption-heavy models. This is especially so now, when datasets are becoming larger and technology is able to measure more minute effects. 

In our 2013 paper "Too Big To Fail: Large Samples and the p-value Problem" we raise the serious issue p-value-based decision making when using very large samples. Many have asked us for solutions that scale up p-values, but we haven't come across one that really works. Our focus was on detecting when you're "too large" and we emphasized the importance of focusing on effect magnitude, and precision (please do report standard errors!) 

Machine learners would probably advocate finally moving to predictive modeling and evaluation. Predictive power is straightforward to measure, although it isn't always what social science researchers are looking for.

But wait. What this editorial dictates is only half a revolution: it says what it will ban. But it does not offer a cohesive alternative beyond simple summary statistics. Focusing on effect magnitude is great for making results matter, but without reporting standard errors or confidence intervals, we don't know anything about the uncertainty of the effect. Abandoning any metric that relies on "had the experiment been replicated" is dangerous and misleading. First, this is more a philosophical assumption than an actual re-experimentation. Second, to test whether effects found in a sample generalize to a population of interest, we need the ability to replicate the results. Standard errors give some indication of how replicable the results are, under the same conditions.

Controversial editorial leads to heavy traffic on journal server

BASP's revolutionary decision has been gaining attention outside of psychology (a great tactic to promote a journal!) so much so that at times it is difficult to reach the controversial editorial. Some statisticians have blogged about this decision, others are tweeting. This is a great way to open a discussion about empirical analysis in the social sciences. However, we need to come up with alternatives that focus on uncertainty and the ultimate goal of generalization.

Saturday, February 07, 2015

Teaching spaces: "Analytics in a Studio"

My first semester at NTHU has been a great learning experience. I introduced and taught two new courses in our new Business Analytics concentration (data mining and forecasting). Both courses met once a week for a 3-hour session for a full semester (18 weeks). Although I've taught these courses in different forms, in different countries, and to different audiences, I had a special discovery this time. I discovered the critical role of the learning space on the quality of teaching and learning. Specifically for a topic that combines technical, creativity and communication skills.

"Case study" classroom
In my many years of experience as a student and later as a professor at multiple universities, I've experienced two types of spaces: a lecture hall and a "case study" classroom. While the latter is more conducive to in-class discussions, both spaces put the instructor (and his/her slides) in the front, separated from most the students, and place the students in rows. In both cases the instructor is typically standing or moving around, while the students are immobile. Not being exposed to alternatives, I am ashamed to say that I never doubted this arrangement. Until this semester.

Like all discoveries, it started from a challenge: the classroom allocated for my courses was a wide room with two long rows, hardly any space for the instructor and no visibility of the slides for most of the students on the sides. My courses had 20-30 students each. My first attempt was to rearrange the tables to create a U-shape, so that students could see each other and the slides. In hindsight, I was trying to create more of a "case study" environment. After one session I realized it didn't work. The U was too long and narrow and there was a feeling of suffocation. And stagnancy. One of my classes was transferred to a case-type classroom. I was relieved. But for the other class there was no such classroom available. I examined a few different classrooms, but they were all lecture halls suitable for larger audiences.

Teams tackle a challenge using a whiteboard
And then, I discovered "the studio". Intended for design workshops, this was a room with no tables or chairs, with walls that are whiteboards plus double-sided whiteboards on wheels. In a corner was a stack of hard sponge blocks and a few low foldable square tables. There's a projector and a screen. I decided to take the plunge with the data mining course, since it is designed as a blended course where class time is devoted to discussions and hands-on assignments and experiences. [Before coming to class, students read and watch videos, take a short quiz, and contribute to an online discussion].

Here is how we used the space: At least half of each session engaged teams of students in a problem/question that they needed to tackle using a whiteboard. The challenges I came up with emerged from the interaction with the students - from the online discussion board, from discussing the online quizzes, and from confusion/difficulties in the pre-designed in-class assignments. After each team worked on their board, we all moved from board to board, the team explained their approach, and I highlighted something about each solution/attempt. This provided great learning for everyone, including myself, since different teams usually approached the problems in different ways. And they encountered different problems or insights.
Students give feedback on other teams' proposals

The setup was also conducive for team project feedback. After each team presented their proposal, the other teams provided them feedback by writing on their "wall" (whiteboard). This personal touch - rather than an email or discussion board - seems to makes a difference in how the feedback is given and perceived.

Smartphones were often used to take photos of different boards - their own and well as others' boards.

Student demos software to others
During periods of the sessions where students needed to work on laptops, many chose to spread out on the floor - a more natural posture for many folks than sitting at a desk. Some used the sponges to place their laptops. A few used a square table where 4 people faced each other.

We also used the space to start class with a little stretching and yoga! The students liked the space. So did two colleagues (Prof. Rob Hyndman and Prof. Joao Moreira) who teach analytics courses at their universities and visited my courses. Some students complained at first about sitting on the hard floor, so I tried to make sure they don't sit for long, or at least not passively. My own "old school" bias made me forget how it feels to be passively sitting.

Visitor Prof. Moreira experiences the studio
Although I could see the incredible advantages during the semester, I waited till its end to write this post. My perspective now is that teaching analytics in a studio is revolutionary. The space supports deeper learning, beneficial collaboration both within groups and across groups, better personalization of the teaching level by stronger communication between the instructor and students, and overall a high-energy and positive experience for everyone. One reason that makes "analytics in a studio" so powerful is the creativity aspect in data analytics. You use statistical and data mining foundations, but the actual problem-solving requires creativity and out-of-the-box thought.

From my experience, the requirements for "analytics in a studio" to work are:
  1. Students must come prepared to class with the needed technical basics (e.g., via reading/video watching/etc.) 
  2. The instructor must be flexible in terms of the specifics taught. I came into class focused on 2-3 main points students needed to learn, I had in-class assignments, and designed teams-on-whiteboards challenges on-the-fly. 
  3. The instructor is no longer physically in the center, but s/he must be an effective integrator, challenger, and guide of the directions taken. This allows students to unleash their abilities, but in a constructive way. It also helps avoid a feeling of "what did we learn?"
How does "analytics in a studio" scale to larger audiences? I am not sure. While class sizes of many Analytics programs are growing to meet the demand, top programs and educators should carefully consider the benefits of smaller class sizes in terms of learning and learning experience. And they should carefully choose their spaces.