Tuesday, April 26, 2016

Statistical software should remove *** notation for statistical significance

Now that the emotional storm following the American Statistical Association's statement on p-values is subsiding (is it? was there even a storm outside the statistics community?), let's consider a practical issue, one that greatly influences data analysis in most fields: statistical software. Statistical software influences which methods are used and how they are reported. Software companies thus affect entire disciplines and how they progress and communicate.
Star notation for p-value thresholds in statistical software

No matter whether your field uses SAS, SPSS (now IBM), Stata, or another statistical software package, you're likely to have seen the star notation (this isn't about hotel ratings). One star (*) means p-value < 0.05, two stars (**) mean p-value < 0.01, and three stars (***) mean p-value < 0.001.

According to the ASA statement, p-values are not the source of the problem, but rather their discretization. The ASA recommends:

"P-values, when used, would be reported as values, rather than inequalities (p = .0168, rather than p < 0.05). Indeed, we envision there being better recognition that measurement of the strength of evidence really is continuous, rather than discrete."
This statement is a strong signal to the statistical software companies: continuing to use the star notation, even if your users are addicted to it, violates the ASA recommendation. Will we be seeing any change soon?
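To make the contrast concrete, here is a minimal sketch in Python (using scipy; the two groups are made-up numbers) of the difference between star notation and the exact-value reporting the ASA recommends:

```python
from scipy import stats

# Hypothetical measurements for two groups (made-up numbers)
group_a = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7]
group_b = [5.9, 5.4, 6.1, 5.7, 5.5, 6.0, 5.8, 5.6]

t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Star notation discretizes the evidence into coarse bins...
stars = ("***" if p_value < 0.001 else
         "**" if p_value < 0.01 else
         "*" if p_value < 0.05 else "")
print(f"discretized: {stars or 'n.s.'}")

# ...whereas the ASA recommends reporting the value itself
print(f"continuous:  p = {p_value:.4f}")
```

The second print line is all the ASA recommendation asks for; the first is what most software prints by default.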

Thursday, March 24, 2016

A non-traditional definition of Big Data: Big is Relative

I've noticed that in almost every talk or discussion that involves the term Big Data, one of the first slides by the presenter or the first questions to be asked by the audience is "what is Big Data?" The typical answer has to do with some digits, many V's, terms that end with "bytes", or statements about software or hardware capacity.

I beg to differ.

"Big" is relative. It is relative to a certain field, and specifically to the practices in the field. We therefore must consider the benchmark of a specific field to determine if today's data are "Big". My definition of Big Data is therefore data that require a field to change its practices of data processing and analysis.

On the one extreme, consider weather forecasting, where data collection, huge computing power, and algorithms for analyzing huge amounts of data have been around for a long time. So is today's climatology data "Big" for the field of weather forecasting? Probably not, unless you start considering new types of data that the "old" methods cannot process or analyze.

Another example is the field of genetics, where researchers have been working with and analyzing large-scale datasets (notably from the Human Genome Project) for some time. The "Big Data" challenge in this field is about linking different databases and integrating domain knowledge with the patterns found in the data ("As big-data researchers churn through large tumour databases looking for patterns of mutations, they are adding new categories of breast cancer.")

On the other extreme, consider studies in the social sciences, in fields such as political science or psychology that have traditionally relied on 3-digit sample sizes (if you were lucky). In these fields, a sample of 100,000 people is Big Data, because it challenges the methodologies used by researchers in the field.  Here are some of the challenges that arise:

  • Old methods break down: the common method of statistical significance testing for testing theory no longer works, as p-values will tend to be tiny irrespective of practical significance (one more reason to carefully consider the recent statement by the American Statistical Association about the danger of using the "p-value < 0.05" rule).
  • Technology challenge: the statistical software and hardware used by many social science researchers might not be able to handle these new data sizes. Simple operations such as visualizing 100,000 observations in a scatter plot require new practices and software (such as state-of-the-art interactive software packages). 
  • Social science researchers need to learn how to ask more nuanced questions, now that richer data are available to them. 
  • Social scientists are not trained in data mining, yet the new sizes of datasets allow them to discover patterns that are not hypothesized by theory.
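The first bullet above can be demonstrated with a short simulation (a sketch in Python; the 0.02-standard-deviation "effect" is an arbitrary, practically negligible difference):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Two populations whose means differ by a practically negligible 0.02 SD
for n in (100, 10_000, 1_000_000):
    a = rng.normal(loc=0.00, scale=1, size=n)
    b = rng.normal(loc=0.02, scale=1, size=n)
    _, p = stats.ttest_ind(a, b)
    print(f"n = {n:>9,}: p = {p:.2g}")

# As n grows, p shrinks toward zero even though the effect stays trivial
```

With a sample of 100 the trivial difference usually looks "insignificant"; with a million observations it is virtually always flagged as highly "significant", although nothing about its practical magnitude has changed.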
In terms of "variety" of data types, Big Data is again area-dependent. While text and network data might be new to engineering fields, social scientists have long had experience with text data (qualitative researchers have long analyzed interviews, videos, etc.) and with social network data (the origins of many of the metrics used today are in sociology).

In short, what is Big for one field can be considered small for another field. Big Data is field-dependent, and should be based on the "delta" (the difference) between previous data analysis practices and ones called for by today's data.

Monday, December 07, 2015

Predictive analytics in the long term

Ten years ago, micro-level prediction as we know it today was nearly absent in companies. MBAs learned about data analysis mostly in a required statistics course, which covered mostly statistical inference and descriptive modeling. At the time, I myself was learning my way into the predictive world, and designed the first Data Mining course at University of Maryland's Smith School of Business (which is still running successfully today!). When I realized the gap, I started giving talks about the benefits of predictive analytics and its uses. And I've designed and taught a bunch of predictive analytics courses/programs around the world (USA, India, Taiwan) and online (Statistics.com). I should have been delighted at the sight of predictive analytics being so pervasively used in industry just ten years later. But the truth is: I am alarmed.

A recent Harvard Business Review article Don't Let Big Data Bury Your Brand touches on one aspect of predictive analytics usage to be alarmed about: companies do not realize that machine-learning-based predictive analytics can be excellent for short-term prediction, but poor in the long-term. The HBR article talks about the scenario of a CMO torn between the CEO's pressure to push prediction-based promotions (based on the IT department's data analysts), and his/her long-term brand-building efforts:
Advanced marketing analytics and big data make [balancing short-term revenue pursuit and long-term brand building] much harder today. If it was difficult before to defend branding investments with indefinite and distant payoffs, it is doubly so now that near-term sales can be so precisely engineered. Analytics allows a seeming omniscience about what promotional offers customers will find appealing. Big data allows impressive amounts of information to be obtained about the buying patterns and transaction histories of identifiable customers. Given marketing dollars and the discretion to invest them in either direction, the temptation to keep cash registers ringing is nearly irresistible. 
There are two reasons for the weakness of prediction in the long term. First, predictive analytics learns from the past to predict the future. In a dynamic setting where the future differs greatly from the past, predictions will obviously fail. Second, predictive analytics relies on correlations and associations between the inputs and the to-be-predicted output, not on causal relationships. While correlations can work well in the short term, they are far more fragile in the long term.

Relying on correlations is not a bad thing, even though the typical statistician will give you the derogatory look of "correlation is not causation". Correlations are very useful for short-term prediction. They are a fast and useful proxy for assessing the similarity of things, when all we care about is whether they are similar or not. Predictive analytics tells us what to do. But it doesn't tell us why. And in the long term, we often need to know why in order to devise proper predictions, scenarios, and policies.

The danger, then, is in using predictive analytics for long-term prediction or planning. It's a good tool, but it has its limits. Prediction becomes much more valuable when it is combined with explanation. The good news is that establishing causality is also possible with Big Data: you run experiments (the now-popular A/B testing is a simple experiment), or you rely on other causal expert knowledge. There are even methods that use Big Data to quantify causal relationships from observational data, but they are trickier and more commonly used in academia than in practice (that will come!).

Bottom line: we need a combination of causal modeling and predictive modeling in order to make use of data for short-term and long-term actions and planning.  The predictive toolkit can help discover correlations; we can then use experiments (or surveys) to figure out why. And then improve our long-term predictions. It's a cycle.

Wednesday, August 19, 2015

Categorical predictors: how many dummies to use in regression vs. k-nearest neighbors

Recently I've had discussions with several instructors of data mining courses about a fact that is often left out of many books, but is quite important: different treatment of dummy variables in different data mining methods.

From http://blog.excelmasterseries.com
Statistics courses that cover linear or logistic regression teach us to be careful when including a categorical predictor variable in our model. Suppose that we have a categorical variable with m categories (e.g., m countries). First, we must factor it into m binary variables called dummy variables, D1, D2,..., Dm (e.g., D1=1 if Country=Japan and 0 otherwise; D2=1 if Country=USA and 0 otherwise, etc.) Then, we include m-1 of the dummy variables in the regression model. The major point is to exclude one of the m dummy variables to avoid redundancy. The excluded dummy's category is called the "reference category". Mathematically, it does not matter which dummy you exclude, although the resulting coefficients will be interpreted relative to the reference category, so if interpretation is important it's useful to choose the reference category as the one we most want to compare against.
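A minimal sketch of this in pandas (the country values are hypothetical): `drop_first=True` creates m-1 dummies, with the dropped category acting as the reference.

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Japan", "USA", "France", "USA", "Japan"]})

# m = 3 categories -> m - 1 = 2 dummies; the dropped category
# (France, first alphabetically) becomes the reference category
regression_dummies = pd.get_dummies(df["Country"], drop_first=True)
print(regression_dummies.columns.tolist())  # ['Japan', 'USA']
```

A record with zeros in both remaining columns is then a France record, and the regression coefficients on the Japan and USA dummies are interpreted relative to France.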

In linear and logistic regression models, including all m variables will lead to perfect multicollinearity, which will typically cause failure of the estimation algorithm. Smarter software will identify the problem and drop one of the dummies for you. That is why every statistics book or course on regression will emphasize the need to drop one of the dummy variables.

Now comes the surprising part: when using categorical predictors in machine learning algorithms such as k-nearest neighbors (kNN) or classification and regression trees, we keep all m dummy variables. The reason is that in such algorithms we do not create linear combinations of all predictors. A tree, for instance, will choose a subset of the predictors. If we leave out one dummy, then if that category differs from the other categories in terms of the output of interest, the tree will not be able to detect it! Similarly, dropping a dummy in kNN would not incorporate the effect of belonging to that category into the distance used.
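A sketch of the contrast (pandas again, hypothetical data): with all m dummies, any two records from different categories differ in exactly two dummy columns, so the categorical variable contributes equally to a distance measure no matter which two categories are involved; with m-1 dummies, the reference category would contribute less.

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Japan", "USA", "France"]})

# For kNN and trees: keep all m dummies (no column dropped)
knn_dummies = pd.get_dummies(df["Country"])  # columns: France, Japan, USA

# Any two different categories differ in exactly two dummy columns:
print((knn_dummies.iloc[0] != knn_dummies.iloc[1]).sum())  # Japan vs USA: 2
print((knn_dummies.iloc[0] != knn_dummies.iloc[2]).sum())  # Japan vs France: 2

# Had we dropped the France dummy, Japan vs France would differ in only
# one column while Japan vs USA would differ in two, distorting the
# distances used by kNN
```

The same logic explains the tree case: a tree can only split on columns it is given, so the dropped category's effect would be undetectable.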

The only case where dummy variable inclusion is treated equally across methods is for a two-category predictor, such as Gender. In that case a single dummy variable will suffice in regression, kNN, CART, or any other data mining method.

Monday, March 02, 2015

Psychology journal bans statistical inference; knocks down server

In its recent editorial, the journal Basic and Applied Social Psychology announced that it will no longer accept papers that use classical statistical inference. No more p-values, t-tests, or even... confidence intervals! 
"prior to publication, authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about ‘‘significant’’ differences or lack thereof, and so on)... confidence intervals also are banned from BASP"
Many statisticians would agree that it is high time to move on from p-values and statistical inference to practical significance, estimation, more elaborate non-parametric modeling, and resampling for avoiding assumption-heavy models. This is especially so now, when datasets are becoming larger and technology is able to measure more minute effects. 

In our 2013 paper "Too Big To Fail: Large Samples and the p-value Problem" we raised the serious issue of p-value-based decision making when using very large samples. Many have asked us for solutions that scale up p-values, but we haven't come across one that really works. Our focus was on detecting when a sample is "too large", and we emphasized the importance of focusing on effect magnitude and precision (please do report standard errors!).

Machine learners would probably advocate finally moving to predictive modeling and evaluation. Predictive power is straightforward to measure, although it isn't always what social science researchers are looking for.

But wait. What this editorial dictates is only half a revolution: it says what it will ban. But it does not offer a cohesive alternative beyond simple summary statistics. Focusing on effect magnitude is great for making results matter, but without reporting standard errors or confidence intervals, we don't know anything about the uncertainty of the effect. Abandoning any metric that relies on "had the experiment been replicated" is dangerous and misleading. First, this is more a philosophical assumption than an actual re-experimentation. Second, to test whether effects found in a sample generalize to a population of interest, we need the ability to replicate the results. Standard errors give some indication of how replicable the results are, under the same conditions.

Controversial editorial leads to heavy traffic on journal server

BASP's revolutionary decision has been gaining attention outside of psychology (a great tactic to promote a journal!), so much so that at times it is difficult to reach the controversial editorial. Some statisticians have blogged about the decision, others are tweeting. This is a great way to open a discussion about empirical analysis in the social sciences. However, we need to come up with alternatives that focus on uncertainty and the ultimate goal of generalization.

Saturday, February 07, 2015

Teaching spaces: "Analytics in a Studio"

My first semester at NTHU has been a great learning experience. I introduced and taught two new courses in our new Business Analytics concentration (data mining and forecasting). Both courses met once a week for a 3-hour session for a full semester (18 weeks). Although I've taught these courses in different forms, in different countries, and to different audiences, I made a special discovery this time: the critical role of the learning space in the quality of teaching and learning, especially for a topic that combines technical, creative, and communication skills.

"Case study" classroom
In my many years of experience as a student and later as a professor at multiple universities, I've experienced two types of spaces: a lecture hall and a "case study" classroom. While the latter is more conducive to in-class discussions, both spaces put the instructor (and his/her slides) in the front, separated from most of the students, and place the students in rows. In both cases the instructor is typically standing or moving around, while the students are immobile. Not being exposed to alternatives, I am ashamed to say that I never doubted this arrangement. Until this semester.

Like all discoveries, it started from a challenge: the classroom allocated for my courses was a wide room with two long rows, hardly any space for the instructor and no visibility of the slides for most of the students on the sides. My courses had 20-30 students each. My first attempt was to rearrange the tables to create a U-shape, so that students could see each other and the slides. In hindsight, I was trying to create more of a "case study" environment. After one session I realized it didn't work. The U was too long and narrow and there was a feeling of suffocation. And stagnancy. One of my classes was transferred to a case-type classroom. I was relieved. But for the other class there was no such classroom available. I examined a few different classrooms, but they were all lecture halls suitable for larger audiences.

Teams tackle a challenge using a whiteboard
And then, I discovered "the studio". Intended for design workshops, this was a room with no tables or chairs, with walls that are whiteboards plus double-sided whiteboards on wheels. In a corner was a stack of hard sponge blocks and a few low foldable square tables. There's a projector and a screen. I decided to take the plunge with the data mining course, since it is designed as a blended course where class time is devoted to discussions and hands-on assignments and experiences. [Before coming to class, students read and watch videos, take a short quiz, and contribute to an online discussion].

Here is how we used the space: At least half of each session engaged teams of students in a problem/question that they needed to tackle using a whiteboard. The challenges I came up with emerged from the interaction with the students - from the online discussion board, from discussing the online quizzes, and from confusion/difficulties in the pre-designed in-class assignments. After each team worked on their board, we all moved from board to board, the team explained their approach, and I highlighted something about each solution/attempt. This provided great learning for everyone, including myself, since different teams usually approached the problems in different ways. And they encountered different problems or insights.
Students give feedback on other teams' proposals

The setup was also conducive to team project feedback. After each team presented their proposal, the other teams provided feedback by writing on the presenting team's "wall" (whiteboard). This personal touch - rather than an email or discussion board - seems to make a difference in how the feedback is given and perceived.

Smartphones were often used to take photos of different boards - their own as well as others'.

Student demos software to others
During periods of the sessions when students needed to work on laptops, many chose to spread out on the floor - a more natural posture for many folks than sitting at a desk. Some used the sponge blocks to prop up their laptops. A few used a square table where 4 people faced each other.

We also used the space to start class with a little stretching and yoga! The students liked the space. So did two colleagues (Prof. Rob Hyndman and Prof. Joao Moreira), who teach analytics courses at their universities and visited my courses. Some students complained at first about sitting on the hard floor, so I tried to make sure they didn't sit for long, or at least not passively. My own "old school" bias had made me forget how it feels to sit passively.

Visitor Prof. Moreira experiences the studio
Although I could see the incredible advantages during the semester, I waited till its end to write this post. My perspective now is that teaching analytics in a studio is revolutionary. The space supports deeper learning, beneficial collaboration both within and across groups, better personalization of the teaching level through stronger communication between the instructor and students, and overall a high-energy and positive experience for everyone. One reason "analytics in a studio" is so powerful is the creative aspect of data analytics: you use statistical and data mining foundations, but the actual problem-solving requires creativity and out-of-the-box thinking.

From my experience, the requirements for "analytics in a studio" to work are:
  1. Students must come prepared to class with the needed technical basics (e.g., via reading/video watching/etc.) 
  2. The instructor must be flexible in terms of the specifics taught. I came into class focused on 2-3 main points students needed to learn, brought pre-designed in-class assignments, and created the teams-on-whiteboards challenges on the fly. 
  3. The instructor is no longer physically in the center, but s/he must be an effective integrator, challenger, and guide of the directions taken. This allows students to unleash their abilities, but in a constructive way. It also helps avoid a feeling of "what did we learn?"
How does "analytics in a studio" scale to larger audiences? I am not sure. While class sizes of many Analytics programs are growing to meet the demand, top programs and educators should carefully consider the benefits of smaller class sizes in terms of learning and learning experience. And they should carefully choose their spaces.

Friday, December 19, 2014

New curriculum design guidelines by American Statistical Association: Who will teach?

The American Statistical Association published new "Curriculum Guidelines for Undergraduate Programs in Statistical Science". This is the first update to the guidelines since 2000.
The executive summary lists the key points:
  1. Increased importance of data science
  2. Real applications
  3. More diverse models and approaches
  4. Ability to communicate
This set sounds right on target with what is expected of statisticians in industry (the authors of the report include prominent statisticians in industry). It highlights the current narrow focus of statistics programs as well as their lack of real-world usability. 

I found three notable mentions in the descriptions of the above points:
Point #1: "Students should be fluent in higher-level programming languages and facile with database systems."
Point #2: "Students require exposure to and practice with a variety of predictive and explanatory models in addition to methods for model-building and assessment."
Point #3: "Students need to be able to communicate complex statistical methods in basic terms to managers and other audiences and to visualize results in an accessible manner"
Agree! But are Statistics faculty qualified to teach these topics/skills? Since these capabilities are not built into most Statistics graduate programs, faculty in Statistics departments typically have not been exposed to these topics, nor to methods for teaching them (two different skills!). While one can delegate programming courses to computer science instructors, a gap is being created between the students' abilities and those of the Statistics faculty.

Point #2 talks about prediction and explanation - an extremely important distinction for both practicing and research statisticians. This topic is still quite blurred in the Statistics community, as well as in many other domains, and textbooks have still not caught up, thereby creating a gap in needed teaching materials.

Point #3 is an interesting one: while data visualization is a key concept in Statistics, it is typically used in the context of Exploratory Data Analysis, where charts and summaries help the statistician understand the data prior to analysis. Point #3 talks about a different use of visualization: communication between the statistician and the stakeholder. This requires an approach to visualization different from classic lessons on box plots, histograms, and computing percentiles.

To summarize: great suggestions for improving the undergrad curriculum. But, successful implementation requires professional development for most faculty teaching in such programs.

Let me add my own key point, which is a critical issue underlying many data scandals and sagas: "Students need to understand what population their final cleaned sample generalizes to". The issue of generalization, not just in the sense of statistical inference, is at the heart of using data to come up with insights and decisions for new records and/or in new situations. After sampling, cleaning (!!), pre-processing, and analyzing the data, you often end up with results that are relevant to a very restricted population, which is far from what you initially intended.

As an aside: note the use of the term "Data Science" in the report - a term now claimed by statisticians, operations researchers, computer scientists, and anyone trying to ride the new buzz. What does it mean here? The report reads (page 7):
Although a formal definition of data science is elusive, we concur with the StatsNSF committee statement that data science comprises the “science of planning for, acquisition, management, analysis of, and inference from data.”
Oops - what about non-inference uses such as prediction? And communication?