Wednesday, August 19, 2015

Categorical predictors: how many dummies to use in regression vs. k-nearest neighbors

Recently I've had discussions with several instructors of data mining courses about a fact that is often left out of many books, but is quite important: different treatment of dummy variables in different data mining methods.

Statistics courses that cover linear or logistic regression teach us to be careful when including a categorical predictor variable in our model. Suppose that we have a categorical variable with m categories (e.g., m countries). First, we convert it into m binary variables called dummy variables, D1, D2,..., Dm (e.g., D1=1 if Country=Japan and 0 otherwise; D2=1 if Country=USA and 0 otherwise, etc.). Then, we include m-1 of the dummy variables in the regression model. The key point is to exclude one of the m dummy variables to avoid redundancy. The excluded dummy's category is called the "reference category". Mathematically, it does not matter which dummy you exclude, although the resulting coefficients are interpreted relative to the reference category, so if interpretation is important it's useful to choose the reference category as the one we most want to compare against.
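Here is a minimal sketch of the m-1 encoding, assuming pandas and statsmodels, with made-up country/sales data (the variable names are mine, just for illustration):

```python
# A minimal sketch, assuming pandas and statsmodels; the data are made up.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "Country": ["Japan", "USA", "Japan", "India", "USA", "India"],
    "Sales":   [10.2, 8.1, 11.0, 6.5, 7.9, 6.8],
})

# drop_first=True keeps m-1 dummies; the dropped category ("India",
# first alphabetically) becomes the reference category.
X = sm.add_constant(pd.get_dummies(df["Country"], drop_first=True).astype(float))
model = sm.OLS(df["Sales"], X).fit()
print(model.params)  # each coefficient is relative to the reference category
```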

In linear and logistic regression models, including all m variables will lead to perfect multicollinearity, which will typically cause failure of the estimation algorithm. Smarter software will identify the problem and drop one of the dummies for you. That is why every statistics book or course on regression will emphasize the need to drop one of the dummy variables.
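A quick numeric check (my own illustration, not from any particular textbook) shows the redundancy: with an intercept column and all m dummies, the design matrix is rank-deficient, i.e., perfectly multicollinear:

```python
# The intercept column equals the sum of the m dummy columns, so the
# design matrix cannot be of full rank.
import numpy as np

D = np.array([  # one row per record: [intercept, D1, D2, D3]
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
])
print(np.linalg.matrix_rank(D))  # 3, not 4: D1 + D2 + D3 = intercept
```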

Now comes the surprising part: when using categorical predictors in machine learning algorithms such as k-nearest neighbors (kNN) or classification and regression trees, we keep all m dummy variables. The reason is that in such algorithms we do not create linear combinations of all predictors. A tree, for instance, will choose a subset of the predictors. If we leave out one dummy and that category differs from the other categories in terms of the output of interest, the tree will not be able to detect it! Similarly, in kNN, dropping a dummy means that the effect of belonging to that category is not incorporated into the distance computation.
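A minimal sketch of keeping all m dummies for kNN, assuming scikit-learn and the same made-up data as above:

```python
# All m dummies are kept: every category contributes to the distance.
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

df = pd.DataFrame({
    "Country": ["Japan", "USA", "Japan", "India", "USA", "India"],
    "Sales":   [10.2, 8.1, 11.0, 6.5, 7.9, 6.8],
})

X = pd.get_dummies(df["Country"])  # all m columns, no reference category
knn = KNeighborsRegressor(n_neighbors=2).fit(X, df["Sales"])

new = pd.DataFrame([[1, 0, 0]], columns=X.columns)  # a new record from India
print(knn.predict(new))
# Had we dropped the "India" dummy, India records would encode as all-zeros
# and sit closer to every other country (distance 1 instead of sqrt(2)),
# distorting the neighborhoods.
```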

The only case where dummy variable inclusion is handled identically across methods is a two-category predictor, such as Gender. In that case a single dummy variable suffices in regression, kNN, CART, or any other data mining method.

Monday, March 02, 2015

Psychology journal bans statistical inference; knocks down server

In its recent editorial, the journal Basic and Applied Social Psychology announced that it will no longer accept papers that use classical statistical inference. No more p-values, t-tests, or even... confidence intervals! 
"prior to publication, authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about ‘‘significant’’ differences or lack thereof, and so on)... confidence intervals also are banned from BASP"
Many statisticians would agree that it is high time to move on from p-values and statistical inference to practical significance, estimation, more elaborate non-parametric modeling, and resampling for avoiding assumption-heavy models. This is especially so now, when datasets are becoming larger and technology is able to measure more minute effects. 

In our 2013 paper "Too Big To Fail: Large Samples and the p-value Problem" we raise the serious issue of p-value-based decision making when using very large samples. Many have asked us for solutions that scale up p-values, but we haven't come across one that really works. Our focus was on detecting when your sample is "too large", and we emphasized the importance of focusing on effect magnitude and precision (please do report standard errors!).
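To see the problem concretely, here is a minimal simulation (my own illustration, not from the paper): a practically negligible effect becomes statistically "significant" once the sample is large enough, while its magnitude stays tiny.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.01  # a practically negligible difference in means

for n in [100, 10_000, 1_000_000]:
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_effect, 1.0, n)
    t, p = stats.ttest_ind(a, b)
    print(f"n={n:>9,}  observed effect={b.mean() - a.mean():+.4f}  p={p:.4f}")
```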

Machine learners would probably advocate finally moving to predictive modeling and evaluation. Predictive power is straightforward to measure, although it isn't always what social science researchers are looking for.

But wait. What this editorial dictates is only half a revolution: it says what it will ban, but it does not offer a cohesive alternative beyond simple summary statistics. Focusing on effect magnitude is great for making results matter, but without reporting standard errors or confidence intervals, we don't know anything about the uncertainty of the effect. Abandoning any metric that relies on "had the experiment been replicated" is dangerous and misleading. First, this replication is a philosophical device rather than an actual re-experimentation. Second, to test whether effects found in a sample generalize to a population of interest, we need the ability to replicate the results. Standard errors give some indication of how replicable the results are, under the same conditions.
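One assumption-light way to quantify that uncertainty, in the resampling spirit mentioned above, is the bootstrap. A minimal sketch, with made-up data standing in for an observed effect:

```python
# Resampling the observed sample yields a standard error and an interval
# for the effect without distributional assumptions.
import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(loc=0.3, scale=1.0, size=200)  # stand-in for effect data

boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5_000)
])
print("bootstrap SE:", boot_means.std(ddof=1))
print("95% percentile interval:", np.percentile(boot_means, [2.5, 97.5]))
```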

Controversial editorial leads to heavy traffic on journal server

BASP's revolutionary decision has been gaining attention outside of psychology (a great tactic to promote a journal!), so much so that at times it is difficult to reach the controversial editorial. Some statisticians have blogged about this decision, others are tweeting. This is a great way to open a discussion about empirical analysis in the social sciences. However, we need to come up with alternatives that focus on uncertainty and the ultimate goal of generalization.

Saturday, February 07, 2015

Teaching spaces: "Analytics in a Studio"

My first semester at NTHU has been a great learning experience. I introduced and taught two new courses in our new Business Analytics concentration (data mining and forecasting). Both courses met once a week for a 3-hour session for a full semester (18 weeks). Although I've taught these courses in different forms, in different countries, and to different audiences, I had a special discovery this time. I discovered the critical role of the learning space in the quality of teaching and learning, especially for a topic that combines technical, creative, and communication skills.

"Case study" classroom
In my many years of experience as a student and later as a professor at multiple universities, I've experienced two types of spaces: a lecture hall and a "case study" classroom. While the latter is more conducive to in-class discussions, both spaces put the instructor (and his/her slides) at the front, separated from most of the students, and place the students in rows. In both cases the instructor is typically standing or moving around, while the students are immobile. Not being exposed to alternatives, I am ashamed to say that I never doubted this arrangement. Until this semester.

Like all discoveries, it started from a challenge: the classroom allocated for my courses was a wide room with two long rows, hardly any space for the instructor and no visibility of the slides for most of the students on the sides. My courses had 20-30 students each. My first attempt was to rearrange the tables to create a U-shape, so that students could see each other and the slides. In hindsight, I was trying to create more of a "case study" environment. After one session I realized it didn't work. The U was too long and narrow and there was a feeling of suffocation. And stagnancy. One of my classes was transferred to a case-type classroom. I was relieved. But for the other class there was no such classroom available. I examined a few different classrooms, but they were all lecture halls suitable for larger audiences.

[Photo: Teams tackle a challenge using a whiteboard]
And then, I discovered "the studio". Intended for design workshops, this was a room with no tables or chairs, with walls that are whiteboards plus double-sided whiteboards on wheels. In a corner was a stack of hard sponge blocks and a few low foldable square tables. There's a projector and a screen. I decided to take the plunge with the data mining course, since it is designed as a blended course where class time is devoted to discussions and hands-on assignments and experiences. [Before coming to class, students read and watch videos, take a short quiz, and contribute to an online discussion].

Here is how we used the space: At least half of each session engaged teams of students in a problem/question that they needed to tackle using a whiteboard. The challenges I came up with emerged from my interaction with the students - from the online discussion board, from discussing the online quizzes, and from confusion/difficulties in the pre-designed in-class assignments. After each team worked on their board, we all moved from board to board: each team explained their approach, and I highlighted something about each solution/attempt. This provided great learning for everyone, including myself, since different teams usually approached the problems in different ways and encountered different problems or insights.
[Photo: Students give feedback on other teams' proposals]

The setup was also conducive to team project feedback. After each team presented their proposal, the other teams provided them feedback by writing on their "wall" (whiteboard). This personal touch - rather than an email or discussion board - seems to make a difference in how the feedback is given and perceived.

Smartphones were often used to take photos of different boards - their own as well as others'.

[Photo: Student demos software to others]
During periods of the sessions where students needed to work on laptops, many chose to spread out on the floor - a more natural posture for many folks than sitting at a desk. Some used the sponge blocks to prop up their laptops. A few used a square table where 4 people faced each other.

We also used the space to start class with a little stretching and yoga! The students liked the space. So did two colleagues (Prof. Rob Hyndman and Prof. Joao Moreira) who teach analytics courses at their universities and visited my courses. Some students complained at first about sitting on the hard floor, so I tried to make sure they didn't sit for long, or at least not passively. My own "old school" bias made me forget how it feels to sit passively.

[Photo: Visitor Prof. Moreira experiences the studio]
Although I could see the incredible advantages during the semester, I waited till its end to write this post. My perspective now is that teaching analytics in a studio is revolutionary. The space supports deeper learning, beneficial collaboration both within and across groups, better personalization of the teaching level through stronger communication between the instructor and students, and overall a high-energy and positive experience for everyone. One reason "analytics in a studio" is so powerful is the creative aspect of data analytics: you use statistical and data mining foundations, but the actual problem-solving requires creativity and out-of-the-box thinking.

From my experience, the requirements for "analytics in a studio" to work are:
  1. Students must come prepared to class with the needed technical basics (e.g., via reading/video watching/etc.) 
  2. The instructor must be flexible in terms of the specifics taught. I came into class focused on 2-3 main points students needed to learn, brought pre-designed in-class assignments, and designed teams-on-whiteboards challenges on the fly.
  3. The instructor is no longer physically in the center, but s/he must be an effective integrator, challenger, and guide of the directions taken. This allows students to unleash their abilities, but in a constructive way. It also helps avoid a feeling of "what did we learn?"
How does "analytics in a studio" scale to larger audiences? I am not sure. While class sizes of many Analytics programs are growing to meet the demand, top programs and educators should carefully consider the benefits of smaller class sizes in terms of learning and learning experience. And they should carefully choose their spaces.

Friday, December 19, 2014

New curriculum design guidelines by American Statistical Association: Who will teach?

The American Statistical Association published new "Curriculum Guidelines for Undergraduate Programs in Statistical Science". This is the first update to the guidelines since 2000.
The executive summary lists the key points:
  1. Increased importance of data science
  2. Real applications
  3. More diverse models and approaches
  4. Ability to communicate
This set sounds right on target with what is expected of statisticians in industry (the authors of the report include prominent statisticians in industry). It highlights the current narrow focus of statistics programs as well as their lack of real-world usability. 

I found three notable mentions in the descriptions of the above points:
Point #1: "Students should be fluent in higher-level programming languages and facile with database systems."
Point #2: "Students require exposure to and practice with a variety of predictive and explanatory models in addition to methods for model-building and assessment."
Point #3: "Students need to be able to communicate complex statistical methods in basic terms to managers and other audiences and to visualize results in an accessible manner"
Agree! But - are Statistics faculty qualified to teach these topics/skills? Since these capabilities are not built into most Statistics graduate programs, faculty in Statistics departments typically have not been exposed to these topics, nor to methods for teaching them (two different skills!). While one can delegate programming to computer science instructors, a gap is being created between the abilities of the students and those of the Statistics faculty.

Point #2 talks about prediction and explanation - an extremely important distinction for statisticians in both practice and research. This topic is still quite blurred in the Statistics community as well as in many other domains, and textbooks have still not caught up, thereby creating a gap in needed teaching materials.
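To make the distinction concrete, here is a minimal sketch (my own illustration, using scikit-learn and simulated data) of the two evaluation mindsets: explanatory modeling judges in-sample fit and coefficients, while predictive modeling judges error on a holdout set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=500)

# Explanatory view: fit on all the data, inspect coefficients and fit.
full = LinearRegression().fit(X, y)
print("coefficients:", full.coef_, " R^2:", full.score(X, y))

# Predictive view: hold out data and measure out-of-sample error.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
pred = LinearRegression().fit(X_tr, y_tr)
print("holdout RMSE:", mean_squared_error(y_te, pred.predict(X_te)) ** 0.5)
```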

Point #3 is an interesting one: while data visualization is a key concept in Statistics, it is typically used in the context of Exploratory Data Analysis, where charts and summaries are used by the statistician to understand the data prior to analysis. Point #3 talks about a different use of visualization: communication between the statistician and the stakeholder. This requires an approach to visualization different from what is taught in classic classes on box plots, histograms, and computing percentiles.

To summarize: great suggestions for improving the undergrad curriculum. But, successful implementation requires professional development for most faculty teaching in such programs.

Let me add my own key point, which is a critical issue underlying many data scandals and sagas: "Students need to understand what population their final cleaned sample generalizes to". The issue of generalization, not just in the sense of statistical inference, is at the heart of using data to come up with insights and decisions for new records and/or in new situations. After sampling, cleaning (!!), pre-processing, and analyzing the data, you often end up with results that are relevant to a very restricted population, which is far from what you initially intended.

As an aside: note the use of the term "Data Science" in the report - a term now claimed by statisticians, operations researchers, computer scientists, and anyone trying to ride the new buzz. What does it mean here? The report reads (page 7):
Although a formal definition of data science is elusive, we concur with the StatsNSF committee statement that data science comprises the “science of planning for, acquisition, management, analysis of, and inference from data.”
Oops - what about non-inference uses such as prediction? And communication?

Thursday, October 16, 2014

What's in a name? "Data" in Mandarin Chinese

The term "data", now popularly used in many languages, is not as innocent as it seems. The biggest controversy that I've been aware of is whether the English term "data" is singular or plural. The tone of an entire article would be different based on the author's decision.

In Hebrew, the word is in plural (Netunim, with the final "im" signifying plural), so no question arises.

Today I discovered another "data" duality, this time in Mandarin Chinese. In Taiwan, the term used is 資料 (Zīliào), while in Mainland China it is 數據 (Shùjù). Which one to use? What is the difference? I did a little research and tried a few popularity tests:

  1. Google Translate from Chinese to English translates both terms to "data". But translating "data" from English to Chinese gives 數據 (Shùjù), with the other term appearing as a secondary option. Here we also learn that 資料 (Zīliào) means "material".
  2. Chinese Wikipedia's main data article (embarrassingly poor) is for 數據 (Shùjù) and the article for 資料 (Zīliào) redirects you to the main article.
  3. A Google search of each term leads to surprising results on number of hits:

[Screenshots: Google search results for the two "data" terms, 資料 (Zīliào) and 數據 (Shùjù)]
I asked a few colleagues from different Chinese-speaking countries and learned further that 資料 (Zīliào) translates to information. A Google Images search brings images of "Information". This might also explain the double hit rate. A duality between data and information is especially interesting given the relationship between the two (and my related work on Information Quality with Ron Kenett).


So what about Big Data?  Here too there appear to be different possible terms, yet the most popular seems to be 大数据 (Dà shùjù), which also has a reasonably respectable Wikipedia article.

Thanks to my learned colleagues Chun-houh Chen (Academia Sinica), Khim Yong Goh (National University of Singapore), and Mingfeng Lin (University of Arizona) for their inputs.

Friday, September 26, 2014

Humane and Socially Responsible Analytics: A new concentration at National Tsing Hua University

This Fall, I'm introducing two new elective courses at NTHU's Institute of Service Science: Business Analytics using Data Mining and Business Analytics using Forecasting (if you're wondering about the difference, see an earlier post). The two new courses join three other elective courses to form the new concentration in Business Analytics. Courses in this concentration are aimed at getting students into the world of analytics by doing. The courses are designed as hands-on, project-oriented courses, with global contests, allowing students to experience different tools. Most importantly, our program is focused on humane and socially responsible analytics. We discuss and consider analytics applications from a more holistic view, considering not only the business advantage to a company or organization, but also the implications for individuals, communities, the environment, society, and beyond. And our courses are sufficiently long (18 weeks of 3-hour weekly sessions) to allow for in-depth experience and learning.

Forget buzzwords. It's about intention.
"Holistic analytics" is a term used by marketing analytics folks. It typically means "understand your customer really well (360 degrees) so that you can optimize your profit" (here's an example). We've seen business school courses being built around buzzwords such as "ethics" and "corporate social responsibility". The honest ones are focused on changing the mindset in terms of what we're optimizing.

To the best of my knowledge, the NTHU Business Analytics (BA) concentration is the only program in Taiwan offering such a combination of business and analytics. And I believe it is the only BA program globally focusing strongly on humane and socially responsible analytics (if you know of other such programs - please let me know so we can explore synergies!).

Friday, September 19, 2014

India redefines "reciprocity"; Israeli professionals pay the price

After a few years of employment at the Indian School of Business (in 2010 as a visitor and later as a tenured SRITNE Chaired Professor of Data Analytics), the time has come for me to get a new Employment Visa. As an Israeli-American, I decided to apply for the visa using my Israeli passport. I was almost on my way to the Indian embassy when I discovered, to my horror, that the fee is over US$1,000 for a one-year visa on an Israeli passport. The more interesting part is that Israelis are charged a higher fee than any other passport holder. For holders of all other passports except those of the UK and UAE, the fee is between $100 and $200.