Friday, December 19, 2014

New curriculum design guidelines by American Statistical Association: Who will teach?

The American Statistical Association published new "Curriculum Guidelines for Undergraduate Programs in Statistical Science". This is the first update to the guidelines since 2000.
The executive summary lists the key points:
  1. Increased importance of data science
  2. Real applications
  3. More diverse models and approaches
  4. Ability to communicate
This set sounds right on target with what is expected of statisticians in industry (the authors of the report include prominent statisticians in industry). It highlights the current narrow focus of statistics programs as well as their lack of real-world usability. 

I found three notable mentions in the descriptions of the above points:
Point #1: "Students should be fluent in higher-level programming languages and facile with database systems."
Point #2: "Students require exposure to and practice with a variety of predictive and explanatory models in addition to methods for model-building and assessment."
Point #3: "Students need to be able to communicate complex statistical methods in basic terms to managers and other audiences and to visualize results in an accessible manner"
Agree! But - are Statistics faculty qualified to teach these topics/skills? Since these capabilities are not built into most Statistics graduate programs, faculty in Statistics departments typically have not been exposed to these topics, nor to methods for teaching them (two different skills!). While one can delegate programming to computer science instructors, a gap is being created between the students' abilities and the Statistics faculty abilities.

Point #2 talks about prediction and explanation - an extremely important distinction for both practice and research statisticians. This topic is still quite blurred in the Statistics community as well as in many other domains , and textbooks have still not caught up, thereby creating a gap in needed teaching materials.

Point #3 is an interesting one: while data visualization is a key concept in Statistics, it is typically used in the context of the Exploratory Data Analysis, where charts and summaries are used by the statistician to understand the data prior to analysis. Point #3 talks about a different use of visualization, for the purpose of communication between the statistician and the stakeholder. This requires a different approach to visualization, different from classic classes on box plots, histograms, and computing percentiles.

To summarize: great suggestions for improving the undergrad curriculum. But, successful implementation requires professional development for most faculty teaching in such programs.

Let me add my own key point, which is a critical issue underlying many data scandals and sagas: "Students need to understand what population their final cleaned sample generalizes to". The issue of generalization, not just in the sense of statistical inference, is at the heart of using data to come up with insights and decisions for new records and/or in new situations. After sampling, cleaning (!!), pre-processing, and analyzing the data, you often end up with results that are relevant to a very restricted population, which is far from what you initially intended.

On the aside: note the use of the term "Data Science" in the report - a term now claimed by statisticians, operations researchers, computer scientists and anyone trying to ride the new buzz. What does it mean here? The report reads (page 7):
Although a formal definition of data science is elusive, we concur with the StatsNSF committee statement that data science comprises the “science of planning for, acquisition, management, analysis of, and inference from data.”
Oops - what about non-inference uses such as prediction? and communication?

Thursday, October 16, 2014

What's in a name? "Data" in Mandarin Chinese

The term "data", now popularly used in many languages, is not as innocent as it seems. The biggest controversy that I've been aware of is whether the English term "data" is singular or plural. The tone of an entire article would be different based on the author's decision.

In Hebrew, the word is in plural (Netunim, with the final "im" signifying plural), so no question arises.

Today I discovered another "data" duality, this time in Mandarin Chinese. In Taiwan, the term used is 資料 (Zīliào), while in Mainland China it is 數據 (Shùjù). Which one to use? What is the difference? I did a little research and tried a few popularity tests:

  1. Google Translate from Chinese to English translates both terms to "data". But Chinese-to-English translates data to 數據 (Shùjù) with the other term appearing as secondary. Here we also learn that 資料 (Zīliào) means "material".
  2. Chinese Wikipedia's main data article (embarrassingly poor) is for 數據 (Shùjù) and the article for 資料 (Zīliào) redirects you to the main article.
  3. A Google search of each term leads to surprising results on number of hits:

Search results for "data" term Zīliào
Search results for "data" term Shùjù
I asked a few colleagues from different Chinese-speaking countries and learned further that 資料 (Zīliào) translates to information. A Google Images search brings images of "Information". This might also explain the double hit rate. A duality between data and information is especially interesting given the relationship between the two (and my related work on Information Quality with Ron Kenett).

So what about Big Data?  Here too there appear to be different possible terms, yet the most popular seems to be 大数据 (Dà shùjù), which also has a reasonably respectable Wikipedia article.

Thanks to my learned colleagues Chun-houh Chen (Academia Sinica), Khim Yong Goh (National University of Singapore), and Mingfeng Lin (University of Arizona) for their inputs.

Friday, September 26, 2014

Humane and Socially Responsible Analytics: A new concentration at National Tsing Hua University

This Fall, I'm introducing two new elective courses at NTHU's Institute of Service Science: Business Analytics using Data Mining and Business Analytics using Forecasting (if you're wondering about the difference, see an earlier post). The two new courses join three other elective courses to form the new concentration in Business Analytics. Courses in this concentration are aimed at getting students into the world of analytics by doing. The courses are designed as hands-on, project-oriented courses, with global contests, that allow students to experience different tools. Most importantly, our program is focused on humane and socially responsible analytics. We discuss and consider analytics applications from a more holistic view, considering not only the business advantage to a company or organization, but also implications to individuals, communities, the environment, society, and beyond. And our courses are sufficiently long (18 weeks of 3-hour weekly sessions) to allow for in-depth experience and learning.

Forget buzzwords. It's about intention.
"Holistic analytics" is a term used by marketing analytics folks. It typically means "understand your customer really well (360 degrees) so that you can optimize your profit" (here's an example). We've seen business school courses being built around buzzwords such as "ethics" and "corporate social responsibility". The honest ones are focused on changing the mindset in terms of what we're optimizing.

To the best of my knowledge, the NTHU Business Analytics (BA) concentration is the only program in Taiwan offering such a combination of business and analytics. And I believe it is the only BA program globally focusing strongly on humane and socially responsible analytics (if you know of other such programs - please let me know so we can explore synergies!).

Friday, September 19, 2014

India redefines "reciprocity"; Israeli professionals pay the price

After a few years of employment at the Indian School of Business (in 2010 as a visitor and later as a tenured SRITNE Chaired Professor of Data Analytics), the time has come for me to get a new Employment Visa. As an Israeli-American, I decided to apply for the visa using my Israeli passport. I was almost on my way to the Indian embassy when I discovered, to my horror, that the fee is over USD $1000 for a one-year visa on an Israeli passport. The more interesting part is that Israelis are charged the highest fee compared to any other passport holder. For all other countries except UK and UAE the fee is between $100-200.

Wednesday, April 02, 2014

Parallel coordinate plot in Tableau: a workaround

The parallel coordinate plot is useful for visualizing multivariate data in a dis-aggregated way, where we have multiple numerical measurements for each record. A scatter plot displays two measurements for each record by using the two axes. A parallel coordinate plot can display many measurements for each record, by using many (parallel) axes - one for each measurement.

Saturday, March 15, 2014

Can women be professors or doctors? Not according to Jet Airways

I am already used to the comical scene at airports in Asia, where a sign-holder with "Professor Galit Shmueli" sees us walk in his/her direction and right away rushes to my husband. Whether or not the stereotype is based on actual gender statistics of professors in Asia is a good question.

What I don't find amusing is when a corporate like Jet Airways, under the guise of "celebrating international women's day", follows the same stereotype. When I tried to book a flight on, it would not allow me to use the Women's Day discount code if I chose title "Prof" or "Dr". Only if I chose "Mrs" or "Ms" would it work.

A Professor does not qualify as a woman

So I bowed low and switched the title in the reservation to "Mrs", only to get the error message

After scratching my head, I realized that I was (unfortunately?) logged into my JetPrivilege account where my title is "Dr" - a detail set at the time of the account creation that I cannot modify online. The workaround that I found was to dissociate the passenger from the account owner, and book for a "Mrs. Galit Shmueli", who obviously cannot be a Professor.
"Conflicting" information

For those who won't tolerate the humiliation of giving up Dr/Prof but are determined to get the discount in principle, a solution is to include another (non-Professor or non-Doctor) "Mrs" or "Ms" in the same booking. Yes, I'm being cynical.

In case you're thinking: "but how will the airline's booking system identify the passenger's gender if you use Prof or Dr?" - I can think of a few easy solutions such as adding the option "Prof. (Ms.)" or simply asking the traveler's gender, as common in train bookings. In short, it's beyond blaming "technology".

One thing is clear: According to Jet Airways, you just can't have it all. A JetPrivilege account with title "Prof", flying solo, and availing the Women's Day discount with your JetPrivilege number.

My only consolation is that during the flight I'll be able to enjoy "audio tracks from such leading female international artists as Beyonce Knowles, Lady Gaga, Jennifer Hudson, Taylor Swift, Kelly Clarkson and Rihanna on the airline's award-winning in-flight entertainment system." Luckily, Jet Airways doesn't include "artist" as a title.

Thursday, March 06, 2014

The use of dummy variables in predictive algorithms

Anyone who has taken a course in statistics that covers linear regression has heard some version of the rule regarding pre-processing categorical predictors with more than two categories and the need to factor them into binary dummy/indicator variables:
"If a variable has k levels, you can create only k-1 indicators. You have to choose one of the k categories as a "baseline" and leave out its indicator." (from Business Statistics by Sharpe, De Veaux & Velleman)
Technically, one can easily create k dummy variables for k categories in any software. The reason for not including all k dummies as predictors in a linear regression is to avoid perfect multicollinearity, where an exact linear relationship exists between the k predictors. Perfect multicollinearity causes computational and interpretation challenges (see slide #6). This k-dummies issue is also called the Dummy Variable Trap.

While these guidelines are required for linear regression, which other predictive models require them? The k-1 dummy rule applies to models where all the predictors are considered together, as a linear combination. Therefore, in addition to linear regression models, the rule would apply to logistic regression models, discriminant analysis, and in some cases to neural networks.

What happens if we use k-1 dummies in other predictive models? 
The choice of the dropped dummy variable does not affect the results of regression models, but can affect other methods. For instance, let's consider a classification/regression tree. In a tree, predictors are evaluated one-by-one, and therefore omitting one of the k dummies can result in an inferior predictive model. For example, suppose we have 12 monthly dummies and that in reality only January is different from other months (the outcome differs between January and other months). Now, we run a tree omitting the January dummy as an input and keep the other 11 monthly dummies. The only way the tree might discover the January effect is by creating 11 levels of splits by each of the dummies. This is much less efficient than a single split on the January dummy.

This post is inspired by a discussion in the recent Predictive Analytics 1 online course. This topic deserves more than a short post, yet I haven't seen a thorough discussion anywhere.