Thursday, October 16, 2014

What's in a name? "Data" in Mandarin Chinese

The term "data", now popularly used in many languages, is not as innocent as it seems. The biggest controversy that I've been aware of is whether the English term "data" is singular or plural. The tone of an entire article would be different based on the author's decision.

In Hebrew, the word is in plural (Netunim, with the final "im" signifying plural), so no question arises.

Today I discovered another "data" duality, this time in Mandarin Chinese. In Taiwan, the term used is 資料 (Zīliào), while in Mainland China it is 數據 (Shùjù). Which one to use? What is the difference? I did a little research and tried a few popularity tests:

  1. Google Translate from Chinese to English translates both terms to "data". But English-to-Chinese translates "data" to 數據 (Shùjù), with the other term appearing as secondary. Here we also learn that 資料 (Zīliào) means "material".
  2. Chinese Wikipedia's main data article (embarrassingly poor) is for 數據 (Shùjù) and the article for 資料 (Zīliào) redirects you to the main article.
  3. A Google search of each term leads to surprising results on number of hits:

Search results for "data" term Zīliào
Search results for "data" term Shùjù

I asked a few colleagues from different Chinese-speaking countries and learned further that 資料 (Zīliào) translates to "information". A Google Images search brings up images of "Information". This might also explain the double hit rate. A duality between data and information is especially interesting given the relationship between the two (and my related work on Information Quality with Ron Kenett).

So what about Big Data? Here too there appear to be different possible terms, yet the most popular seems to be 大数据 (Dà shùjù), which also has a reasonably respectable Wikipedia article.

Thanks to my learned colleagues Chun-houh Chen (Academia Sinica), Khim Yong Goh (National University of Singapore), and Mingfeng Lin (University of Arizona) for their inputs.

Friday, September 26, 2014

Humane and Socially Responsible Analytics: A new concentration at National Tsing Hua University

This Fall, I'm introducing two new elective courses at NTHU's Institute of Service Science: Business Analytics using Data Mining and Business Analytics using Forecasting (if you're wondering about the difference, see an earlier post). The two new courses join three other elective courses to form the new concentration in Business Analytics. Courses in this concentration are aimed at getting students into the world of analytics by doing. The courses are designed as hands-on, project-oriented courses, with global contests, that allow students to experience different tools. Most importantly, our program is focused on humane and socially responsible analytics. We discuss and consider analytics applications from a more holistic view, considering not only the business advantage to a company or organization, but also the implications for individuals, communities, the environment, society, and beyond. And our courses are sufficiently long (18 weeks of 3-hour weekly sessions) to allow for in-depth experience and learning.

Forget buzzwords. It's about intention.
"Holistic analytics" is a term used by marketing analytics folks. It typically means "understand your customer really well (360 degrees) so that you can optimize your profit" (here's an example). We've seen business school courses being built around buzzwords such as "ethics" and "corporate social responsibility". The honest ones are focused on changing the mindset in terms of what we're optimizing.

To the best of my knowledge, the NTHU Business Analytics (BA) concentration is the only program in Taiwan offering such a combination of business and analytics. And I believe it is the only BA program globally focusing strongly on humane and socially responsible analytics (if you know of other such programs - please let me know so we can explore synergies!).

Friday, September 19, 2014

India redefines "reciprocity"; Israeli professionals pay the price

After a few years of employment at the Indian School of Business (in 2010 as a visitor and later as a tenured SRITNE Chaired Professor of Data Analytics), the time has come for me to get a new Employment Visa. As an Israeli-American, I decided to apply for the visa using my Israeli passport. I was almost on my way to the Indian embassy when I discovered, to my horror, that the fee is over USD 1000 for a one-year visa on an Israeli passport. The more interesting part is that Israelis are charged the highest fee of all passport holders. For all other countries except the UK and UAE, the fee is between USD 100 and 200.

Wednesday, April 02, 2014

Parallel coordinate plot in Tableau: a workaround

The parallel coordinate plot is useful for visualizing multivariate data in a disaggregated way, where we have multiple numerical measurements for each record. A scatter plot displays two measurements for each record by using the two axes. A parallel coordinate plot can display many measurements for each record by using many (parallel) axes, one for each measurement.
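
To make the idea concrete, here is a minimal Python sketch (not the Tableau workaround itself) of the geometry behind the plot: each record becomes one polyline with a vertex on every axis. Rescaling each measurement to the range 0 to 1 so that all axes share the same height is a common choice that I am assuming here. The function name and the small `cars` dataset are hypothetical.

```python
def parallel_coordinates(records, measurements):
    """Return one polyline per record: a list of (axis_index, scaled_value) points."""
    # Minimum and maximum of each measurement, used to rescale every axis to [0, 1].
    lows = {m: min(r[m] for r in records) for m in measurements}
    highs = {m: max(r[m] for r in records) for m in measurements}

    polylines = []
    for r in records:
        line = []
        for axis, m in enumerate(measurements):
            span = highs[m] - lows[m]
            # A constant measurement has zero span; place it mid-axis.
            scaled = 0.5 if span == 0 else (r[m] - lows[m]) / span
            line.append((axis, scaled))
        polylines.append(line)
    return polylines


# Hypothetical records with three numerical measurements each.
cars = [
    {"mpg": 20, "weight": 3000, "horsepower": 100},
    {"mpg": 30, "weight": 2000, "horsepower": 150},
    {"mpg": 40, "weight": 2500, "horsepower": 50},
]
lines = parallel_coordinates(cars, ["mpg", "weight", "horsepower"])
```

Each inner list in `lines` holds the vertices of one record's line, one vertex per parallel axis. Drawing the lines is then left to whatever charting tool is at hand.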

Saturday, March 15, 2014

Can women be professors or doctors? Not according to Jet Airways

I am already used to the comical scene at airports in Asia, where a sign-holder with "Professor Galit Shmueli" sees us walk in his/her direction and right away rushes to my husband. Whether or not the stereotype is based on actual gender statistics of professors in Asia is a good question.

What I don't find amusing is when a corporate like Jet Airways, under the guise of "celebrating international women's day", follows the same stereotype. When I tried to book a flight online, the system would not allow me to use the Women's Day discount code if I chose the title "Prof" or "Dr". It would work only if I chose "Mrs" or "Ms".

A Professor does not qualify as a woman

So I bowed low and switched the title in the reservation to "Mrs", only to get the error message

After scratching my head, I realized that I was (unfortunately?) logged into my JetPrivilege account where my title is "Dr" - a detail set at the time of the account creation that I cannot modify online. The workaround that I found was to dissociate the passenger from the account owner, and book for a "Mrs. Galit Shmueli", who obviously cannot be a Professor.

"Conflicting" information

For those who won't tolerate the humiliation of giving up Dr/Prof but are determined to get the discount in principle, a solution is to include another (non-Professor or non-Doctor) "Mrs" or "Ms" in the same booking. Yes, I'm being cynical.

In case you're thinking: "but how will the airline's booking system identify the passenger's gender if you use Prof or Dr?" - I can think of a few easy solutions such as adding the option "Prof. (Ms.)" or simply asking the traveler's gender, as is common in train bookings. In short, this cannot be blamed on "technology".

One thing is clear. According to Jet Airways, you just can't have it all: a JetPrivilege account with title "Prof", flying solo, and availing the Women's Day discount with your JetPrivilege number.

My only consolation is that during the flight I'll be able to enjoy "audio tracks from such leading female international artists as Beyonce Knowles, Lady Gaga, Jennifer Hudson, Taylor Swift, Kelly Clarkson and Rihanna on the airline's award-winning in-flight entertainment system." Luckily, Jet Airways doesn't include "artist" as a title.

Thursday, March 06, 2014

The use of dummy variables in predictive algorithms

Anyone who has taken a course in statistics that covers linear regression has heard some version of the rule regarding pre-processing categorical predictors with more than two categories and the need to factor them into binary dummy/indicator variables:
"If a variable has k levels, you can create only k-1 indicators. You have to choose one of the k categories as a "baseline" and leave out its indicator." (from Business Statistics by Sharpe, De Veaux & Velleman)
Technically, one can easily create k dummy variables for k categories in any software. The reason for not including all k dummies as predictors in a linear regression is to avoid perfect multicollinearity, where an exact linear relationship exists between the k predictors. Perfect multicollinearity causes computational and interpretation challenges (see slide #6). This k-dummies issue is also called the Dummy Variable Trap.
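
Here is a minimal pure-Python sketch of why the full set of k dummies causes trouble. The `make_dummies` helper and the `seasons` data are hypothetical and written only for this illustration. With all k dummies, every row sums to exactly 1, which is identical to the intercept column of a linear regression. That identity is the exact linear relationship behind perfect multicollinearity. Dropping one dummy breaks it.

```python
def make_dummies(values, drop=None):
    """Create 0/1 indicator columns for a categorical variable.

    Returns (column_names, rows). If `drop` names a category, its indicator
    is left out, so that category serves as the baseline.
    """
    levels = sorted(set(values))
    kept = [level for level in levels if level != drop]
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


# Hypothetical categorical predictor with k = 4 categories.
seasons = ["winter", "spring", "summer", "fall", "winter", "summer"]

# All k dummies: each row sums to 1, the same as a column of ones (the intercept).
names_k, full = make_dummies(seasons)
row_sums_full = [sum(row) for row in full]

# k-1 dummies with "winter" as baseline: the rows no longer all sum to 1.
names_k1, reduced = make_dummies(seasons, drop="winter")
row_sums_reduced = [sum(row) for row in reduced]
```

In `row_sums_reduced`, the baseline records show up as rows of all zeros, so their effect is absorbed into the intercept.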

While these guidelines are required for linear regression, which other predictive models require them? The k-1 dummy rule applies to models where all the predictors are considered together, as a linear combination. Therefore, in addition to linear regression models, the rule would apply to logistic regression models, discriminant analysis, and in some cases to neural networks.

What happens if we use k-1 dummies in other predictive models? 
The choice of the dropped dummy variable does not affect the fit or predictions of regression models (only the interpretation of the coefficients), but it can affect other methods. For instance, let's consider a classification/regression tree. In a tree, predictors are evaluated one-by-one, and therefore omitting one of the k dummies can result in an inferior predictive model. For example, suppose we have 12 monthly dummies and that in reality only January is different from other months (the outcome differs between January and other months). Now, we run a tree that omits the January dummy as an input and keeps the other 11 monthly dummies. The only way the tree might discover the January effect is through 11 levels of splits, one on each of the remaining dummies. This is much less efficient than a single split on the January dummy.
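
The January example can be checked with a small pure-Python sketch. The data are made up (10 records per month, outcome equal to 1 only in January), and as a simplification I count misclassifications after a single split rather than using the impurity measures that real tree algorithms apply. Treat it as an illustration of the argument, not as a tree implementation.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical data: 10 records per month; the outcome is 1 only in January.
records = [(month, 1 if month == "Jan" else 0) for month in MONTHS for _ in range(10)]


def errors_after_split(dummy_month):
    """Count misclassified records after a single split on one month dummy,
    predicting the majority outcome within each of the two branches."""
    errors = 0
    for dummy_value in (0, 1):
        branch = [y for month, y in records if int(month == dummy_month) == dummy_value]
        ones = sum(branch)
        errors += min(ones, len(branch) - ones)
    return errors


# With the January dummy available, one split separates the outcome perfectly.
errors_with_jan = errors_after_split("Jan")

# With the January dummy omitted, no single split on another dummy helps.
other_months = [m for m in MONTHS if m != "Jan"]
errors_without_jan = {m: errors_after_split(m) for m in other_months}

# January is isolated only where all 11 remaining dummies equal 0,
# that is, at the end of a chain of 11 splits.
all_zero = [(month, y) for month, y in records
            if all(month != m for m in other_months)]
```

Here `errors_with_jan` is 0, while every entry of `errors_without_jan` still misclassifies the 10 January records, the same as making no split at all.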

This post is inspired by a discussion in the recent Predictive Analytics 1 online course. This topic deserves more than a short post, yet I haven't seen a thorough discussion anywhere.

Thursday, November 28, 2013

Running a data mining contest on Kaggle

Following last year's success, I've decided once again to introduce a data mining contest in my Business Analytics using Data Mining course at the Indian School of Business. Last year, I used two platforms: CrowdAnalytix and Kaggle. This year I am again using Kaggle. They offer free competition hosting for university instructors, called Kaggle InClass.

Setting up a competition on Kaggle is not trivial and I'd like to share some tips that I discovered to help fellow colleagues. Even if you successfully hosted a Kaggle contest a while ago, some things have changed (as I've discovered). With some assistance from the Kaggle support team, who are extremely helpful, I was able to decipher the process. So here goes:

Step #1: get your dataset into the right structure. Your initial dataset should include input and output columns for all records (assuming that the goal is to predict the outcome from the inputs). It should also include an ID column with running index numbers.

  • Save this as an Excel or CSV file. 
  • Split the records into two datasets: a training set and a test set. 
  • Keep the training and test datasets in separate CSV files. For the test set, remove the outcome column(s).
  • Kaggle will split the test set into private and public subsets. It will score each of them separately. Results for the public records will appear in the leaderboard. Only you will see the results for the private subset. If you want to assign the records yourself to public/private, create a column Usage in the test dataset and type Private or Public for each record.
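
The preparation in Step #1 can be sketched in Python using only the standard library. Everything named here is my own assumption for illustration: the column names (`ID`, `x1`, `x2`, `y`), the 40% test fraction, the 50/50 Public/Private assignment, and the helper functions. The sketch builds CSV text in memory, so saving it to files is left to you.

```python
import csv
import io
import random


def split_train_test(rows, outcome_col, test_fraction=0.4, seed=1):
    """Randomly split rows into a training set and a test set without the outcome."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test_full, train = shuffled[:n_test], shuffled[n_test:]
    # Participants get the test records with the outcome column removed.
    test = [{k: v for k, v in r.items() if k != outcome_col} for r in test_full]
    return train, test


def assign_usage(test_rows, public_fraction=0.5):
    """Optional: label each test record Public or Private in a Usage column."""
    n_public = int(len(test_rows) * public_fraction)
    return [dict(r, Usage="Public" if i < n_public else "Private")
            for i, r in enumerate(test_rows)]


def to_csv(rows):
    """Render a list of dicts as CSV text, with column names in the first row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical dataset: an ID column with running index numbers, two inputs, one outcome.
rows = [{"ID": i, "x1": i * 2, "x2": i % 3, "y": i % 2} for i in range(1, 11)]

train, test = split_train_test(rows, outcome_col="y")
test_with_usage = assign_usage(test)
train_csv, test_csv = to_csv(train), to_csv(test)
```

Because the rows are shuffled before the split, labelling the first half of the test records as Public still gives a random public/private assignment.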

Step #2: open a Kaggle InClass account and start a competition using the wizard. Filling in the Basic Details and Entry & Rules pages is straightforward.

Step #3: The tricky page is Your Data. Here you'll need to follow this sequence in order to get it working:

  1. Choose the evaluation metric to be used in the competition. Kaggle has a bunch of different metrics to choose from. In my two Kaggle contests, I actually wanted a metric that was not on the list, and voila! the support team was able to help by activating, for my competition, a metric that was not generally available. Last year I used a lift-type measure. This year it is an average-cost-per-observation metric for a binary classification task. In short, if you don't find exactly what you're looking for, it is worth asking the folks at Kaggle.
  2. After the evaluation metric is set, upload a solution file (CSV format). This file should include only an ID column (with the IDs for all the records that participants should score), and the outcome column(s). If you include any other columns, you'll get error messages. The first row of your file should include the names of these columns.
  3. After you've uploaded a solution file, you'll be able to see whether the upload was successful. Aside from any error messages, you can see your uploaded files: scroll to the bottom and you'll see the file that you submitted or, if you submitted multiple times, all the submitted files. If you selected a random public/private partition, the "derived solution" file will include an extra column with labels "public" and "private". It's a good idea to download this file, so that you can later compare your results with the system's.
  4. After the solution file has been successfully uploaded and its columns mapped, you must upload a "sample submission file". This file is used to map the columns in the solution file to what needs to be measured by Kaggle. The file should include an ID column like that in the solution file, plus a column with the predictions. Nothing more, nothing less. Again, the first row should include the column names. You'll have an option to define rules about allowed values for these columns.
  5. After successfully submitting the sample submission file, you will be able to test the system by submitting (mock) solutions in the "submission playground". One good test is using the naive rule (in a classification task, submit all 0s or all 1s). Compare your result to the one on Kaggle to make sure everything is set up properly.
  6. Finally, in the "Additional data files" you upload the two data files: the training dataset (which includes the ID, input and output columns) and the test dataset (which includes the ID and input columns). It is also useful to upload a third file, which contains a sample valid submission. This will help participants see what their file should look like, and they can also try submitting this file to see how the system works. You can use the naive-rule submission file that you created earlier to test the system.
  7. That's it! The rest (Documentation, Preview and Overview) are quite straightforward. After you're done, you'll see a button "submit for review". You can also share the contest with another colleague prior to releasing it. Look for "Share this competition wizard with a coworker" on the Basic Details page.
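
For the sanity check with the naive rule, it helps to compute the expected score yourself before comparing it with Kaggle's. Below is a minimal Python sketch for an average-cost-per-observation metric in a binary classification task. The column names (`ID`, `Outcome`, `Prediction`), the cost values, and the tiny solution table are all hypothetical, and this is my own simplified reading of such a metric, not Kaggle's implementation.

```python
def naive_submission(ids, predicted_class):
    """Build a naive-rule submission: an ID column plus one prediction column."""
    return [{"ID": i, "Prediction": predicted_class} for i in ids]


def average_cost(solution, submission, cost_fp=1.0, cost_fn=5.0):
    """Average misclassification cost per observation.

    cost_fp is charged for predicting 1 when the outcome is 0,
    cost_fn for predicting 0 when the outcome is 1 (hypothetical costs).
    """
    actual = {row["ID"]: row["Outcome"] for row in solution}
    total = 0.0
    for row in submission:
        y, prediction = actual[row["ID"]], row["Prediction"]
        if prediction == 1 and y == 0:
            total += cost_fp
        elif prediction == 0 and y == 1:
            total += cost_fn
    return total / len(submission)


# Hypothetical solution table: only the ID and outcome columns.
solution = [
    {"ID": 1, "Outcome": 0},
    {"ID": 2, "Outcome": 0},
    {"ID": 3, "Outcome": 1},
    {"ID": 4, "Outcome": 0},
    {"ID": 5, "Outcome": 1},
]
ids = [row["ID"] for row in solution]

cost_all_zeros = average_cost(solution, naive_submission(ids, 0))
cost_all_ones = average_cost(solution, naive_submission(ids, 1))
```

If the score that Kaggle reports for the same naive submission differs from your own calculation, the column mapping or the metric settings are the first things to recheck.
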
If I've missed tips or tricks that others have used, please do share. My current competition, "predicting cab booking cancellation" (using real data from YourCabs in Bangalore) has just started, and it will be open not only to our students, but to the world. 
Submission deadline: Midnight Dec 22, 2013, India Standard Time. All welcome!