Wednesday, April 02, 2014

Parallel coordinate plot in Tableau: a workaround

The parallel coordinate plot is useful for visualizing multivariate data in a disaggregated way, where we have multiple numerical measurements for each record. A scatter plot displays two measurements for each record by using the two axes. A parallel coordinate plot can display many measurements for each record by using many (parallel) axes - one for each measurement.

While not as popular as other charts, it sometimes turns out to be useful, so it's good to have it in the visualization toolkit. Software packages such as TIBCO Spotfire and XLMiner include the parallel coordinate plot, and there's even a free Excel add-on. Yet surprisingly, Tableau 8.1 still does not include this chart among its many neat options. After searching and not finding a straightforward answer, I found a simple workaround for creating a parallel coordinate plot in Tableau. For all those in search - here's the solution:

First, make sure to scale each measurement to a 0%-100% scale if it isn't already in that form (or, if all the measurements are on the same scale, skip this step). In Tableau, scaling can be done by creating a new measure (Analysis > Create Calculated Field). The following formula creates a new scaled measure by subtracting the MIN and dividing by the range, for a variable named Current Cost:

100*(AVG([Current Cost])-TOTAL(MIN([Current Cost]))) /
(TOTAL(MAX([Current Cost]))-TOTAL(MIN([Current Cost])))

Here's what it looks like in the Calculated Field menu:
First scale each measurement to a 0-100% scale by creating a new calculated field with this formula
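If you prefer to scale the data before loading it into Tableau, the same min-max scaling takes a few lines of Python with pandas. This is just a sketch - the file name, the ID column, and the measure names are hypothetical placeholders:

  import pandas as pd

  # hypothetical data: one row per record (project), one column per measurement
  df = pd.read_csv("projects.csv")
  measures = ["Current Cost", "Planned Cost", "Duration", "Risk Score"]  # made-up names

  # min-max scale each measure to 0-100, mirroring the Tableau calculated field
  scaled = 100 * (df[measures] - df[measures].min()) / (df[measures].max() - df[measures].min())
  scaled["Project ID"] = df["Project ID"]
  scaled.to_csv("projects_scaled.csv", index=False)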
Now create a parallel coordinate plot of the scaled measures:
  1. Start a new worksheet
  2. Drag Measure Names to the Columns shelf. Drag Measure Values to the Rows shelf. 
  3. In the Measure Values shelf you'll see all the measures. Keep only the scaled measures of interest and remove all other measures.
  4. Change the bars to lines by selecting line in the Marks shelf.
  5. Lastly, since we want each line to represent a single record, drag the appropriate Dimension (such as record ID) into the Detail shelf.

Voila! Here's an example of what the chart and the settings look like for four measurements (each line represents a single record, in this case, a project). Of course the beauty is then interacting with this chart: filtering, coloring, etc.
Parallel coordinate plot implemented in Tableau
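For a quick sanity check of the same idea outside Tableau, pandas also has a built-in parallel-coordinates plot. A minimal sketch, reusing the hypothetical scaled file from the scaling step above:

  import pandas as pd
  import matplotlib.pyplot as plt
  from pandas.plotting import parallel_coordinates

  scaled = pd.read_csv("projects_scaled.csv")                   # hypothetical file from the scaling step
  ax = parallel_coordinates(scaled, class_column="Project ID")  # one line per record, one axis per measure
  ax.get_legend().remove()                                      # with one "class" per record, the legend is clutter
  ax.set_ylabel("Scaled value (0-100)")
  plt.show()

Of course, this static version lacks what makes the Tableau chart shine: the filtering and coloring interactions.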
Thirsty for more? Check out the upcoming online course Interactive Data Visualization.

Saturday, March 15, 2014

Can women be professors or doctors? Not according to Jet Airways

I am already used to the comical scene at airports in Asia, where the person holding a "Professor Galit Shmueli" sign sees us walking in his/her direction and right away rushes to my husband. Whether or not the stereotype is based on actual gender statistics of professors in Asia is a good question.

What I don't find amusing is when a corporate like Jet Airways, under the guise of "celebrating International Women's Day", follows the same stereotype. When I tried to book a flight on Jetairways.com, it would not allow me to use the Women's Day discount code if I chose the title "Prof" or "Dr". Only if I chose "Mrs" or "Ms" would it work.

A Professor does not qualify as a woman

So I bowed low and switched the title in the reservation to "Mrs", only to get this error message:


After scratching my head, I realized that I was (unfortunately?) logged into my JetPrivilege account where my title is "Dr" - a detail set at the time of the account creation that I cannot modify online. The workaround that I found was to dissociate the passenger from the account owner, and book for a "Mrs. Galit Shmueli", who obviously cannot be a Professor.
"Conflicting" information

For those who won't tolerate the humiliation of giving up Dr/Prof but are determined to get the discount in principle, a solution is to include another (non-Professor or non-Doctor) "Mrs" or "Ms" in the same booking. Yes, I'm being cynical.

In case you're thinking, "but how will the airline's booking system identify the passenger's gender if you use Prof or Dr?" - I can think of a few easy solutions, such as adding the option "Prof. (Ms.)" or simply asking for the traveler's gender, as is common in train bookings. In short, this goes beyond blaming "technology".

One thing is clear: according to Jet Airways, you just can't have it all - a JetPrivilege account with the title "Prof", flying solo, and availing the Women's Day discount with your JetPrivilege number.

My only consolation is that during the flight I'll be able to enjoy "audio tracks from such leading female international artists as Beyonce Knowles, Lady Gaga, Jennifer Hudson, Taylor Swift, Kelly Clarkson and Rihanna on the airline's award-winning in-flight entertainment system." Luckily, Jet Airways doesn't include "artist" as a title.

Thursday, March 06, 2014

The use of dummy variables in predictive algorithms

Anyone who has taken a statistics course that covers linear regression has heard some version of the rule about pre-processing categorical predictors with more than two categories and the need to convert them into binary dummy/indicator variables:
"If a variable has k levels, you can create only k-1 indicators. You have to choose one of the k categories as a "baseline" and leave out its indicator." (from Business Statistics by Sharpe, De Veaux & Velleman)
Technically, one can easily create k dummy variables for k categories in any software. The reason for not including all k dummies as predictors in a linear regression is to avoid perfect multicollinearity, where an exact linear relationship exists between the k predictors. Perfect multicollinearity causes computational and interpretation challenges (see slide #6). This k-dummies issue is also called the Dummy Variable Trap.
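For illustration, here's what the two options look like in Python with pandas (a sketch; the tiny example data are made up):

  import pandas as pd

  df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Jan", "Feb"]})

  # all k dummies - fine for methods that don't combine predictors linearly (e.g., trees)
  print(pd.get_dummies(df, columns=["month"]))

  # k-1 dummies (one baseline category dropped) - what linear and logistic regression need
  print(pd.get_dummies(df, columns=["month"], drop_first=True))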

While these guidelines are required for linear regression, which other predictive models require them? The k-1 dummy rule applies to models where all the predictors are considered together, as a linear combination. Therefore, in addition to linear regression models, the rule would apply to logistic regression models, discriminant analysis, and in some cases to neural networks.

What happens if we use k-1 dummies in other predictive models? 
The choice of which dummy variable to drop does not affect the results of regression models, but it can affect other methods. For instance, consider a classification/regression tree. In a tree, predictors are evaluated one at a time, and therefore omitting one of the k dummies can result in an inferior predictive model. For example, suppose we have 12 monthly dummies and that in reality only January is different from the other months (the outcome differs between January and all other months). Now we run a tree omitting the January dummy as an input and keeping the other 11 monthly dummies. The only way the tree can discover the January effect is by splitting on each of the other 11 dummies, one level after another. This is much less efficient than a single split on the January dummy, as the small simulation below illustrates.
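Here is a simulated sketch of that point (the data are made up): a regression tree given all 12 dummies finds the January effect in a single split, while the same tree with the January dummy dropped has to chain splits on the other month dummies to isolate it.

  import numpy as np
  import pandas as pd
  from sklearn.tree import DecisionTreeRegressor

  # simulated data: 100 records per month; the outcome differs only in January
  months = np.repeat(np.arange(1, 13), 100)
  y = np.where(months == 1, 10.0, 0.0)

  dummies = pd.get_dummies(pd.Series(months, name="month"))   # all 12 monthly dummies
  no_january = dummies.drop(columns=[1])                      # k-1 dummies, January omitted

  for label, X in [("all 12 dummies", dummies), ("January dummy dropped", no_january)]:
      tree = DecisionTreeRegressor().fit(X, y)
      print(label, "-> tree depth:", tree.get_depth())
  # with all 12 dummies the tree isolates January in one split (depth 1);
  # without the January dummy it needs a chain of splits, roughly one per remaining month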

This post is inspired by a discussion in the recent Predictive Analytics 1 online course. This topic deserves more than a short post, yet I haven't seen a thorough discussion anywhere.

Thursday, November 28, 2013

Running a data mining contest on Kaggle

Following last year's success, I've decided once again to introduce a data mining contest in my Business Analytics using Data Mining course at the Indian School of Business. Last year I used two platforms: CrowdAnalytix and Kaggle. This year I am again using Kaggle, which offers free competition hosting for university instructors, called Kaggle InClass.

Setting up a competition on Kaggle is not trivial, and I'd like to share some tips I discovered to help fellow instructors. Even if you successfully hosted a Kaggle contest a while ago, some things have changed (as I discovered). With some assistance from the Kaggle support team, who are extremely helpful, I was able to decipher the process. So here goes:

Step #1: get your dataset into the right structure. Your initial dataset should include input and output columns for all records (assuming that the goal is to predict the outcome from the inputs). It should also include an ID column with running index numbers.

  • Save this as an Excel or CSV file. 
  • Split the records into two datasets: a training set and a test set. 
  • Keep the training and test datasets in separate CSV files. For the test set, remove the outcome column(s).
  • Kaggle will split the test set into private and public subsets and score each of them separately. Results for the public records will appear on the leaderboard; only you will see the results for the private subset. If you want to assign the records to public/private yourself, create a column Usage in the test dataset and type Private or Public for each record. (A minimal scripting sketch of these preparation steps follows this list.)
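Here is that sketch in Python (the file names and the id/outcome column names are hypothetical placeholders - adapt them to your data):

  import pandas as pd

  full = pd.read_csv("full_dataset.csv")          # ID column, input columns, and outcome column

  train = full.sample(frac=0.6, random_state=1)   # the 60/40 split here is arbitrary
  test = full.drop(train.index)

  train.to_csv("train.csv", index=False)                           # inputs + outcome
  test.drop(columns=["outcome"]).to_csv("test.csv", index=False)   # inputs only, for participants

  # solution file: ID plus the true outcome for every test record
  # (add a Usage column with "Public"/"Private" values if you want to fix the split yourself,
  #  as described above)
  test[["id", "outcome"]].to_csv("solution.csv", index=False)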

Step #2: open a Kaggle InClass account and start a competition using the wizard. Filling in the Basic Details and Entry & Rules pages is straightforward.

Step #3: The tricky page is Your Data. Here you'll need to follow this sequence to get everything working:

  1. Choose the evaluation metric to be used in the competition. Kaggle has a bunch of different metrics to choose from. In both of my Kaggle contests I actually wanted a metric that was not on the list, and voila - the support team activated a metric that was not generally available for my competition. Last year I used a lift-type measure; this year it is an average-cost-per-observation metric for a binary classification task. In short, if you don't find exactly what you're looking for, it is worth asking the folks at Kaggle.
  2. After the evaluation metric is set, upload a solution file (CSV format). This file should include only an ID column (with the IDs of all the records that participants should score) and the outcome column(s). If you include any other columns, you'll get error messages. The first row of the file should include the names of these columns.
  3. After you've uploaded a solution file, you'll be able to see whether the upload was successful. Aside from any error messages, you can see your uploaded files: scroll to the bottom and you'll see the file you submitted (or, if you submitted multiple times, all the submitted files). If you selected a random public/private partition, the "derived solution" file will include an extra column with the labels "public" and "private". It's a good idea to download this file so that you can later compare your results with the system's.
  4. After the solution file has been successfully uploaded and its columns mapped, you must upload a "sample submission file". This file is used to map the columns in the solution file to what Kaggle needs to measure. The file should include an ID column like the one in the solution file, plus a column with the predictions - nothing more, nothing less. Again, the first row should include the column names. You'll have the option to define rules about the allowed values for these columns.
  5. After successfully submitting the sample submission file, you will be able to test the system by submitting (mock) solutions in the "submission playground". One good test is the naive rule (in a classification task, submit all 0s or all 1s). Compare your result to the one reported by Kaggle to make sure everything is set up properly. (A sketch for creating such a naive submission and checking the metric locally appears after this list.)
  6. Finally, in the "Additional data files" you upload the two data files: the training dataset (which includes the ID, input and output columns) and the test dataset (which includes the ID and input columns). It is also useful to upload a third file, which contains a sample valid submission. This will help participants see what their file should look like, and they can also try submitting this file to see how the system works. You can use the naive-rule submission file that you created earlier to test the system.
  7. That's it! The rest (Documentation, Preview and Overview) is quite straightforward. After you're done, you'll see a "submit for review" button. You can also share the contest with a colleague prior to releasing it - look for "Share this competition wizard with a coworker" on the Basic Details page.
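For the playground test in step 5, a naive-rule submission and a rough local check of the metric take only a few lines. A sketch: the column names and cost values are made up, and the average-cost calculation here only approximates whatever Kaggle actually computes:

  import pandas as pd

  test = pd.read_csv("test.csv")
  solution = pd.read_csv("solution.csv")

  # naive-rule submission: predict 0 for every test record (all 0s)
  naive = pd.DataFrame({"id": test["id"], "prediction": 0})
  naive.to_csv("naive_submission.csv", index=False)

  # local check of an average-cost-per-observation metric (illustrative cost values)
  merged = solution.merge(naive, on="id")
  cost_fn, cost_fp = 10.0, 1.0    # hypothetical costs of a missed positive / a false alarm
  avg_cost = (cost_fn * ((merged["outcome"] == 1) & (merged["prediction"] == 0)) +
              cost_fp * ((merged["outcome"] == 0) & (merged["prediction"] == 1))).mean()
  print("average cost per observation:", avg_cost)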
If I've missed tips or tricks that others have used, please do share. My current competition, "predicting cab booking cancellation" (using real data from YourCabs in Bangalore), has just started, and it will be open not only to our students but to the world.
Submission deadline: Midnight Dec 22, 2013, India Standard Time. All welcome!

Thursday, November 21, 2013

The Scientific Value of Testing Predictive Performance

This week's NY Times article Risk Calculator for Cholesterol Appears Flawed and CNN article Does calculator overstate heart attack risk? illustrate the power of evaluating the predictive performance of a model for purposes of validating the underlying theory.

The NYT article describes findings by two Harvard Medical School professors, Ridker and Cook, about extreme over-estimation of the 10-year risk of a heart-attack or stroke when using a calculator released by the American Heart Association and the American College of Cardiology.
"According to the new guidelines, if a person's risk is above 7.5%, he or she should be put on a statin." (CNN article)
Over-estimation in this case is likely to lead to over-prescription of therapies such as cholesterol-lowering statin drugs, not to mention the psychological effect of being classified as high risk for a heart-attack or stroke.

How was this over-prediction discovered? 
"Dr. Ridker and Dr. Cook evaluated [the calculator] using three large studies that involved thousands of people and continued for at least a decade. They knew the subjects’ characteristics at the start — their ages, whether they smoked, their cholesterol levels, their blood pressures. Then they asked how many had heart attacks or strokes in the next 10 years and how many would the risk calculator predict."
In other words, the "model" (=calculator) was deployed to a large labeled dataset, and the actual and predicted rates of heart attacks were compared. This is the classic "holdout set" approach. The results are nicely shown in the article's chart, overlaying the actual and predicted values in histograms of risk:

Chart from NY Times article 
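This kind of holdout evaluation is easy to reproduce on any labeled dataset: score the records with the model, then compare predicted and observed event rates within risk groups. Here is a sketch (the column names and bin boundaries are made up; this is not the actual calculator or the studies' data):

  import pandas as pd

  # hypothetical holdout data: 'predicted_risk' (0-1, from the model) and 'event' (0/1, observed)
  holdout = pd.read_csv("holdout.csv")

  # bin records by predicted 10-year risk and compare predicted vs. observed event rates
  holdout["risk_bin"] = pd.cut(holdout["predicted_risk"], bins=[0, .05, .075, .10, .15, .20, 1.0])
  calibration = holdout.groupby("risk_bin", observed=True).agg(
      predicted_rate=("predicted_risk", "mean"),
      observed_rate=("event", "mean"),
      n=("event", "size"),
  )
  print(calibration)   # large gaps between the two rate columns signal over- or under-prediction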

Beyond the practical usefulness of detecting the flaw in the calculators, evaluating predictive performance tells us something about the underlying model. A next natural question is "why?", or how was the calculator/model built?

The NYT article quotes Dr. Smith, a professor of medicine at the University of North Carolina and a past president of the American Heart Association:
“a lot of people put a lot of thought into how can we identify people who can benefit from therapy... What we have come forward with represents the best efforts of people who have been working for five years.”
Although this statement seems to imply that the guidelines are based on an informal qualitative integration of domain knowledge and experience, I am guessing (and hoping) that there is a sound data-based model behind the scenes. The fact that the calculator uses very few and coarse predictors makes me suspicious that the model was not designed or optimized for "personalized medicine".

One reason mentioned for the extreme over-prediction of this model on the three studies' data is the difference between the population used to "train the calculator" (i.e., generate the guidelines) and the population in the evaluation studies, in terms of the relationship between heart attacks/strokes and the risk factors:
"The problem might have stemmed from the fact that the calculator uses as reference points data collected more than a decade ago, when more people smoked and had strokes and heart attacks earlier in life. For example, the guideline makers used data from studies in the 1990s to determine how various risk factors like cholesterol levels and blood pressure led to actual heart attacks and strokes over a decade of observation.
But people have changed in the past few decades, Dr. Blaha said. Among other things, there is no longer such a big gap between women’s risks and those of men at a given age. And people get heart attacks and strokes at older ages."
In predictive analytics, we know that the biggest and sneakiest danger to predictive power arises when the training data and conditions differ from the data and conditions at the time of model deployment. While there is no magic bullet, a few principles and strategies can help: first, awareness of this weakness; second, monitoring and evaluating predictive power in different scenarios (robustness/sensitivity analysis) and over time; third, re-training models over time as new data arrive.

Evaluating predictive power is a very powerful tool. We can learn not only about actual predictive power, but also get clues as to the strengths and weaknesses of the underlying model.

Tuesday, November 05, 2013

A Tale of Two (Business Analytics) Courses

I have been teaching two business analytics elective MBA-level courses at ISB. One is called "Business Analytics Using Data Mining" (BADM) and the other, "Forecasting Analytics" (FCAS). Although we share the syllabi for both courses, I often receive the following question, in one variant or another:
What is the difference between the two courses?
The short answer is: BADM is focused on analyzing cross-sectional data, while FCAS is focused on time series data. This answer clarifies the issue to data miners and statisticians, but sometimes leaves aspiring data analytics students perplexed. So let me elaborate.

What is the difference between cross-sectional data and time series data?
Think photography. Cross-sectional data are like a snapshot in time. We might have a large dataset on a set of customers, with their demographic information and their transactional information summarized in some form (e.g., number of visits thus far). Another example is a transactional dataset, with information on each transaction, perhaps including a flag of whether it was fraudulent. A third is movie ratings on an online movie rental website. You have probably encountered multiple examples of such datasets in the Statistics course. BADM introduces methods that use cross-sectional data for predicting the outcomes of new records. In contrast, time series data are like a video, where you collect data over time. Our focus will be on approaches and methods for forecasting a series into the future. Data examples include daily traffic, weekly demand, monthly disease outbreaks, and so forth.
How are the courses similar?
The two courses are similar in terms of flavor and focus: they both introduce the notion of business analytics, where you identify business opportunities and challenges that can potentially be tackled with data mining or statistical tools. They are both technical courses, not in the mathematical sense, but rather in the sense that we do hands-on work (and a team project) with real data, learning and applying different techniques, and experiencing the entire process from business problem definition to deployment back into the business environment.
In both courses, a team project is pivotal. Teams use real data to tackle a potentially real business problem/opportunity. You can browse presentations and reports from previous years to get an idea. We also use the same software packages in both courses: XLMiner and TIBCO Spotfire. For those on the Hyderabad campus, BADM and FCAS students will see the same instructor in both courses this year (yes, that's me).
How do the courses differ in terms of delivery?
Since last year, I have "flipped" BADM and turned it into a MOOC-style course. This means that students are expected to do some work online before each class, so that in class we can focus on hands-on data mining, higher level discussions, and more. The online component will also be open to the larger community, where students can interact with alumni and others interested in analytics. FCAS is still offered in the more traditional lecture-style mode.
Is there overlap between the courses?
While the two courses share the data mining flavor and the general business analytics approaches, they have very little overlap in terms of methods, and even then, the implementations are different. For example, while we use linear regression in both cases, it is used in different ways when predicting with cross-sectional data vs. forecasting with time series.
So which course should I take? Should I take both?
Being completely biased, it's difficult for me to tell you not to take either of these courses. However, I will say that these courses require a large investment of time and effort. If you are taking other heavy courses this term, you might want to stick with only one of BADM or FCAS. Taking both courses will give you a stronger and broader skill set in data analytics, so for those interested in working in the business analytics field, I'd suggest taking both. Finally, if you register for FCAS only, you'll still be able to join the online component of BADM without registering for the course. Although it's not as extensive as taking the course, you'll be able to get a glimpse of data mining with cross-sectional data.
Finally, a historical note: when I taught a similar course at the University of Maryland (in 2004-2010), it was a 14-week, semester-long course. That course was mostly focused on cross-sectional methods, but I included a chunk on forecasting, so it was a mix. However, the separation into two dedicated courses is more coherent, provides more depth, does more justice to these extremely useful methods and approaches, and allows first-hand experience with the different types of data structures commonly encountered in any organization.

Thursday, August 15, 2013

Designing a Business Analytics program, Part 3: Structure

This post continues two earlier posts (Part 1: Intro and Part 2: Content) on Designing a Business Analytics (BA) program. This part focuses on the structure of a BA program, and especially course structure.

In the program that I designed, each of the 16 courses combines on-ground sessions with online components. Importantly, the opening and closing of a course should be on-ground.

The hybrid online/on-ground design is intended to accommodate participants who cannot take long periods of time off to attend campus. Yet even in a residential program, a hybrid structure can be more effective if it is properly implemented, because a hybrid model more closely resembles the real-world functioning of an analyst. At the start and end of a project, close communication is needed with the domain experts and stakeholders to assure that everyone is clear about the goals and the implications. In between these touch points, the analytics group works "offline" (building models, evaluating, testing, going back and forth) while communicating within the group and from time to time with the domain people.

A hybrid "sandwich" BA program can be set up to mimic this process:
  • The on-ground sessions at the start and end of each course help set the stage and expectations, build communication channels between the instructor and participants as well as among participants; at the close of a course, participants present their work and receive peer and instructor feedback.
  • The online components guide participants (and teams of participants) through the skill development and knowledge acquisition that the course aims for. Working through a live project, participants can acquire the needed knowledge (1) via lecture videos, textbook readings, case studies and articles, software tutorials and more, (2) via self-assessments and small deliverables that build up the needed proficiency, and (3) via a live online discussion board where participants are required to ask, answer, discuss and share experiences, challenges and discoveries. If designing and implementing the online component is beyond the institution's capabilities, it is possible to integrate existing successful online courses, such as those offered on Statistics.com or on Coursera, edX and other established online course providers.
For example, in a Predictive Analytics course, a major component is a team project with real data, solving a potentially real problem. The on-ground sessions would focus on translating a business problem into an analytics problem and on setting the stage and expectations for the process the teams will go through. Teams would submit proposals and discuss them with the instructor to assure feasibility and determine the way forward. The online components would include short lecture videos, textbook readings, short individual assignments to master software and technique, and a vibrant online discussion board with topics at different technical and business levels (this is similar to my semi-MOOC course Business Analytics Using Data Mining). In the closing on-ground sessions, teams present their work to the entire group and discuss challenges and insights; each team might also meet with the instructor to receive feedback and do a second round of improvement. Finally, an integrative session would provide closure and linkage to other courses.