Thursday, November 28, 2013

Running a data mining contest on Kaggle

Following last year's success, I've decided once again to run a data mining contest in my Business Analytics Using Data Mining course at the Indian School of Business. Last year, I used two platforms: CrowdAnalytix and Kaggle. This year I am again using Kaggle. They offer free competition hosting for university instructors, called Kaggle InClass.

Setting up a competition on Kaggle is not trivial, and I'd like to share some tips I discovered to help colleagues who plan to do the same. Even if you successfully hosted a Kaggle contest a while ago, some things have changed (as I've discovered). With some assistance from the Kaggle support team, who are extremely helpful, I was able to decipher the process. So here goes:

Step #1: get your dataset into the right structure. Your initial dataset should include input and output columns for all records (assuming that the goal is to predict the outcome from the inputs). It should also include an ID column with running index numbers.

  • Save this as an Excel or CSV file. 
  • Split the records into two datasets: a training set and a test set. 
  • Keep the training and test datasets in separate CSV files. For the test set, remove the outcome column(s).
  • Kaggle will split the test set into public and private subsets and score each of them separately. Results for the public records will appear on the leaderboard; only you will see the results for the private subset. If you want to assign the records to public/private yourself, create a column Usage in the test dataset and type Private or Public for each record. (A minimal data-preparation sketch follows this list.)
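For concreteness, here is a minimal sketch of this preparation step in Python/pandas. The file and column names ("full_data.csv", "id", "outcome") are hypothetical placeholders for your own, and the 70/30 split is just an example:

```python
# Minimal data-preparation sketch (pandas). File and column names
# ("full_data.csv", "id", "outcome") are placeholders, not Kaggle requirements.
import pandas as pd

full = pd.read_csv("full_data.csv")                    # all records: inputs + outcome
full.insert(0, "id", list(range(1, len(full) + 1)))    # running index as the ID column

# Split into training and test sets (70/30 here, just as an example)
train = full.sample(frac=0.7, random_state=1)
test = full.drop(train.index)

train.to_csv("train.csv", index=False)

# The test file given to participants must NOT include the outcome column
test.drop(columns=["outcome"]).to_csv("test.csv", index=False)

# Keep the test outcomes aside; this becomes the solution file in Step #3
test[["id", "outcome"]].to_csv("solution.csv", index=False)
```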

Step #2: open a Kaggle InClass account and start a competition using the wizard. Filling in the Basic Details and Entry & Rules pages is straightforward.

Step #3: The tricky page is Your Data. Here you'll need to follow this sequence to get things working:

  1. Choose the evaluation metric to be used in the competition. Kaggle has a bunch of different metrics to choose from. In my two Kaggle contests, I actually wanted a metric that was not on the list, and voila! the support team activated it for my competition even though it is not generally available. Last year I used a lift-type measure; this year it is an average-cost-per-observation metric for a binary classification task. In short, if you don't find exactly what you're looking for, it is worth asking the folks at Kaggle.
  2. After the evaluation metric is set, upload a solution file (CSV format). This file should include only an ID column (with the IDs of all the records that participants should score) and the outcome column(s). If you include any other columns, you'll get error messages. The first row of your file should include the names of these columns. (A short scripting sketch covering this and the following items appears after this list.)
  3. After you've uploaded the solution file, you'll see whether the upload was successful. Aside from error messages, you can view your uploaded files: scroll to the bottom and you'll see the file you submitted (or all of them, if you submitted multiple times). If you selected a random public/private partition, the "derived solution" file will include an extra column with the labels "public" and "private". It's a good idea to download this file so that you can later compare your results with the system's.
  4. After the solution file has been successfully uploaded and its columns mapped, you must upload a "sample submission file". This file is used to map the columns in the solution file to what needs to be measured by Kaggle. The file should include an ID column like the one in the solution file, plus a column with the predictions. Nothing more, nothing less. Again, the first row should include the column names. You'll have an option to define rules about allowed values for these columns.
  5. After successfully submitting the sample submission file, you will be able to test the system by submitting (mock) solutions in the "submission playground". One good test is using the naive rule (in a classification task, submit all 0s or all 1s). Compare your result to the one on Kaggle to make sure everything is set up properly.
  6. Finally, in the "Additional data files" section you upload the two data files: the training dataset (which includes the ID, input, and output columns) and the test dataset (which includes the ID and input columns). It is also useful to upload a third file containing a sample valid submission, so that participants can see what their file should look like and try submitting it to see how the system works. The naive-rule submission file that you created earlier works well for this purpose.
  7. That's it! The remaining pages (Documentation, Preview and Overview) are quite straightforward. After you're done, you'll see a "submit for review" button. You can also share the contest with a colleague prior to releasing it: look for "Share this competition wizard with a coworker" on the Basic Details page.
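To tie together items 2-5 above, here is a minimal sketch that continues from the data-preparation sketch in Step #1. It creates the sample submission file, a naive-rule submission for the playground, and a local sanity check of an average-cost-per-observation metric. The prediction column name and the cost values below are made-up placeholders; the metric that Kaggle activated for my competition has its own cost settings.

```python
# Sketch of items 2-5: sample submission, naive-rule submission, and a local
# check of an average-cost-per-observation metric. The column name "prediction"
# and the costs below are illustrative placeholders.
import pandas as pd

solution = pd.read_csv("solution.csv")           # id + true outcome, from Step #1

# Sample submission file: the ID column plus one prediction column, nothing more
sample_sub = pd.DataFrame({"id": solution["id"], "prediction": 0})
sample_sub.to_csv("sample_submission.csv", index=False)

# Naive-rule submission for the submission playground: predict 0 for everyone
sample_sub.to_csv("naive_submission.csv", index=False)

# Local sanity check of an average-cost-per-observation metric for a binary
# outcome, assuming (hypothetically) that a false negative costs 10, a false
# positive costs 1, and correct classifications cost 0.
def average_cost(actual, predicted, fn_cost=10, fp_cost=1):
    fn = (actual == 1) & (predicted == 0)
    fp = (actual == 0) & (predicted == 1)
    return (fn * fn_cost + fp * fp_cost).mean()

print(average_cost(solution["outcome"], sample_sub["prediction"]))
```

Comparing the number you compute locally with the score Kaggle reports for the same submission is a quick way to confirm that the metric and the public/private split behave as you expect.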
If I've missed tips or tricks that others have used, please do share. My current competition, "predicting cab booking cancellation" (using real data from YourCabs in Bangalore), has just started, and it will be open not only to our students, but to the world. 
Submission deadline: Midnight Dec 22, 2013, India Standard Time. All welcome!

Thursday, November 21, 2013

The Scientific Value of Testing Predictive Performance

This week's NY Times article Risk Calculator for Cholesterol Appears Flawed and CNN article Does calculator overstate heart attack risk? illustrate the power of evaluating a model's predictive performance as a way to validate the underlying theory.

The NYT article describes findings by two Harvard Medical School professors, Ridker and Cook, about extreme over-estimation of the 10-year risk of a heart-attack or stroke when using a calculator released by the American Heart Association and the American College of Cardiology.
"According to the new guidelines, if a person's risk is above 7.5%, he or she should be put on a statin." (CNN article)
Over-estimation in this case is likely to lead to over-prescription of therapies such as cholesterol-lowering statin drugs, not to mention the psychological effect of being classified as high risk for a heart-attack or stroke.

How was this over-prediction discovered? 
"Dr. Ridker and Dr. Cook evaluated [the calculator] using three large studies that involved thousands of people and continued for at least a decade. They knew the subjects’ characteristics at the start — their ages, whether they smoked, their cholesterol levels, their blood pressures. Then they asked how many had heart attacks or strokes in the next 10 years and how many would the risk calculator predict."
In other words, the "model" (=calculator) was deployed to a large labeled dataset, and the actual and predicted rates of heart attacks were compared. This is the classic "holdout set" approach. The results are nicely shown in the article's chart, overlaying the actual and predicted values in histograms of risk:

Chart from NY Times article 
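This kind of holdout comparison is easy to reproduce for any model. Here is a minimal sketch in Python/pandas, assuming a holdout file with a column of predicted 10-year risks and a column of observed outcomes (the file name, column names, and bin cutoffs are all hypothetical); it compares the average predicted risk with the actual event rate within bins of predicted risk, which is essentially what the chart shows:

```python
# Sketch of a holdout-style comparison: predicted vs. actual event rates by
# risk bin. File name, column names, and bin cutoffs are hypothetical.
import pandas as pd

holdout = pd.read_csv("holdout.csv")   # one row per subject: predicted_risk, had_event

# Bin subjects by predicted 10-year risk (note the 7.5% treatment threshold)
bins = [0, 0.05, 0.075, 0.10, 0.15, 1.0]
holdout["risk_bin"] = pd.cut(holdout["predicted_risk"], bins)

# Within each bin, compare the mean predicted risk to the observed event rate
comparison = holdout.groupby("risk_bin", observed=True).agg(
    mean_predicted=("predicted_risk", "mean"),
    actual_rate=("had_event", "mean"),
    subjects=("had_event", "size"),
)
print(comparison)
```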

Beyond the practical usefulness of detecting the flaw in the calculator, evaluating predictive performance tells us something about the underlying model. The natural next question is "why?", or how was the calculator/model built?

The NYT article quotes Dr. Smith, a professor of medicine at the University of North Carolina and a past president of the American Heart Association:
“a lot of people put a lot of thought into how can we identify people who can benefit from therapy... What we have come forward with represents the best efforts of people who have been working for five years.”
Although this statement seems to imply that the guidelines are based on an informal qualitative integration of domain knowledge and experience, I am guessing (and hoping) that there is a sound data-based model behind the scenes. The fact that the calculator uses very few and coarse predictors makes me suspicious that the model was not designed or optimized for "personalized medicine".

One reason mentioned for the extreme over-prediction of this model on the three studies' data is the difference between the population used to "train the calculator" (generate the guidelines) and the population in the evaluation studies, in terms of the relationship between heart-attacks/strokes and the risk factors:
"The problem might have stemmed from the fact that the calculator uses as reference points data collected more than a decade ago, when more people smoked and had strokes and heart attacks earlier in life. For example, the guideline makers used data from studies in the 1990s to determine how various risk factors like cholesterol levels and blood pressure led to actual heart attacks and strokes over a decade of observation.
But people have changed in the past few decades, Dr. Blaha said. Among other things, there is no longer such a big gap between women’s risks and those of men at a given age. And people get heart attacks and strokes at older ages."
In predictive analytics, we know that the biggest and sneakiest danger to predictive power arises when the training data and conditions differ from the data and conditions at the time of model deployment. While there is no magic bullet, there are some principles and strategies that can help: first, awareness of this weakness; second, monitoring and evaluating predictive power in different scenarios (robustness/sensitivity analysis) and over time; third, re-training models over time as new data arrive.
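As a small illustration of the second and third strategies, here is a sketch (with hypothetical file and column names) of tracking a deployed model's error on each new batch of labeled records as it arrives; a sustained drift in this curve is the signal that re-training is due:

```python
# Sketch: monitoring predictive performance over time on newly labeled records.
# File and column names ("scored_records.csv", "period", "actual", "predicted")
# are hypothetical placeholders.
import pandas as pd

scored = pd.read_csv("scored_records.csv")   # deployed predictions + eventual outcomes

# Mean absolute error per time period; a sustained rise suggests the training
# conditions no longer match deployment conditions and re-training is needed.
scored["abs_error"] = (scored["actual"] - scored["predicted"]).abs()
performance_over_time = scored.groupby("period")["abs_error"].mean()
print(performance_over_time)
```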

Evaluating predictive power is a very powerful tool. We can learn not only about actual predictive power, but also get clues as to the strengths and weaknesses of the underlying model.

Tuesday, November 05, 2013

A Tale of Two (Business Analytics) Courses

I have been teaching two business analytics elective MBA-level courses at ISB. One is called "Business Analytics Using Data Mining" (BADM) and the other, "Forecasting Analytics" (FCAS). Although we share the syllabi for both courses, I often receive the following question, in one variant or another:
What is the difference between the two courses?
The short answer is: BADM is focused on analyzing cross-sectional data, while FCAS is focused on time series data. This answer clarifies the issue to data miners and statisticians, but sometimes leaves aspiring data analytics students perplexed. So let me elaborate.

What is the difference between cross-sectional data and time series data?
Think photography. Cross-sectional data are like a snapshot in time. We might have a large dataset on a large set of customers, with their demographic information and their transactional information summarized in some form (e.g., number of visits thus far). Another example is a transactional dataset, with information on each transaction, perhaps including a flag indicating whether it was fraudulent. A third is movie ratings on an online movie rental website. You have probably encountered multiple examples of such datasets in the Statistics course. BADM introduces methods that use cross-sectional data for predicting the outcomes for new records. In contrast, time series data are like a video, where you collect data over time. Our focus will be on approaches and methods for forecasting a series into the future. Data examples include daily traffic, weekly demand, monthly disease outbreaks, and so forth. 
How are the courses similar?
The two courses are similar in terms of flavor and focus: they both introduce the notion of business analytics, where you identify business opportunities and challenges that can potentially be tackled with data mining or statistical tools. They are both technical courses, not in the mathematical sense, but in the sense that we do hands-on work (and a team project) with real data, learning and applying different techniques, and experiencing the entire process from business problem definition to deployment back into the business environment.
In both courses, a team project is pivotal. Teams use real data to tackle a potentially real business problem/opportunity. You can browse presentations and reports from previous years to get an idea. We also use the same software packages in both courses, called XLMiner and TIBCO Spotfire. For those on the Hyderabad campus, BADM and FCAS students will see the same instructor in both courses this year (yes, that's me).
How do the courses differ in terms of delivery?
Since last year, I have "flipped" BADM and turned it into a MOOC-style course. This means that students are expected to do some work online before each class, so that in class we can focus on hands-on data mining, higher level discussions, and more. The online component will also be open to the larger community, where students can interact with alumni and others interested in analytics. FCAS is still offered in the more traditional lecture-style mode.
Is there overlap between the courses?
While the two courses share the data mining flavor and the general business analytics approaches, they have very little overlap in terms of methods, and even then, the implementations are different. For example, while we use linear regression in both cases, it is used in different ways when predicting with cross-sectional data vs. forecasting with time series.
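To make the contrast concrete, here is a tiny sketch in Python (not the course software, and with made-up numbers): in the cross-sectional case the regression uses each record's input columns, while in the time series case the "inputs" are typically functions of time itself, such as a trend index, seasonal dummies, or lagged values of the series.

```python
# Sketch: the same linear regression tool used differently for cross-sectional
# prediction vs. time series forecasting. All numbers here are made up.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Cross-sectional: predict spending from each customer's input columns
customers = pd.DataFrame({"income": [40, 55, 70, 30, 90],
                          "visits": [2, 5, 8, 1, 12],
                          "spend": [10, 22, 35, 5, 50]})
cs_model = LinearRegression().fit(customers[["income", "visits"]],
                                  customers["spend"])

# Time series: the "inputs" are functions of time itself (here, a linear trend)
demand = pd.Series([120, 135, 150, 160, 175, 190, 205, 215])
t = np.arange(len(demand)).reshape(-1, 1)            # trend index 0, 1, 2, ...
ts_model = LinearRegression().fit(t, demand)

# Forecast the next two periods by extending the trend index into the future
future_t = np.array([[len(demand)], [len(demand) + 1]])
print(ts_model.predict(future_t))
```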
So which course should I take? Should I take both?
Being completely biased, I find it difficult to tell you not to take either of these courses. However, I will say that these courses require a large investment of time and effort. If you are taking other heavy courses this term, you might want to stick with only one of BADM or FCAS. Taking the two courses will give you a stronger and broader skill set in data analytics, so for those interested in working in the business analytics field, I'd suggest taking both. Finally, if you register for FCAS only, you'll still be able to join the online component of BADM without registering for the course. Although it's not as extensive as taking the course, you'll get a glimpse of data mining with cross-sectional data.
Finally, a historical note: when I taught a similar course at the University of Maryland (in 2004-2010), it was a 14-week semester-long course. That course was mostly focused on cross-sectional methods but included a chunk on forecasting, so it was a mix. However, the separation into two dedicated courses is more coherent, gives more depth, does more justice to these extremely useful methods and approaches, and lets students gain first-hand experience with the different types of data structures commonly encountered in any organization.