Showing posts with label data liberation.

Thursday, November 28, 2013

Running a data mining contest on Kaggle

Following last year's success, I've decided once again to run a data mining contest in my Business Analytics using Data Mining course at the Indian School of Business. Last year I used two platforms: CrowdAnalytix and Kaggle. This year I am again using Kaggle, which offers free competition hosting for university instructors under the name Kaggle InClass.

Setting up a competition on Kaggle is not trivial, so I'd like to share some tips I discovered, in the hope of helping fellow colleagues. Even if you successfully hosted a Kaggle contest a while ago, some things have changed (as I discovered). With assistance from the Kaggle support team, who are extremely helpful, I was able to decipher the process. So here goes:

Step #1: get your dataset into the right structure. Your initial dataset should include input and output columns for all records (assuming that the goal is to predict the outcome from the inputs). It should also include an ID column with running index numbers.

  • Save this as an Excel or CSV file. 
  • Split the records into two datasets: a training set and a test set. 
  • Keep the training and test datasets in separate CSV files. For the test set, remove the outcome column(s).
  • Kaggle will split the test set into private and public subsets and score each separately. Results for the public records appear on the leaderboard; only you will see the results for the private subset. If you want to assign records to public/private yourself, create a column Usage in the test dataset and type Private or Public for each record.
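The splitting in Step #1 can be sketched in Python with pandas. Everything here is illustrative: the tiny synthetic dataset, the column names (`outcome`, `Usage`), the file names, and the 30% test fraction are my assumptions, not Kaggle requirements beyond the ID/outcome/Usage conventions described above.

```python
# Sketch of Step #1 with a tiny synthetic dataset; in practice you would
# read your own file. Column names, file names, and sizes are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10
df = pd.DataFrame({
    "x1": rng.random(n),
    "x2": rng.random(n),
    "outcome": rng.integers(0, 2, n),   # binary outcome column
})
df.insert(0, "id", range(1, n + 1))     # ID column with running index numbers

# Hold out roughly 30% of the records as the test set
test_mask = rng.random(n) < 0.3
train = df[~test_mask]
test = df[test_mask].drop(columns=["outcome"])   # outcome removed for participants

train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)

# Solution file: ID plus outcome only; the optional Usage column fixes
# the public/private assignment instead of letting Kaggle randomize it
solution = df.loc[test_mask, ["id", "outcome"]].copy()
solution["Usage"] = np.where(rng.random(len(solution)) < 0.5, "Public", "Private")
solution.to_csv("solution.csv", index=False)
```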

Step #2: open a Kaggle InClass account and start a competition using the wizard. Filling in the Basic Details and Entry & Rules pages is straightforward.

Step #3: The tricky page is Your Data. Here is the sequence you'll need to follow to get it working:

  1. Choose the evaluation metric to be used in the competition. Kaggle has a bunch of different metrics to choose from. In my two Kaggle contests, I actually wanted a metric that was not on the list, and voila! the support team was able to activate, for my competition, a metric that was not generally available. Last year I used a lift-type measure. This year it is an average-cost-per-observation metric for a binary classification task. In short, if you don't find exactly what you're looking for, it is worth asking the folks at Kaggle.
  2. After the evaluation metric is set, upload a solution file (CSV format). This file should include only an ID column (with the IDs for all the records that participants should score), and the outcome column(s). If you include any other columns, you'll get error messages. The first row of your file should include the names of these columns.
  3. After you've uploaded a solution file, you'll see whether the upload was successful. Aside from error messages, you can view your uploaded files: scroll to the bottom to see the file you submitted (or all of them, if you submitted multiple times). If you selected a random public/private partition, the "derived solution" file will include an extra column with labels "public" and "private". It's a good idea to download this file so that you can later compare your results with the system's.
  4. After the solution file has been successfully uploaded and its columns mapped, you must upload a "sample submission file". This file is used to map the columns in the solution file to what Kaggle needs to measure. The file should include an ID column like that in the solution file, plus a column with the predictions: nothing more, nothing less. Again, the first row should include the column names. You'll have an option to define rules about allowed values for these columns.
  5. After successfully submitting the sample submission file, you will be able to test the system by submitting (mock) solutions in the "submission playground". One good test is using the naive rule (in a classification task, submit all 0s or all 1s). Compare your result to the one on Kaggle to make sure everything is set up properly.
  6. Finally, in the "Additional data files" section you upload the two data files: the training dataset (which includes the ID, input, and output columns) and the test dataset (which includes the ID and input columns). It is also useful to upload a third file containing a sample valid submission. This helps participants see what their file should look like, and they can also try submitting it to see how the system works. You can use the naive-rule submission file that you created earlier for this purpose.
  7. That's it! The remaining pages (Documentation, Preview, and Overview) are quite straightforward. After you're done, you'll see a "submit for review" button. You can also share the contest with a colleague before releasing it: look for "Share this competition wizard with a coworker" on the Basic Details page.
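The naive-rule sanity check in step 5 can be sketched as follows. The cost values in the cost matrix and the column names are illustrative assumptions on my part; the actual average-cost-per-observation score depends on the cost matrix configured for your competition.

```python
# Sketch of the step-5 sanity check: score an all-0s naive submission
# locally, then compare with what Kaggle reports for the same file.
# The costs (false negative = 10, false positive = 1) are assumed values.
import pandas as pd

# Stand-in for the solution file you uploaded (ID + outcome)
solution = pd.DataFrame({"id": [1, 2, 3, 4], "outcome": [0, 1, 1, 0]})

# Naive rule: predict 0 for every test record
submission = pd.DataFrame({"id": solution["id"], "prediction": 0})
submission.to_csv("naive_submission.csv", index=False)

# Average cost per observation under the assumed cost matrix
cost_fn, cost_fp = 10, 1
merged = solution.merge(submission, on="id")
cost = (cost_fn * ((merged["outcome"] == 1) & (merged["prediction"] == 0))
        + cost_fp * ((merged["outcome"] == 0) & (merged["prediction"] == 1))).mean()
print(cost)  # compare this number with the score Kaggle shows for the same file
```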
If I've missed tips or tricks that others have used, please do share. My current competition, "predicting cab booking cancellation" (using real data from YourCabs in Bangalore) has just started, and it will be open not only to our students, but to the world. 
Submission deadline: Midnight Dec 22, 2013, India Standard Time. All welcome!

Tuesday, May 14, 2013

An Appeal to Companies: Leave the Data Behind @kaggle @crowdanalytix_q

A while ago I wrote about the wonderful new age of real-data-made-available-for-academic-use through the growing number of data mining contests on platforms such as Kaggle and CrowdANALYTIX. Such data provide excellent examples for courses on data mining that help train the next generation of data scientists, business analysts, and other data-savvy graduates. 

A couple of years later, I am discovering the painful truth that many of these datasets are no longer available. The reason is most likely that the companies who shared the data pulled it out. This unexpected twist is extremely harmful to both academia and industry: instructors and teachers who design and teach data mining courses build course plans, assignments, and projects on the assumption that the data will remain available. And now we have to revise all our materials! What a waste of time, resources, and energy. Obviously, after the first time this happens, I think twice about whether to use contest data for teaching. I will not try to convince you of the waste of faculty time, the loss to students, or even the loss to self-learners who are now all over the globe. I'll wear my b-school professor hat and speak the ROI language: the companies lose. In the short term, all these students will not be competing in their contests. The bigger loss, however, is to the entire business sector in the longer term: I constantly receive urgent requests from businesses and large corporations for graduates trained in business analytics. "We are in desperate need of data-savvy managers! Can you help us find good data scientists?" Yet some of these same companies are pulling their data out of the hands of instructors and students.

I am perplexed as to why a company would pull their data away after the data was publicly available for a long time. It's not the fear of competitors, since the data were already publicly available. It's probably not the contest platform CEOs - they'd love the traffic. Then why? I smell lawyers...

One example is the beautiful Heritage Health contest that ran on Kaggle for 2 years. Now, there is no way to get the data, even if you were a registered contestant in the past (see screenshots). Yet, the healthcare sector has been begging for help in training "healthcare analytics" professionals. 


Data access is no longer available, although it was for 2 years.

I'd like to send out an appeal to Heritage Provider Network and other companies that have shared invaluable data through a publicly-available contest platform. Please consider making the data publicly available forever. We will then be able to integrate it into data analytics courses, textbooks, research, and projects, thereby providing industry with not only insights, but also properly trained students who've learned using real data and real problems.

This time, industry can contribute in a major way to reducing the academia-industry gap. Even if only for their own good.

Tuesday, March 13, 2012

Data liberation via visualization

"Data democratization" movements try to make data, especially government-held data, publicly available and accessible. A growing number of technological initiatives are devoted to this goal, particularly the accessibility part. One such initiative comes from data visualization companies. A recent trend is to offer a free version (or at least free for some period) that is based on sharing your visualization and/or data on the Web. The "and/or" here is important, because in some cases you cannot share your data but would still like to share the visualizations with the world. This is what I call "data liberation via visualization". It applies to proprietary data: often, even when I'd love to make data publicly available, binding contracts do not allow me to do so.

As part of a "data liberation via visualization" initiative, I went in search of a good free solution for disseminating interactive visualization dashboards while protecting the actual data. Two main free viz players in the market are TIBCO Spotfire Silver (free one-year license for the Personal version) and Tableau Public (free). Both allow *only* public posting of your visualizations (to save visualizations privately, you must get the paid versions). That's fine. However, public posting of visualizations with these tools comes with a download button that makes your data public as well.

I then tried MicroStrategy Cloud Personal (free Beta version), which does allow public (and private!) posting of visualizations and does not provide a download button. Of course, in order to make visualizations public, the data must sit on a server that can be reached from the visualization. All the free public-posting tools keep your data on the company's servers, so you must trust the company to protect the confidentiality and safety of your data. MicroStrategy uses a technology where the company itself cannot download your data (your Excel sheet is converted to in-memory cubes that are stored on the server). Unfortunately, the tool lacks the ability to create dashboards with multiple charts (combining multiple charts into a fully-linked interactive view).

Speaking of features, Tableau Public is the only one with full-fledged functionality like its paid cousins. Spotfire Silver Personal is stripped of highly useful charts such as scatterplots and boxplots. MicroStrategy Cloud Personal lacks multi-view dashboards and, for now, accepts only Excel files as input.