Thursday, November 28, 2013

Running a data mining contest on Kaggle

Following last year's success, I've decided once again to introduce a data mining contest in my Business Analytics using Data Mining course at the Indian School of Business. Last year, I used two platforms: CrowdAnalytix and Kaggle. This year I am again using Kaggle, which offers free competition hosting for university instructors, called Kaggle InClass.

Setting up a competition on Kaggle is not trivial, and I'd like to share some tips that I discovered to help colleagues. Even if you successfully hosted a Kaggle contest a while ago, some things have changed (as I've discovered). With some assistance from the Kaggle support team, who are extremely helpful, I was able to decipher the process. So here goes:

Step #1: get your dataset into the right structure. Your initial dataset should include input and output columns for all records (assuming that the goal is to predict the outcome from the inputs). It should also include an ID column with running index numbers.

  • Save this as an Excel or CSV file. 
  • Split the records into two datasets: a training set and a test set. 
  • Keep the training and test datasets in separate CSV files. For the test set, remove the outcome column(s).
  • Kaggle will split the test set into private and public subsets and score each of them separately. Results for the public records will appear on the leaderboard. Only you will see the results for the private subset. If you want to assign the records to public/private yourself, create a column named Usage in the test dataset and type Private or Public for each record.
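Here is a minimal Python sketch of Step #1, using only the standard library. The column names (distance_km, cancelled), the 30% test fraction and the 50/50 public/private assignment are made-up choices for illustration, not Kaggle requirements. The sketch keeps the Usage label in a host-side copy of the test records; check with Kaggle where exactly they expect that column.

```python
import csv
import io
import random


def split_for_contest(rows, outcome_col, test_frac=0.3, public_frac=0.5, seed=1):
    """Return (train, test, host_test) lists of dicts.

    train     : ID, inputs and outcome (given to participants)
    test      : ID and inputs only (given to participants)
    host_test : test records with outcome and a Usage label (kept by the host)
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    test_full, train = shuffled[:n_test], shuffled[n_test:]

    # Participants' test file: drop the outcome column
    test = [{k: v for k, v in r.items() if k != outcome_col} for r in test_full]

    # Host-side copy: keep the outcome and label each record Public or Private
    host_test = []
    for r in test_full:
        usage = "Public" if rng.random() < public_frac else "Private"
        host_test.append({**r, "Usage": usage})
    return train, test, host_test


def to_csv(records):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up records: an ID column, one input column and a 0/1 outcome
rows = [{"ID": i, "distance_km": 5 + i, "cancelled": i % 2} for i in range(1, 11)]
train, test, host_test = split_for_contest(rows, outcome_col="cancelled")
print(to_csv(test))
```

Fixing the seed makes the split reproducible, which helps if you need to regenerate the files after a failed upload.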

Step #2: open a Kaggle InClass account and start a competition using the wizard. Filling in the Basic Details and Entry & Rules pages is straightforward.

Step #3: The tricky page is Your Data. Here you'll need to follow this sequence in order to get it working:

  1. Choose the evaluation metric to be used in the competition. Kaggle has a bunch of different metrics to choose from. In my two Kaggle contests, I actually wanted a metric that was not on the list, and voila! the support team was able to help by activating a metric that was not generally available for my competition. Last year I used a lift-type measure. This year it is an average-cost-per-observation metric for a binary classification task. In short, if you don't find exactly what you're looking for, it is worth asking the folks at Kaggle.
  2. After the evaluation metric is set, upload a solution file (CSV format). This file should include only an ID column (with the IDs for all the records that participants should score), and the outcome column(s). If you include any other columns, you'll get error messages. The first row of your file should include the names of these columns.
  3. After you've uploaded a solution file, you'll be able to see whether the upload was successful. Aside from error messages, you can see your uploaded files: scroll to the bottom and you'll see the file that you've submitted or, if you submitted multiple times, all the submitted files. If you selected a random public/private partition, the "derived solution" file will include an extra column with labels "public" and "private". It's a good idea to download this file, so that you can later compare your results with the system's.
  4. After the solution file has been successfully uploaded and its columns mapped, you must upload a "sample submission file". This file is used to map the columns in the solution file to what needs to be measured by Kaggle. The file should include an ID column like that in the solution file, plus a column with the predictions. Nothing more, nothing less. Again, the first row should include the column names. You'll have an option to define rules about allowed values for these columns.
  5. After successfully submitting the sample submission file, you will be able to test the system by submitting (mock) solutions in the "submission playground". One good test is using the naive rule (in a classification task, submit all 0s or all 1s). Compare your result to the one on Kaggle to make sure everything is set up properly.
  6. Finally, in the "Additional data files" you upload the two data files: the training dataset (which includes the ID, input and output columns) and the test dataset (which includes the ID and input columns). It is also useful to upload a third file, which contains a sample valid submission. This will help participants see what their file should look like, and they can also try submitting this file to see how the system works. You can use the naive-rule submission file that you created earlier to test the system.
  7. That's it! The rest (Documentation, Preview and Overview) are quite straightforward. After you're done, you'll see a button "submit for review". You can also share the contest with another colleague prior to releasing it. Look for "Share this competition wizard with a coworker" on the Basic Details page.
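To do the naive-rule check from item 5 offline, you can score the all-0s and all-1s submissions yourself and compare with what Kaggle reports. The sketch below assumes an average-cost-per-observation metric with one cost per false negative and one per false positive; the cost values and the tiny solution table are made up, and the metric Kaggle activated for my contest may be defined differently.

```python
def average_cost(actual, predicted, cost_fn=1.0, cost_fp=1.0):
    """Average misclassification cost per observation for 0/1 labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total = 0.0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 0:
            total += cost_fn  # missed a true 1
        elif a == 0 and p == 1:
            total += cost_fp  # flagged a true 0
    return total / len(actual)


# Made-up solution file contents: ID -> actual outcome
solution = {1: 0, 2: 1, 3: 0, 4: 0, 5: 1}
ids = sorted(solution)
actual = [solution[i] for i in ids]

# Naive rules: predict the same class for every record
all_zeros = [0 for _ in ids]
all_ones = [1 for _ in ids]

cost_all_zeros = average_cost(actual, all_zeros, cost_fn=5.0, cost_fp=1.0)
cost_all_ones = average_cost(actual, all_ones, cost_fn=5.0, cost_fp=1.0)
print(cost_all_zeros, cost_all_ones)
```

If the scores you compute on the public records differ from the leaderboard, the usual suspects are the ID matching and the public/private labels in the derived solution file.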
If I've missed tips or tricks that others have used, please do share. My current competition, "predicting cab booking cancellation" (using real data from YourCabs in Bangalore), has just started, and it will be open not only to our students, but to the world. 
Submission deadline: Midnight Dec 22, 2013, India Standard Time. All welcome!

Thursday, November 21, 2013

The Scientific Value of Testing Predictive Performance

This week's NY Times article Risk Calculator for Cholesterol Appears Flawed and CNN article Does calculator overstate heart attack risk? illustrate the power of evaluating the predictive performance of a model for purposes of validating the underlying theory.

The NYT article describes findings by two Harvard Medical School professors, Ridker and Cook, about extreme over-estimation of the 10-year risk of a heart-attack or stroke when using a calculator released by the American Heart Association and the American College of Cardiology.
"According to the new guidelines, if a person's risk is above 7.5%, he or she should be put on a statin." (CNN article)
Over-estimation in this case is likely to lead to over-prescription of therapies such as cholesterol-lowering statin drugs, not to mention the psychological effect of being classified as high risk for a heart-attack or stroke.

How was this over-prediction discovered? 
"Dr. Ridker and Dr. Cook evaluated [the calculator] using three large studies that involved thousands of people and continued for at least a decade. They knew the subjects’ characteristics at the start — their ages, whether they smoked, their cholesterol levels, their blood pressures. Then they asked how many had heart attacks or strokes in the next 10 years and how many would the risk calculator predict."
In other words, the "model" (=calculator) was applied to a large labeled dataset, and the actual and predicted rates of heart attacks were compared. This is the classic "holdout set" approach. The results are nicely shown in the article's chart, overlaying the actual and predicted values in histograms of risk:

Chart from NY Times article 
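The comparison behind such a chart can be sketched in a few lines of Python: group the records by predicted risk and compare the average predicted risk with the observed event rate in each group. The risks, outcomes and bin edges below are made up purely to show the mechanics; they are not the data from the studies.

```python
def calibration_table(predicted_risk, had_event, bin_edges):
    """For each risk bin return (low, high, n, mean predicted risk, observed rate)."""
    table = []
    for low, high in zip(bin_edges[:-1], bin_edges[1:]):
        idx = [i for i, p in enumerate(predicted_risk) if low <= p < high]
        if not idx:
            continue  # skip empty bins
        mean_predicted = sum(predicted_risk[i] for i in idx) / len(idx)
        observed_rate = sum(had_event[i] for i in idx) / len(idx)
        table.append((low, high, len(idx), mean_predicted, observed_rate))
    return table


# Made-up holdout records: predicted 10-year risk and whether an event occurred
predicted_risk = [0.03, 0.04, 0.06, 0.08, 0.09, 0.07, 0.12, 0.15, 0.18, 0.16]
had_event = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
bin_edges = [0.0, 0.05, 0.10, 0.20]

table = calibration_table(predicted_risk, had_event, bin_edges)
for low, high, n, mean_predicted, observed_rate in table:
    print(f"[{low:.2f}, {high:.2f}): n={n}, "
          f"predicted={mean_predicted:.3f}, observed={observed_rate:.3f}")
```

A model that over-predicts would show predicted values consistently above the observed rates across the bins.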

Beyond the practical usefulness of detecting the flaw in the calculators, evaluating predictive performance tells us something about the underlying model. A next natural question is "why?", or how was the calculator/model built?

The NYT article quotes Dr. Smith, a professor of medicine at the University of North Carolina and a past president of the American Heart Association:
“a lot of people put a lot of thought into how can we identify people who can benefit from therapy... What we have come forward with represents the best efforts of people who have been working for five years.”
Although this statement seems to imply that the guidelines are based on an informal qualitative integration of domain knowledge and experience, I am guessing (and hoping) that there is a sound data-based model behind the scenes. The fact that the calculator uses very few and coarse predictors makes me suspicious that the model was not designed or optimized for "personalized medicine".

One reason mentioned for the extreme over-prediction of this model on the data from the three studies is the difference between the population used to "train the calculator" (generate the guidelines) and the population in the evaluation studies in terms of the relationship between heart-attacks/strokes and the risk factors:
"The problem might have stemmed from the fact that the calculator uses as reference points data collected more than a decade ago, when more people smoked and had strokes and heart attacks earlier in life. For example, the guideline makers used data from studies in the 1990s to determine how various risk factors like cholesterol levels and blood pressure led to actual heart attacks and strokes over a decade of observation.
But people have changed in the past few decades, Dr. Blaha said. Among other things, there is no longer such a big gap between women’s risks and those of men at a given age. And people get heart attacks and strokes at older ages."
In predictive analytics, we know that the biggest and sneakiest danger to predictive power is when the training data and conditions differ from the data and conditions at the time of model deployment. While there is no magic bullet, there are some principles and strategies that can help: First, awareness of this weakness. Second, monitoring and evaluating predictive power in different scenarios (robustness/sensitivity analysis) and over time. Third, re-training models over time as new data arrive.
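The second and third strategies can be sketched as a simple monitoring routine: compute the error rate per time period and flag periods where it drifts beyond a tolerance above a baseline, which would be a cue to investigate and possibly re-train. The records, baseline and tolerance below are made-up values for illustration.

```python
from collections import defaultdict


def error_rate_by_period(records):
    """records: iterable of (period, actual, predicted). Returns {period: error rate}."""
    counts = defaultdict(lambda: [0, 0])  # period -> [errors, total]
    for period, actual, predicted in records:
        counts[period][0] += int(actual != predicted)
        counts[period][1] += 1
    return {p: errors / total for p, (errors, total) in sorted(counts.items())}


def periods_needing_review(rates, baseline, tolerance=0.05):
    """Periods whose error rate exceeds the baseline by more than the tolerance."""
    return [p for p, rate in rates.items() if rate > baseline + tolerance]


# Made-up scored records: (period, actual class, predicted class)
records = [
    ("2013-Q1", 1, 1), ("2013-Q1", 0, 0), ("2013-Q1", 0, 0), ("2013-Q1", 1, 0),
    ("2013-Q2", 1, 0), ("2013-Q2", 0, 1), ("2013-Q2", 0, 0), ("2013-Q2", 1, 0),
]

rates = error_rate_by_period(records)
flagged = periods_needing_review(rates, baseline=0.25, tolerance=0.10)
print(rates, flagged)
```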

Evaluating predictive power is a very powerful tool. We can learn not only about actual predictive power, but also get clues as to the strengths and weaknesses of the underlying model.

Tuesday, November 05, 2013

A Tale of Two (Business Analytics) Courses

I have been teaching two business analytics elective MBA-level courses at ISB. One is called "Business Analytics Using Data Mining" (BADM) and the other, "Forecasting Analytics" (FCAS). Although we share the syllabi for both courses, I often receive the following question, in one variant or another:
What is the difference between the two courses?
The short answer is: BADM is focused on analyzing cross-sectional data, while FCAS is focused on time series data. This answer clarifies the issue to data miners and statisticians, but sometimes leaves aspiring data analytics students perplexed. So let me elaborate.

What is the difference between cross-sectional data and time series data?
Think photography. Cross-sectional data are like a snapshot in time. We might have a large dataset on a large set of customers, with their demographic information and their transactional information summarized in some form (e.g., number of visits thus far). Another example is a transactional dataset, with information on each transaction, perhaps including a flag of whether it was fraudulent. A third is movie ratings on an online movie rental website. You have probably encountered multiple examples of such datasets in the Statistics course. BADM introduces methods that use cross-sectional data for predicting the outcomes for new records. In contrast, time series data are like a video, where you collect data over time. Our focus will be on approaches and methods for forecasting a series into the future. Data examples include daily traffic, weekly demand, monthly disease outbreaks, and so forth. 
How are the courses similar?
The two courses are similar in terms of flavor and focus: they both introduce the notion of business analytics, where you identify business opportunities and challenges that can potentially be tackled with data mining or statistical tools. They are both technical courses, not in the mathematical sense, but rather in the sense that we do hands-on work (and a team project) with real data, learning and applying different techniques, and experiencing the entire process from business problem definition to deployment back into the business environment.
In both courses, a team project is pivotal. Teams use real data to tackle a potentially real business problem/opportunity. You can browse presentations and reports from previous years to get an idea. We also use the same software packages in both courses, called XLMiner and TIBCO Spotfire. For those on the Hyderabad campus, BADM and FCAS students will see the same instructor in both courses this year (yes, that's me).
How do the courses differ in terms of delivery?
Since last year, I have "flipped" BADM and turned it into a MOOC-style course. This means that students are expected to do some work online before each class, so that in class we can focus on hands-on data mining, higher level discussions, and more. The online component will also be open to the larger community, where students can interact with alumni and others interested in analytics. FCAS is still offered in the more traditional lecture-style mode.
Is there overlap between the courses?
While the two courses share the data mining flavor and the general business analytics approaches, they have very little overlap in terms of methods, and even then, the implementations are different. For example, while we use linear regression in both cases, it is used in different ways when predicting with cross-sectional data vs. forecasting with time series.
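One way to picture the linear regression example is the sketch below. It is an illustration in Python, not course material (the courses use XLMiner), and the numbers are made up. With cross-sectional data, the predictor is an attribute of each record and the fitted line is used to predict a new record. With a time series, one common setup (assumed here) uses the time index as the predictor to capture a trend, and the fitted line is extrapolated to future periods.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Cross-sectional use: one row per customer, predict the outcome for a new record
visits = [1, 2, 3, 4, 5]                 # made-up input column
spend = [12.0, 14.0, 16.0, 18.0, 20.0]   # made-up outcome column
intercept, slope = fit_line(visits, spend)
new_customer_prediction = intercept + slope * 7

# Time series use: the predictor is the time index, forecast future periods
demand = [100.0, 103.0, 106.0, 109.0]    # made-up weekly demand, t = 1..4
t = list(range(1, len(demand) + 1))
trend_intercept, trend_slope = fit_line(t, demand)
forecasts = [trend_intercept + trend_slope * h for h in (5, 6)]

print(new_customer_prediction, forecasts)
```

The evaluation differs as well: cross-sectional models are typically assessed on a random holdout set, while a forecasting model is assessed on the most recent periods held out in time order.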
So which course should I take? Should I take both?
Being completely biased, it's difficult for me to tell you not to take any one of these courses. However, I will say that these courses require a large time and effort investment. If you are taking other heavy courses this term, you might want to stick with only one of BADM or FCAS. Taking the two courses will give you a stronger and broader skill set in data analytics, so for those interested in working in the business analytics field, I'd suggest taking both. Finally, if you register for FCAS only, you'll still be able to join the online component for BADM without registering. Although it's not as extensive as taking the course, you'll be able to get a glimpse of data mining with cross-sectional data.
Finally, a historical note: when I taught a similar course at the University of Maryland (in 2004-2010), it was a 14-week semester-long course. In that course, which was mostly focused on cross-sectional methods, I included a chunk on forecasting, so it was a mix. However, the separation into two dedicated courses is more coherent, gives more depth, does more justice to these extremely useful methods and approaches, and allows gaining first-hand experience in the uses of these different types of data structures that are commonly encountered in any organization.

Thursday, August 15, 2013

Designing a Business Analytics program, Part 3: Structure

This post continues two earlier posts (Part 1: Intro and Part 2: Content) on Designing a Business Analytics (BA) program. This part focuses on the structure of a BA program, and especially course structure.

In the program that I designed, each of the 16 courses combines on-ground sessions with online components. Importantly, the opening and closing of a course should be on-ground.

The hybrid online/on-ground design is intended to accommodate participants who cannot take long periods of time off to attend campus. Yet, even in a residential program, a hybrid structure can be more effective, if it is properly implemented. The reason is that a hybrid model is more similar to the real-world functioning of an analyst. At the start and end of a project, close communication is needed with the domain experts and stakeholders to ensure that everyone is clear about the goals and the implications. In between these touch points, the analytics group works "offline" (building models, evaluating, testing, going back and forth) while communicating among the group and from time to time with the domain people.

A hybrid "sandwich" BA program can be set up to mimic this process:
  • The on-ground sessions at the start and end of each course help set the stage and expectations, build communication channels between the instructor and participants as well as among participants; at the close of a course, participants present their work and receive peer and instructor feedback.
  • The online components guide participants (and teams of participants) through the skill development and knowledge acquisition that the course aims at. Working through a live project, participants can acquire the needed knowledge (1) via lecture videos, textbook readings, case studies and articles, software tutorials and more, (2) via self-assessment and small deliverables that build up needed proficiency, and (3) via a live online discussion board where participants are required to ask, answer, discuss and share experiences, challenges and discoveries. If designing and implementing the online component is beyond the means of the institution, it is possible to integrate existing successful online courses, such as those offered on Statistics.com or on Coursera, edX and other established online course providers.
For example, in a Predictive Analytics course, a major component is a team project with real data, solving a potentially real problem. The on-ground sessions would focus on translating a business problem into an analytics problem and setting the expectations and stage for the process the teams will be going through. Teams would submit proposals and discuss with the instructor to assure feasibility and determine the way forward. The online components would include short lecture videos, textbook reading, short individual assignments to master software and technique, and a vibrant online discussion board with topics at different technical and business levels (this is similar to my semi-MOOC course Business Analytics Using Data Mining). In the closing on-ground sessions, teams present their work to the entire group and discuss challenges and insights; each team might meet with the instructor to receive feedback and do a second round of improvement. Finally, an integrative session would provide closure and linkage to other courses.

Designing a Business Analytics program, Part 2: Content

This post follows Part 1: Intro of Designing a Business Analytics program. In this post, I focus on the content to be covered in the program, in the form of courses and projects.

The following design is based on my research of many programs, on discussions with faculty in various analytics areas, with analysts and managers at different levels, and on feedback from many past MBA students who have taken my analytics courses over the years (data mining, forecasting, visualization, statistics, etc.) and are now managing data at a broad range of companies and organizations.

Content
Dealing with data, little or mountains, and being able to tackle an array of business challenges and opportunities, requires a broad and diverse set of tools and approaches. Going from data access and management to modeling, assessment and deployment requires a skill set that derives from the fields of statistics, computer science, operations research, and more. In addition, one needs integrative and "big picture" thinking and effective communication skills. Here is a list of 16 courses, divided into four sets, that attempts to achieve such a skill set (by no means is this the only set - would love to hear comments):

Set I
  1. Analytic Thinking (what is a model? what is the role of a model? data in context and data-domain integration)
  2. Data Visualization (data exploration, interactive visualization, charts and dashboards, data presentation and effective communication, use of BI tools)
  3. Statistical Analysis 1: Estimation and inference (observational studies and experiments; estimating population means, proportions, and more; testing hypotheses regarding population numbers; using programming and menu-driven software)
  4. Statistical Analysis 2: Regression models (linear, logistic, ANOVA)
Set II
  1. Data Management 1: Database design and implementation, data warehousing
  2. Forecasting Analytics: Exploring and modeling time series
  3. Data Management 2: Big Data (Hadoop-MapReduce and more)
  4. Operations 1: Simulation (principles of simulation; Monte Carlo and Discrete Event simulation)
Set III
  1. Operations 2: Optimization (optimization techniques, sensitivity analysis, and more)
  2. Statistical Analysis 3: Advanced statistical models (censoring and truncation, modeling count data, handling missing values, design of experiments (A/B testing and beyond))
  3. Data Collection (Web data collection, online surveys, experiments)
  4. Data Mining 1: Supervised Learning - Predictive Analytics (predictive algorithms, evaluating predictive power, using software)
Set IV
  1. Data Mining 2: Unsupervised Learning (dimension reduction, clustering, association rules, recommender systems)
  2. Contemporary Analytics 1 (choose between: text mining, network analytics, social analytics, customer analytics, web analytics, risk analytics)
  3. Contemporary Analytics 2 (from the list above)
  4. Integrative Thinking (BA in different fields, choosing and integrating tools and analytic approaches into an effective solution)
The courses are divided into sets of four, where courses in each set can be offered in parallel. The order should take into account coverage of other courses and natural linkages.

Lastly: two industry team projects that require integrating skills from multiple courses should give participants the opportunity to interface with industry, test their skills in a more realistic setting, and gain initial experience and confidence to move forward on their own.

Continue to Part 3: Structure

Designing a Business Analytics program, Part 1: Intro

I have been receiving many inquiries about programs in "Business Analytics" (BA), online and offline, in the US and outside the US. The few programs that are already out there (see an earlier post) are relatively new, so it is difficult to assess their success in producing data-savvy analysts.

Rather than concentrate on the uncertainty, let me share my view and experience regarding the skill set that such programs should provide. To be practical, I will share the program that I designed for the Indian School of Business one-year certificate program in BA(*), in terms of content and structure. Both reflect the needed skills and knowledge that I believe make a valuable data analyst in a company, as well as a powerful consultant.

The program was designed for participants who have a few years of business experience and are planning to manage the data crunchers, but must acquire a solid knowledge of the crunchers' toolkit, and especially how it can be used effectively to tackle business goals, challenges and opportunities.

Business Analytics experts have a broad skill set
One important note: Although some universities and business schools are tempted to rename an existing operations or statistics program as a BA (or "Big Data" or "Data Science", etc.) program, this will by no means supply the required diversity of skills. A program in BA should not look like a statistics program. It also should not look like a program in operations research. The key is therefore a combination of courses from different areas (statistics and operations among them), which usually requires experts from across campus. In a recent post, visualization expert Nathan Yau comments on the need to know more than just visualization to be successful in the field ("It still surprises me how little statistics visualization people know... Look at job listings though, and most employers list it in the required skill set, so it's a big plus for you hiring-wise.")

The next two posts describe the content and structure of the program.

Continue to Part 2: Content

(*) The final program structure and content at ISB were modified by the program administrator to accommodate constraints and shortages.

Friday, August 09, 2013

Predictive relationships and A/B testing

I recently watched an interesting webinar on Seeking the Magic Optimization Metric: When Complex Relationships Between Predictors Lead You Astray by Kelly Uphoff, manager of experimental analytics at Netflix. The presenter mentioned that Netflix is a heavy user of A/B testing for experimentation, and in this talk focused on the goal of optimizing retention.

In ideal A/B testing, the company would test the effect of an intervention of choice (such as displaying a promotion on their website) on retention, by assigning it to a random sample of users, and then comparing retention of the intervention group to that of a control group that was not subject to the intervention. This experimental setup can help infer a causal effect of the treatment on retention. The problem is that retention can take a long time to measure -- if retention is defined as "customer paid for the next 6 months", you have to wait 6 months before you can determine the outcome.
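Once the retention outcomes are finally in, the comparison between the two groups can be sketched as a two-proportion z-test. This is a generic textbook test, not necessarily what Netflix uses, and the counts below are made up.

```python
import math


def two_proportion_ztest(retained_t, n_t, retained_c, n_c):
    """Compare retention rates of treatment vs. control.

    Returns (difference in rates, z statistic, two-sided p-value),
    using the pooled-proportion standard error.
    """
    p_t, p_c = retained_t / n_t, retained_c / n_c
    p_pool = (retained_t + retained_c) / (n_t + n_c)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    z = (p_t - p_c) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return p_t - p_c, z, p_value


# Made-up experiment: users retained after 6 months in each randomly assigned group
diff, z, p_value = two_proportion_ztest(retained_t=460, n_t=1000,
                                        retained_c=400, n_c=1000)
print(f"difference={diff:.3f}, z={z:.2f}, p-value={p_value:.4f}")
```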

Tuesday, May 14, 2013

An Appeal to Companies: Leave the Data Behind @kaggle @crowdanalytix_q

A while ago I wrote about the wonderful new age of real-data-made-available-for-academic-use through the growing number of data mining contests on platforms such as Kaggle and CrowdANALYTIX. Such data provide excellent examples for courses on data mining that help train the next generation of data scientists, business analysts, and other data-savvy graduates. 

A couple of years later, I am discovering the painful truth that many of these datasets are no longer available. The most likely reason is that the company that shared the data has pulled it out. This unexpected twist is extremely harmful to both academia and industry: instructors and teachers who design and teach data mining courses build course plans, assignments, and projects based on the assumption that the data will be available. And now, we have to revise all our materials! What a waste of time, resources, and energy. Obviously, after the first time this happens, I think twice about whether to use contest data for teaching. I will not try to convince you of the waste of faculty time, the loss to students, or even the loss to self-learners who are now all over the globe. I'll wear my b-school professor hat and speak the ROI language: The companies lose. In the short term, none of these students will be competing in their contests. The bigger loss, however, is to the entire business sector in the longer term: I constantly receive urgent requests from businesses and large corporations for business analytics trained graduates. "We are in desperate need of data-savvy managers! Can you help us find good data scientists?". Yet, some of these same companies are pulling their data out of the hands of instructors and students.

I am perplexed as to why a company would pull their data away after the data was publicly available for a long time. It's not the fear of competitors, since the data were already publicly available. It's probably not the contest platform CEOs - they'd love the traffic. Then why? I smell lawyers...

One example is the beautiful Heritage Health contest that ran on Kaggle for 2 years. Now, there is no way to get the data, even if you were a registered contestant in the past (see screenshots). Yet, the healthcare sector has been begging for help in training "healthcare analytics" professionals. 


Data access is no longer available, although it was for 2 years.

I'd like to send out an appeal to Heritage Provider Network and other companies that have shared invaluable data through a publicly-available contest platform. Please consider making the data publicly available forever. We will then be able to integrate it into data analytics courses, textbooks, research, and projects, thereby providing industry with not only insights, but also properly trained students who've learned using real data and real problems.

This time, industry can contribute in a major way to reducing the academia-industry gap. Even if only for their own good.

Wednesday, May 01, 2013

Collaborations of LaTeX and Word users

The two popular document preparation tools used by researchers in academia are LaTeX and Microsoft Word. Or, put differently: Microsoft Word and LaTeX. In more technical fields, LaTeX is king, while in less technical fields, it is Word. In the business school, worlds collide. Coming from a technical background, I am a heavy user of LaTeX, for research papers and even for book writing. However, many of my business school collaborators (e.g., from fields of Information Systems and Marketing) are Word users.

While collaboration platforms such as Google Drive and Dropbox have greatly enhanced collaborative possibilities, including co-editing a document, the Word-or-LaTeX schism still poses a serious challenge. I've had to migrate to Word (and suffer) in some collaborations, while in others I convinced my co-authors to move the document to LaTeX, but then I was the one receiving text bits to incorporate back into the document and sharing the compiled PDF (that's easy via Dropbox).

For someone used to LaTeX, Word is quite awkward: handling different document components such as bibliography and sectioning is cumbersome; journal templates are easier to use in LaTeX; writing formulas is much easier. For Word users, LaTeX usually seems intimidating, as it is not WYSIWYG (you must click a button to compile the text and then see the resulting PDF in a separate PDF viewer).

One solution is the open-source LyX package, which has a graphical interface with LaTeX "under the hood". I personally found it unsatisfactory, as it is "neither here nor there"...

So what to do if you're a LaTeX junkie and want to move a project from Word into LaTeX? Let's start with the initial migration. Here are a few useful tools that I discovered:
  • Tables: To convert a table from Word into a LaTeX table, copy-paste into Excel and then use the Excel2Latex tool. Simply download the xla file and open it in Excel. It will add an add-in menu. Choose the table, click the convert button, and you can choose either to copy the LaTeX code or to export it to a .tex file.
  • Bibliography: To convert a Word Bibliography file into a LaTeX bib file, use the neat Word2Bibtex tool. Download the bibtex.xsl file and follow the directions. Note: the tool will only work if you have administrator privileges on the computer, as it requires copying a file into an "admin only" folder.
  • Figures: unlike Word, where you copy-paste images, in LaTeX you'll need them as separate image files (png, jpg, eps, etc.). If you only have a few, right-click each figure in Word, then "Save as image" and choose png or jpg. If you already have a bunch of figures in the Word doc, save the doc as "filtered HTML". This will create a separate folder with all the image files (if they are saved as gif, you'll have to convert them to png or jpg).
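If you'd rather not install an Excel add-in, a table saved as CSV can also be turned into a LaTeX tabular with a few lines of Python. This is a bare-bones alternative sketch, not how Excel2Latex works: it left-aligns every column, treats the first row as the header, and does not escape special characters in the cells. The sample table is made up.

```python
import csv
import io


def csv_to_tabular(csv_text):
    """Convert CSV text (first row = header) into a simple LaTeX tabular."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = [r"\begin{tabular}{" + "l" * len(header) + "}", r"\hline"]
    lines.append(" & ".join(header) + r" \\")
    lines.append(r"\hline")
    for row in body:
        lines.append(" & ".join(row) + r" \\")
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines)


# Made-up table, e.g. pasted from Word into Excel and saved as CSV
sample = "Method,Error rate\nNaive rule,0.40\nLogistic regression,0.22\n"
latex_table = csv_to_tabular(sample)
print(latex_table)
```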
Now, to the co-editing of the tex file. I have still not completely resolved the problem of the non-LaTeX collaborators editing the file. I always get the question: "can you send me a Word version so that I can edit it?". Here are some options:
  • They can annotate the PDF file using highlighting and sticky notes.
  • They can copy-paste from the PDF file into Word. Figures can be copied using Acrobat Reader's Edit > Take a Snapshot, but they are usually not needed for editing.
  • They can open the .tex file with Wordpad for editing the text.
  • It's also possible to convert back to Word: The Latex2rtf tool should do that (it actually clashed with my TexStudio editor and erased my tex file!)
But then, even if you do convert to Word, what to do with the Word file once the collaborator has done his/her editing?
Another solution is to use a cloud LaTeX platform, such as ShareLaTex.com. The advantage is that there's no need to install software and the editor + viewer are nicely set side-by-side with a big green "recompile" button. The free version allows collaborating with one free user. The paid versions are more generous (I like the "coming soon" integrations with Google Drive and Dropbox!). 
  • Catch #1: you must be online to compile. 
  • Catch #2: long documents such as books can take substantially longer to compile online compared to locally. 
  • Catch #3: if the non-LaTeX collaborator uses some tex-unfriendly text (such as a $ sign to denote USD), the compilation will fail. So, basic tex knowledge is needed - or babysitting by the LaTeX-head collaborator.
I would love to hear from others who are tackling these collaboration issues and have found good solutions.
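
One way to reduce the babysitting in Catch #3 is to run a collaborator's plain text through a small escaping step before pasting it into the tex file. Below is a minimal Python sketch of that idea. The function name escape_latex and the character table are my own illustration (not part of ShareLaTeX or any other tool), and the sketch assumes the input is ordinary prose: it would also escape intentional LaTeX commands or math, so it should not be applied to text that already contains markup.

```python
# Hypothetical helper: escape characters that LaTeX treats specially in plain text.
# Assumption: the input is ordinary prose with no intentional LaTeX markup.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text: str) -> str:
    """Return text with LaTeX special characters escaped.

    Works one character at a time, so the backslashes it inserts
    are never escaped a second time.
    """
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


# A sentence a collaborator might type, with a $ sign denoting USD
print(escape_latex("The fee is $20 (a 5% discount)"))
```

The printed line can be pasted into the tex file as is; the dollar and percent signs no longer start math mode or a comment.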

Sunday, April 28, 2013

New short guide: "To Publish or To Self-Publish My Textbook?"

My self-publishing endeavors have led to a growing number of conversations with colleagues, friends, colleagues-of-friends and other permutations who've asked me to share my experiences. Finally, I decided to write down a short guide, which is now available as a Kindle eBook.

To Publish or To Self-Publish My Textbook? Notes from a Published and Self-Published Author gives a glimpse into the expectations, challenges, rewards, and surprises that an author experiences when publishing and/or self-publishing a textbook. This is not a guide on self-publishing, but rather notes about the process of publishing a textbook with a big publisher vs. self-publishing and what to expect.

To celebrate the launch, the eBook is FREE for 72 hours. After the promotion, it will still be cheaper than a cappuccino.

You can read the book (and any other Kindle book) on many devices -- no need for a Kindle device. You can use the Kindle Cloud Reader for online reading, or else download the free Kindle reading app for PC, iPad, Android, etc.


Wednesday, April 03, 2013

Analytics magazines: Please lead the way for effective data presentation

Professional "analytics" associations such as INFORMS, the American Statistical Association, and the Royal Statistical Society have been launching new magazines intended for broader, non-academic audiences that are involved or interested in data analytics. Several of these magazines are aesthetically beautiful with plenty of interesting articles about applications of data analysis and their impact on daily life, society, and more. Significance magazine and Analytics magazine are two examples.

The next step is for these magazines to implement what we preach regarding data presentation: use effective visualizations. In particular, the online versions can include interactive dashboards! If the New York Times and Washington Post can have interactive dashboards on their websites, so can magazines of statistics and operations research societies.

For example, the OR/MS Today magazine reports the results of an annual "statistical software survey" in the form of multi-page tables in the hardcopy and PDF versions of the magazine. These tables are not user friendly in the sense that it is difficult to explore and compare the products and tools. Surprisingly, the online implementation is even worse: a bunch of HTML pages, each with one static table.
Presenting the survey results in multi-page tables is not the most user-friendly (from Feb 2013 issue of OR/MS Today magazine)
To illustrate the point, I have converted the 2013 Statistical Software Survey results into an interactive dashboard. The user can examine and compare particular products or tools of interest using filters, sort the products by different attributes, and get a quick idea about pricing. Maybe not the most fascinating data, especially given the many missing values, yet I hope the dashboard is more effective and engaging.

Interactive dashboard. Click on the image to go to the dashboard


Tuesday, January 22, 2013

Business analytics student projects a valuable ground for industry-academia ties

Since October 2012, I have taught multiple courses on data mining and on forecasting. Teams of students worked on projects spanning various industries, from retail to eCommerce to telecom. Each project presents a business problem or opportunity that is translated into a data mining or forecasting problem. Using real data, the team then executes the analytics solution, evaluates it and presents recommendations. A select set of project reports and presentations is available on my website (search for 2012 Nov and 2012 Dec projects).

For projects this year, we used three datasets from regional sources (thanks to our industry partners Hansa Cequity and TheBargain.in). One is a huge dataset from an Indian retail chain of hypermarkets. Another is data on electronic gadgets on online shopping sites in India. A third is a large survey on mobile usage conducted in India. These datasets were also used in several data mining contests that we set up during the course through CrowdANALYTIX.com and through Kaggle.com. The contests were open to the public and indeed submissions came in from around the world.

Business analytics courses are an excellent ground for industry-academia partnerships. Unlike one-way interactions such as guest lectures from industry, internships, or student site visits, a business analytics project that is conducted by student teams (with faculty guidance) creates value for both the industry partner who shares the data and the students. Students who have gained a basic understanding of data analytics can be creative about new uses that companies have not considered (this can be achieved through "ideation contests"). Companies can also use this ground for piloting or testing out the use of their data for addressing goals of interest with little investment. Students get first-hand experience with regional data and problems, and can showcase their project as they interview for positions that require such expertise.

So what is the catch? Building a strong relationship requires good, open-minded industry partners and a faculty member who can lead such efforts. It is a new role for most faculty teaching traditional statistics or data mining courses. Managing data confidentiality, creating data mining contests, and initiating and maintaining open communication channels with all stakeholders are nontrivial tasks. But they are well worth the effort.


Thursday, January 17, 2013

Predictive modeling and interventions (why you need post-intervention data)

In the last few months I've been involved in nearly 20 data mining projects done by student teams at ISB, as part of the MBA-level course and an executive education program. All projects relied on real data. One of the data sources was transactional data from a large regional hypermarket. While the topics of the projects ranged across a large spectrum of business goals and opportunities for retail, one point in particular struck me as repeating across many projects and in many face-to-face discussions: the use of secondary data (data that were already collected for some other purpose) for making decisions and deriving insights regarding future interventions.

By intervention I mean any action. In a marketing context, we can think of personalized coupons, advertising, customer care, etc.

In particular, many teams defined a data mining problem that would help them determine appropriate target marketing. For example, predict whether the next shopping trip of a customer will include dairy products and then use this for offering appropriate promotions. Another example: predict whether a relatively new customer will be a high-value customer at the end of a year (as defined by some metric related to the customer's spending or shopping behavior), and use it to target for a "white glove" service. In other words, building a predictive model for deciding who, when and what to offer. While this approach seemed natural to many students and professionals, there are two major sticking points:

  1. We cannot properly evaluate the performance of the model in terms of actual business impact without post-intervention data. The reason is that without historical data on a similar intervention, we cannot evaluate how the targeted intervention will perform. For instance, while we can predict who is most likely to purchase dairy products from a large existing transactional database, we cannot tell whether they would redeem a coupon that is targeted to them unless we have some data collected after a similar coupon campaign.
  2. We cannot build a predictive model that is optimized for the intervention goal unless we have post-intervention data. For example, if coupon redemption is the intervention performance metric, we cannot build a predictive model optimizing coupon redemption unless we have data on coupon redemption.

A predictive model is trained on past data. To build a model that aims at optimizing the intervention goal, and to evaluate model performance in light of that goal, we must have some post-intervention data. A pilot study/period is therefore a good way to start: deploy the intervention either randomly or to the sample that is indicated by a predictive model to be optimal in some way (it is best to do both: deploy to a sample that has both a random choice and a model-indicated choice). Once you have the post-intervention data on the intervention results, you can build a predictive model to optimize results on a future, larger-scale intervention.
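
To make the "do both" pilot concrete, here is a minimal Python sketch of how a pilot sample could be assembled. Everything in it is illustrative: the function assign_pilot, the customer IDs and the scores are made up, and drawing the random group only from customers not already picked by the model is one possible design choice, not the only one.

```python
import random


def assign_pilot(customers, scores, n_model, n_random, seed=0):
    """Split a pilot sample into a model-indicated group and a random group.

    customers: list of customer IDs
    scores: dict mapping customer ID -> predicted score (higher = better target)
    n_model, n_random: sizes of the two pilot groups
    """
    # Model-indicated group: the n_model customers with the highest predicted scores
    ranked = sorted(customers, key=lambda c: scores[c], reverse=True)
    model_group = ranked[:n_model]

    # Random group: drawn from the remaining customers, ignoring the scores
    # (assumption: the two groups should not overlap)
    remaining = ranked[n_model:]
    rng = random.Random(seed)
    random_group = rng.sample(remaining, n_random)
    return model_group, random_group


# Made-up example: ten customers with scores 0.1, 0.2, ..., 1.0
customers = [f"C{i}" for i in range(1, 11)]
scores = {c: i / 10 for i, c in enumerate(customers, start=1)}

model_group, random_group = assign_pilot(customers, scores, n_model=3, n_random=3)
print("Model-indicated group:", model_group)
print("Random group:", random_group)
```

After the pilot, the outcomes recorded for both groups (for example, coupon redemption) become the post-intervention data: the random group shows how the intervention performs without targeting, and the combined data can be used to train a model on the intervention goal itself.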

Tuesday, January 15, 2013

What does "business analytics" mean in academia?

But what exactly does this mean?
At the recent ISIS conference, I organized and moderated a panel called "Business Analytics and Big Data: How it affects Business School Research and Teaching". The goal was to tackle the ambiguity in the terms "Business Analytics" and "Big Data" in the context of business school research and teaching. I opened with a few points:

  1. Some research b-schools are posting job ads for tenure-track faculty in "Business Analytics" (e.g., University of Maryland; Google "professor business analytics position" for plenty more). What does this mean? What is supposed to be the background of these candidates and where are they supposed to publish to get promoted? ("The Journal of Business Analytics"?)
  2. A recent special issue of the top scholarly journal Management Science was devoted to "Business Analytics". What types of submissions fall under this label? What types do not?
  3. Many new "Business Analytics" programs have been springing up in business schools worldwide. What is new about their offerings? 

Panelists Anitesh, Ram and David - photo courtesy of Ravi Bapna
The panelists were a mix of academics (Prof Anitesh Barua from UT Austin and Prof Ram Chellapah from Emory University) and industry (Dr. David Hardoon, SAS Singapore). The audience was also a mixed crowd of academics, mostly from MIS departments (in business schools), and industry experts from companies such as IBM and Deloitte.

The discussion took various twists and turns with heavy audience participation. Here are several issues that emerged:

  • Is there any meaning to BA in academia or is it just the use of analytics (=data tools) within a business context? Some industry folks said that BA is only meaningful within a business context, not research wise.
  • Is BA just a fancier name for statisticians in a business school or does it convey a different type of statistician? (similar to the adoption of "operations management" (OM) by many operations research (OR) academics)
  • The academics on the panel made the point that BA has been changing the flavor of research in terms of adding a discovery/exploratory dimension that does not typically exist in social science and IS research. Rather than only theorize-then-test-with-data, data are now explored in further detail using tools such as visualization and micro-level models. The main concern, however, was that it is still very difficult to publish such research in top journals.
  • With respect to "what constitutes a BA research article", Prof. Ravi Bapna said "it's difficult to specify what papers are BA, but it is easy to spot what is not BA".
  • While machine learning and data mining have been around for quite some time, and the methods have not really changed, the application of both within a business context has become more popular due to friendlier software and stronger computing power. These new practices are therefore now an important core in MBA and other business programs.
  • One type of b-school program that seems to lag behind on the BA front is the PhD program. Are we equipping our PhD students with abilities to deal with and take advantage of large datasets for developing theory? Are PhD programs revising their curriculum to include big data technologies and machine learning capabilities as required core courses?
Some participants claimed that BA is just another buzzword that will go away after some time, so we need not worry about defining it or demystifying it. After all, the software vendors coin such terms, create a buzz, and finally the buzz moves on. Whether this is the case with BA or with Big Data is yet to be seen. In the meantime, we should ponder whether we are really doing something new in our research, and if so, pinpoint what exactly it is and how to formulate it as requirements for a new era of researchers.