Tuesday, January 22, 2013

Business analytics student projects a valuable ground for industry-academia ties

Since October 2012, I have taught multiple courses on data mining and forecasting. Teams of students worked on projects spanning various industries, from retail to eCommerce to telecom. Each project presents a business problem or opportunity that is translated into a data mining or forecasting problem. Using real data, the team then executes the analytics solution, evaluates it, and presents recommendations. A select set of project reports and presentations is available on my website (search for the 2012 Nov and 2012 Dec projects).

For projects this year, we used three datasets from regional sources (thanks to our industry partners Hansa Cequity and TheBargain.in). One is a huge dataset from an Indian chain of hypermarkets. Another is data on electronic gadgets listed on online shopping sites in India. A third is a large survey on mobile usage conducted in India. These datasets were also used in several data mining contests that we set up during the course through CrowdANALYTIX.com and Kaggle.com. The contests were open to the public, and submissions indeed came in from around the world.

Business analytics courses are an excellent ground for industry-academia partnerships. Unlike one-way interactions such as industry guest lectures, internships, or student site visits, a business analytics project conducted by student teams (with faculty guidance) creates value both for the industry partner who shares the data and for the students. Students who have gained a basic understanding of data analytics can be creative about new uses that companies have not considered (this can be achieved through "ideation contests"). Companies can also use these projects to pilot or test the use of their data for addressing goals of interest, with little investment. Students get first-hand experience with regional data and problems, and can showcase their project when interviewing for positions that require such expertise.

So what is the catch? Building a strong relationship requires good, open-minded industry partners and a faculty member who can lead such efforts. It is a new role for most faculty teaching traditional statistics or data mining courses. Managing data confidentiality, creating data mining contests, and initiating and maintaining open communication channels with all stakeholders are nontrivial tasks. But they are well worth the effort.


Thursday, January 17, 2013

Predictive modeling and interventions (why you need post-intervention data)

In the last few months I've been involved in nearly 20 data mining projects done by student teams at ISB, as part of the MBA-level course and an executive education program. All projects relied on real data. One of the data sources was transactional data from a large regional hypermarket. While the projects ranged across a large spectrum of business goals and opportunities for retail, one issue in particular kept recurring across many projects and many face-to-face discussions: the use of secondary data (data that were already collected for some other purpose) for making decisions and deriving insights regarding future interventions.

By intervention I mean any action. In a marketing context, we can think of personalized coupons, advertising, customer care, etc.

In particular, many teams defined a data mining problem that would help them determine appropriate target marketing. For example: predict whether a customer's next shopping trip will include dairy products, and then use this prediction to offer appropriate promotions. Another example: predict whether a relatively new customer will be a high-value customer at the end of a year (as defined by some metric related to the customer's spending or shopping behavior), and use the prediction to target customers for a "white glove" service. In other words, build a predictive model for deciding whom to target, when, and with what offer. While this approach seemed natural to many students and professionals, there are two major sticky points:

  1. We cannot properly evaluate the performance of the model in terms of actual business impact without post-intervention data. Without historical data on a similar intervention, we cannot evaluate how the targeted intervention will perform. For instance, while we can predict from a large transactional database who is most likely to purchase dairy products, we cannot tell whether they would redeem a coupon targeted to them unless we have data from a similar past coupon campaign.
  2. We cannot build a predictive model that is optimized for the intervention goal unless we have post-intervention data. For example, if coupon redemption is the intervention performance metric, we cannot build a predictive model optimizing coupon redemption unless we have data on coupon redemption.

A predictive model is trained on past data. Hence, we need some post-intervention data both to build a model that optimizes the intervention goal and to evaluate the model's performance in light of that goal. A pilot study/period is therefore a good way to start: deploy the intervention either to a random sample or to the sample indicated as optimal by a predictive model (it is best to deploy to a sample that includes both a random component and a model-indicated component). Once you have post-intervention data on the intervention results, you can build a predictive model to optimize results of a future, larger-scale intervention.
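Here is a minimal sketch of this pilot logic in Python, assuming pandas, NumPy, and scikit-learn are available; the dataset, the column names (monthly_spend, trips_per_month, redeemed), and the model choice are hypothetical placeholders, not anything from the actual projects:

    # Minimal pilot-design sketch (all data and names are hypothetical)
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier

    rng = np.random.default_rng(0)

    # Pre-intervention (secondary) data: features only, no coupon outcomes yet
    customers = pd.DataFrame({
        "monthly_spend": rng.gamma(shape=2.0, scale=50.0, size=1000),
        "trips_per_month": rng.poisson(lam=4, size=1000),
    })

    # Pilot sample: a random component plus a model/rule-indicated component
    # (a simple "top spenders" rule stands in for a provisional model here)
    random_pick = rng.random(len(customers)) < 0.05
    rule_pick = customers["monthly_spend"].rank(pct=True) > 0.95
    pilot = customers[random_pick | rule_pick].copy()

    # ...send coupons to the pilot group. Only AFTER this intervention do we
    # observe the outcome we actually care about (placeholder values below):
    pilot["redeemed"] = rng.random(len(pilot)) < 0.2

    # Now a model can be trained to optimize redemption itself, and then used
    # to target the future, larger-scale campaign
    model = GradientBoostingClassifier().fit(
        pilot[["monthly_spend", "trips_per_month"]], pilot["redeemed"])
    customers["predicted_redemption"] = model.predict_proba(
        customers[["monthly_spend", "trips_per_month"]])[:, 1]

The random component is what allows an honest evaluation of the targeting rule, since it provides redemption data that is not biased by the rule itself.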

Tuesday, January 15, 2013

What does "business analytics" mean in academia?

But what exactly does this mean? At the recent ISIS conference, I organized and moderated a panel called "Business Analytics and Big Data: How it affects Business School Research and Teaching". The goal was to tackle the ambiguity of the terms "Business Analytics" and "Big Data" in the context of business school research and teaching. I opened with a few points:

  1. Some research b-schools are posting job ads for tenure-track faculty in "Business Analytics" (e.g., University of Maryland; Google "professor business analytics position" for plenty more). What does this mean? What background are these candidates supposed to have, and where are they supposed to publish to get promoted? (in "The Journal of Business Analytics"?)
  2. A recent special issue of the top scholarly journal Management Science was devoted to "Business Analytics". What types of submissions fall under this label? What types do not?
  3. Many new "Business Analytics" programs have been springing up in business schools worldwide. What is new about their offerings? 

[Photo: panelists Anitesh, Ram and David; courtesy of Ravi Bapna]
The panelists were a mix of academics (Prof Anitesh Barua from UT Austin and Prof Ram Chellappa from Emory University) and industry (Dr. David Hardoon, SAS Singapore). The audience was also a mixed crowd: academics, mostly from MIS departments in business schools, and industry experts from companies such as IBM and Deloitte.

The discussion took various twists and turns, with heavy audience participation. Here are several issues that emerged:

  • Is there any meaning to BA in academia, or is it just the use of analytics (= data tools) within a business context? Some industry folks said that BA is meaningful only within a business context, not research-wise.
  • Is BA just a fancier name for statisticians in a business school, or does it convey a different type of statistician? (Similar to the adoption of "operations management" (OM) by many operations research (OR) academics.)
  • The academics on the panel made the point that BA has been changing the flavor of research by adding a discovery/exploratory dimension that does not typically exist in social science and IS research. Rather than only theorizing and then testing with data, researchers now explore the data in further detail using tools such as visualization and micro-level models. The main concern, however, was that it is still very difficult to publish such research in top journals.
  • With respect to "what constitutes a BA research article", Prof. Ravi Bapna said "it's difficult to specify what papers are BA, but it is easy to spot what is not BA".
  • While machine learning and data mining have been around for quite some time, and the methods have not really changed, their application within a business context has become more popular due to friendlier software and stronger computing power. These practices are therefore now an important core of MBA and other business programs.
  • One type of b-school program that seems to lag behind on the BA front is the PhD program. Are we equipping our PhD students with abilities to deal with and take advantage of large datasets for developing theory? Are PhD programs revising their curriculum to include big data technologies and machine learning capabilities as required core courses?
Some participants claimed that BA is just another buzzword that will go away after some time, so we need not worry about defining or demystifying it. After all, software vendors coin such terms, create a buzz, and eventually the buzz moves on. Whether this will be the case with BA or with Big Data remains to be seen. In the meanwhile, we should ponder whether we are really doing something new in our research, and if so, pinpoint what exactly it is and how to formulate it as requirements for a new era of researchers.

Thursday, October 04, 2012

Flipping and virtualizing learning

Adopting new technology for teaching has been one of my passions, and luckily my students have been understanding even during glitches or choices that turned out to be ineffective (such as the mobile/Internet voting technology that I wrote about last year). My goal has been to use technology to make my courses more interactive: I use clickers for in-class polling (to start discussions and assess understanding, not for grading!); last year, after realizing that my students were constantly on Facebook, I finally opened a Facebook account and ran a closed FB group for out-of-class discussions; and in my online courses on statistics.com I created interactive lessons (slides with media, quizzes, etc.) using Udutu.com. On the pedagogical side, I have tried to focus on hands-on learning: team projects instead of exams, in-class presentations, and homework that gets your hands dirty.

But all these were just baby steps, preparing me for the big leap. In the last month, I have been immersed in a complete transformation of one of my on-ground courses, combining a new technology with a recent pedagogical movement. The pedagogical side is called 'flipping the classroom': class time is spent not on one-directional lecturing but on discussions and other interactive activities. The technological leap is the move towards a Massive Open Online Course (MOOC), though in my case it is a "moderate open online course": as a first step, the course will be open only to the community of the Indian School of Business (students, alumni, faculty and staff). The long-term plan is to open it up globally.

The course Business Analytics using Data Mining is opening in less than two weeks. I've been working round-the-clock creating content for the online and on-ground components, figuring out the right technologies that can support all the requirements, and collaborating with colleagues at CrowdANALYTIX and at Hansa Cequity to integrate large local datasets and a platform for running data mining contests into the course.

Here are the ingredients that I found essential:
  • You need strong support from the university! Luckily, ISB is a place that embraces innovation and is willing to evaluate cutting-edge teaching approaches.
  • A platform that is easy for a (somewhat tech-savvy) instructor to design, upload materials to, update, interact with participants on, and, in general, run. If you are a control freak like me, the last thing you want is to have to ask someone else to upload, edit, or change things. After researching many possibilities, I decided to use the Google platform. Not the new Google Course Builder platform (who has time for programming in JavaScript?), but rather a unique combination of Google Sites, Google Drive, Google Groups, YouTube embedding, etc. The key is Google Sites, which is an incredibly versatile tool (and free! thanks Google!). Another advantage of Google Sites is that you have the solid backbone of Google behind you. If your university uses Google Apps for Education, all the better (we hope to move there soon...)
  • It is definitely worthwhile to invest in good video-editing software. This was a painful lesson: after starting with one program that caused grief, I switched to Camtasia Studio and very quickly purchased a license. It is an incredibly powerful yet simple-to-use tool for recording video, screen, and audio, and then editing the recording (cutting out coughs, for instance).
  • Hardware for lecture videos: use a good webcam that also has a good mic. I learned that poor audio quality is the biggest reason people stop watching a video. (Getting the Thimphu street dogs to stop barking is always a challenge.) If you're in a power-outage-prone area, make sure to get a back-up battery (UPS).
  • Have several people go over the course platform to make sure that all the links work, the videos stream, etc. Also, get someone to assist with participants' technical queries. There are always those who need hand-holding.
The way the course will work at ISB is that the ISB community can join the online component (lecture videos, guided reading, online forum, contests). Registered students will also attend on-ground meetings that will focus on discussions, project-based learning, and other interactive activities. 

We opened registration to the community today, and there are already more than 200 registrants. I guess everyone is curious! Whether the transformation will be a huge success or will die out quietly remains to be seen. But for sure, there will be insights and learning for all of us.


Wednesday, September 19, 2012

Self-publishing to the rescue

The new Coursera course by Princeton Professor Mung Chiang was so popular that Amazon and the publisher ran out of copies of the textbook before the course even started (see "new website features" announcement; requires login). I experienced a stockout of my own textbook ("Data Mining for Business Intelligence") a couple of years ago, which caused grief and slight panic to both students and instructors.

With stockouts in mind, and recognizing the difficulty of obtaining textbooks outside of North America (unavailable, too expensive, or long/costly shipping), I decided to take things into my own hands and self-publish a "Practical Analytics" series of textbooks. Currently, the series has three books, all available in soft-cover and Kindle editions. I used CreateSpace.com, an Amazon company, for publishing the soft-cover editions; its print-on-demand model reduces the stockout problem. I used Amazon KDP for publishing the Kindle editions, so definitely no stockouts there. Amazon makes the books available on its global websites, so they are reachable in many places worldwide (India's Flipkart also carries the books). Finally, since I got to set the prices, I made sure to keep them affordable (for example, in India the e-books are even cheaper than in the USA).

How has this endeavor fared? Well, more than 1000 copies have been sold since March 2011. Several instructors have adopted the books for their courses. And from reader emails and ratings on Amazon, it looks like I'm on the right track.

To celebrate the power and joy of self-publishing, as well as accessible and affordable knowledge, I am running a "free e-book promotion" next week, during which two of the e-books will be available for free.

Both promotions will commence a little after midnight, Pacific Standard Time, and will last for 24 hours. To download each of the e-books, just go to the Amazon website during the promotion period and search for the title. You will then be able to download the book for free.

Enjoy, and feel free to share!

Saturday, September 01, 2012

Trees in pivot table terminology

Recently, non-data-mining colleagues have asked me to explain how Classification and Regression Trees work. While a detailed explanation with examples exists in my co-authored textbook Data Mining for Business Intelligence, I found that the following explanation works well with people who are familiar with Excel's Pivot Tables:

[Figure: classification tree for predicting vulnerability to famine]
Suppose the goal is to generate predictions for some variable, numerical or categorical, given a set of predictors. The idea behind trees is to create groups of records with similar profiles in terms of their predictors, and then average the outcome variable of interest within each group to generate a prediction.

Here's an interesting example from the paper Identifying Indicators of Vulnerability to Famine and Chronic Food Insecurity by Yohannes and Webb, showing predictors of vulnerability to famine based on a survey of households. The image shows all the predictors identified by the tree, which appear below each circle. Each predictor appears as a binary variable, and you go right or left depending on its value. It is easiest to start reading from the top, with a household in mind.

Our goal is to generate groups of households with similar profiles, where profiles are the combination of answers to different survey questions. 
Using the language of pivot tables, our predictions will be in the Values field, and we can use the Row (or Column) Labels to break the records down by the predictors. What does the tree do? Here's a "pivot table" description:

  1. Drag the outcome of interest into the Values area
  2. Find the first predictor that best splits the profiles and drag it into the Row Label field*.
  3. Given the first predictor, find the next predictor to further split the profiles, and drag into the Row Label field** .
  4. Given the first two splits, find the next predictor to further split the profiles (could also be one of the earlier variables) and drag into the Row Label field***
  5. Continue this process until some stopping criterion is reached (to avoid over-fitting)
You might imagine the final result as a really crowded pivot table, with multiple predictors in the Row Label fields. This is indeed quite close, except for a few slight differences:

* Each time a predictor is dragged into the Row or Column Labels fields, it is converted into a binary variable, creating only two classes. For example, 
  • Gender would not change (Female/Male)
  • Country could be turned into "India/Other". 
  • noncereal yield was discretized into "Above/Below 4.7".

** After a predictor is dragged, the next predictor is actually dragged only into one of the two splits of the first predictor. In our example, after dragging noncereal yield (Above/Below 4.7), the predictor oxen owned (Above/Below 1.5) only applies to noncereal yield Below 4.7.

*** We also note that a tree can "drag" a predictor more than once into the Row Labels fields. For example, TLU/capita appears twice in the tree, so theoretically in the pivot table we'd drag TLU/capita after oxen owned and again after crop diversity.

So where is the "intelligence" of a tree over an elaborate pivot table? First, it automatically determines which predictor is best to use at each stage. Second, it automatically determines the value on which to split. Third, it knows when to stop, to avoid over-fitting the data. In a pivot table, the user would have to determine which predictors to include, in what order, and which critical values to split on. And finally, this complex behind-the-scenes process is made easily interpretable by the tree chart.
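For readers who want to see the analogy in action, here is a minimal sketch in Python, assuming pandas and scikit-learn are available; the numbers are made up and the column names merely echo the famine example (nothing here comes from the actual paper):

    # Pivot-table view vs. tree view of the same prediction task (toy data)
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    df = pd.DataFrame({
        "noncereal_yield": [3.1, 5.2, 4.9, 2.0, 6.3, 1.5, 4.8, 3.9],
        "oxen_owned":      [1,   2,   0,   1,   3,   0,   2,   1],
        "vulnerable":      [1,   0,   0,   1,   0,   1,   0,   1],
    })

    # "Pivot table": the user picks the split variable and the cutoff;
    # the prediction is the average outcome within each group
    df["yield_above_4.7"] = df["noncereal_yield"] > 4.7
    print(pd.pivot_table(df, values="vulnerable", index="yield_above_4.7"))

    # Tree: the split variables and cutoffs are chosen automatically,
    # and growth stops early (here via max_depth) to avoid over-fitting
    tree = DecisionTreeClassifier(max_depth=2).fit(
        df[["noncereal_yield", "oxen_owned"]], df["vulnerable"])
    print(export_text(tree, feature_names=["noncereal_yield", "oxen_owned"]))

The two printouts convey the same kind of grouping; the difference is who chose the splits: the user in the pivot table, the algorithm in the tree.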

Tuesday, August 07, 2012

The mad rush: Masters in Analytics programs

The recent trend among mainstream business schools is opening a graduate program or a concentration in Business Analytics (BA). Googling "MS Business Analytics" reveals plenty of big players offering such programs.

These programs are intended (aside from making money) to bridge the knowledge gap between the "data or IT team" and the business experts. Graduates should be able to lead analytics teams in companies: identifying opportunities where analytics can add value, understanding pitfalls, figuring out the needed human and technical resources, and, most importantly, communicating analytics to top management. Unlike "marketing analytics" or other domain-specific programs, Business Analytics programs are "tools" oriented.

As a professor of statistics, I feel a combination of excitement and pain. The word Analytics is clearly more attractive than Statistics. But it is also broader, in two senses. First, it combines methods and tools from a wider set of disciplines: statistics, operations research, artificial intelligence, and computer science. Second, although technical skills are required to some degree, the focus is on the big picture and on how the tools fit into the business process. In other words, it's about Business Analytics.

I am excited about the trend of BA programs because they finally force disciplines such as statistics to consider the big picture and to fit in, both in research and in teaching. Research is clearly better when guided by real problems, and the top research journals are beginning to catch up: Management Science has an upcoming special issue on Business Analytics. As for teaching, it is exciting to teach students who are thirsty for analytics. The challenge is for instructors with PhDs in statistics, operations, computer science, or other disciplines to repackage their technical knowledge into a communicable, interesting, and useful curriculum. Formulas and algorithms, as beautiful as they might appear to us, are tolerated only when their beauty is clearly translated into meaningful and useful knowledge. Considering the business context requires a good deal of attention, and often requires modifying our own modus operandi (we've all been brainwashed by our research discipline).

But then there's the painful part: the missed opportunity for statisticians to participate as major players (or is it envy?). The statistics community seems to keep going through the same cycle of "hey, how did we get left behind?" It happened with data mining, and it is now happening with data analytics. The great majority of statistics programs consistently fail to lead in the world outside statistics. Examining the current BA trend, I see that

  1. Statisticians are typically not the leaders of these programs. 
  2. Business schools that lack statistics faculty (and that's typical) either hire non-research statisticians as adjunct faculty to teach statistics and data mining courses, or have these courses taught by faculty from other areas such as information systems and operations.
  3. "Data Analytics" or "Analytics" degrees are still not offered by mainstream Statistics departments. For example, North Carolina State U has an Institute for Advanced Analytics that offers an MS in Analytics degree. Yet, this does not appear to be linked to the Statistics Department's programs. Carnegie Mellon's Heinz Business College offers a Master degree with concentration in BI and BA, yet the Statistics department offers a Masters in Statistical Practice.
My greatest hope is that a new type of "analytics" research faculty member evolves. The new breed, while having deep knowledge in one field, will also possess more diverse knowledge of, and openness to, other analytics fields (statistical modeling, data mining, operations research methods, computing, human-computer visualization principles). At the same time, for analytics research to flourish, the new-breed academic must have a foot in a particular domain, any domain, be it the social sciences, humanities, engineering, life sciences, or another. I can only imagine the exciting collaboration among such groups of academics, as well as the value they would bring to research, teaching, and knowledge dissemination to other fields.