Monday, May 13, 2013

An Appeal to Companies: Leave the Data Behind @kaggle @crowdanalytix_q

A while ago I wrote about the wonderful new age of real-data-made-available-for-academic-use through the growing number of data mining contests on platforms such as Kaggle and CrowdANALYTIX. Such data provide excellent examples for courses on data mining that help train the next generation of data scientists, business analysts, and other data-savvy graduates. 

A couple of years later, I am discovering the painful truth that many of these datasets are no longer available, most likely because the companies that shared the data have pulled them out. This unexpected twist is extremely harmful to both academia and industry: instructors who design and teach data mining courses build course plans, assignments, and projects on the assumption that the data will remain available. And now, we have to revise all our materials! What a waste of time, resources, and energy. Obviously, after the first time this happens, I think twice about whether to use contest data for teaching.

I will not try to convince you of the waste of faculty time, the loss to students, or even the loss to self-learners who are now all over the globe. I'll wear my b-school professor hat and speak the ROI language: the companies lose. In the short term, all these students will not be competing in their contests. The bigger loss, however, is to the entire business sector in the longer term: I constantly receive urgent requests from businesses and large corporations for graduates trained in business analytics. "We are in desperate need of data-savvy managers! Can you help us find good data scientists?" Yet some of these same companies are pulling their data out of the hands of instructors and students.

I am perplexed as to why a company would pull their data away after the data was publicly available for a long time. It's not the fear of competitors, since the data were already publicly available. It's probably not the contest platform CEOs - they'd love the traffic. Then why? I smell lawyers...

One example is the beautiful Heritage Health contest that ran on Kaggle for 2 years. Now, there is no way to get the data, even if you were a registered contestant in the past (see screenshots). Yet, the healthcare sector has been begging for help in training "healthcare analytics" professionals. 


Data access is no longer available, although it was for 2 years.

I'd like to send out an appeal to Heritage Provider Network and other companies that have shared invaluable data through a publicly-available contest platform. Please consider making the data publicly available forever. We will then be able to integrate it into data analytics courses, textbooks, research, and projects, thereby providing industry with not only insights, but also properly trained students who've learned using real data and real problems.

This time, industry can contribute in a major way to reducing the academia-industry gap. Even if only for their own good.

Tuesday, April 30, 2013

Collaborations of LaTeX and Word users

The two popular writing tools used by researchers in academia are LaTeX and Microsoft Word. Or, put differently: Microsoft Word and LaTeX. In more technical fields, LaTeX is king, while in less technical fields, it is Word. In the business school, the two worlds collide. Coming from a technical background, I am a heavy user of LaTeX, for research papers and even for book writing. However, many of my business school collaborators (e.g., from the fields of Information Systems and Marketing) are Word users.

While collaboration platforms such as Google Drive and Dropbox have greatly enhanced collaborative possibilities, including co-editing a document, the Word-or-LaTeX schism still poses a serious challenge. I've had to migrate to Word (and suffer) in some collaborations. In others, I convinced my co-authors to move the document to LaTeX, but then I was the one receiving bits of text to incorporate back into the document and sharing the compiled PDF (easy to do via Dropbox).

For someone used to LaTeX, Word is quite awkward: handling document components such as the bibliography and sectioning is cumbersome, whereas in LaTeX journal templates are easier to use and writing formulas is much easier. For Word users, LaTeX usually seems intimidating, as it is not WYSIWYG (you must click a button to compile the text and then see the resulting PDF in a separate PDF viewer).

One solution is the open-source LyX package, which has a graphical interface with LaTeX "under the hood". I personally found it unsatisfactory, as it is "neither here nor there"...

So what to do if you're a LaTeX junkie and want to move a project from Word into LaTeX? Let's start with the initial migration. Here are a few useful tools that I discovered:
  • Tables: To convert a table from Word into a LaTeX table, copy-paste it into Excel and then use the Excel2Latex tool. Simply download the xla file and open it in Excel. It will add an add-in menu. Select the table, click the convert button, and then choose either to copy the LaTeX code or to export it to a .tex file.
  • Bibliography: To convert a Word Bibliography file into a LaTeX bib file, use the neat Word2Bibtex tool. Download the bibtex.xsl file and follow the directions. Note: the tool will only work if you have administrator privileges on the computer, as it requires copying a file into an "admin only" folder.
  • Figures: Unlike Word, where you copy-paste images, in LaTeX you'll need them as separate image files (png, jpg, eps, etc.). If you only have a few, right-click each figure in Word, then "Save as image" and choose png or jpg. If you have a whole bunch of figures in the Word doc, save the doc as "filtered HTML". This will create a separate folder with all the image files (if they are saved as gif, you'll have to convert them to png or jpg).
Now, to the co-editing of the tex file. I have still not completely resolved the problem of non-LaTeX collaborators editing the file. I always get the question: "Can you send me a Word version so that I can edit it?" Here are some options:
  • They can annotate the PDF file using highlighting and sticky notes.
  • They can copy-paste from the PDF file into Word. Figures can be copied using Acrobat Reader's Edit > Take a Snapshot, but they are usually not needed for editing.
  • They can open the .tex file with WordPad for editing the text.
  • It's also possible to convert back to Word: the latex2rtf tool should do that (in my case, it actually clashed with my TeXstudio editor and erased my tex file!)
But then, even if you do convert to Word, what do you do with the Word file once the collaborator has finished editing?
Another solution is to use a cloud LaTeX platform, such as ShareLaTex.com. The advantage is that there's no need to install software, and the editor and viewer are nicely set side-by-side with a big green "recompile" button. The free version allows collaborating with one free user. The paid versions are more generous (I like the "coming soon" integrations with Google Drive and Dropbox!).
  • Catch #1: you must be online to compile. 
  • Catch #2: long documents such as books can take substantially longer to compile online compared to locally. 
  • Catch #3: if the non-LaTeX collaborator uses some tex-unfriendly text (such as a $ sign to denote USD), the compilation will fail. So, basic tex knowledge is needed - or babysitting by the LaTeX-head collaborator.
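One way to soften Catch #3 is to run a collaborator's plain text through a small escaping step before pasting it into the tex file. Here is a minimal Python sketch. The function name escape_latex and the character table are my own illustration, not part of any tool mentioned above, and it assumes the input is plain prose that contains no LaTeX commands of its own (otherwise it would escape those too).

```python
# Hypothetical helper (not from any tool mentioned in this post):
# escape characters that LaTeX treats specially in plain prose.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text: str) -> str:
    """Escape LaTeX special characters one character at a time.

    Mapping each input character exactly once avoids double-escaping
    (e.g., the braces added for a backslash are not escaped again).
    """
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # A collaborator's sentence using $ for USD and other risky characters.
    print(escape_latex("Revenue grew 5% to $3M (R&D_budget #1)"))
```

The escaped text can then be pasted into the document without breaking the compilation. Text that is meant to contain math or LaTeX commands still needs a human eye.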
I would love to hear from others who are tackling these collaboration issues and have found good solutions.

Saturday, April 27, 2013

New short guide: "To Publish or To Self-Publish My Textbook?"

My self-publishing endeavors have led to a growing number of conversations with colleagues, friends, colleagues-of-friends and other permutations who've asked me to share my experiences. Finally, I decided to write down a short guide, which is now available as a Kindle eBook.

To Publish or To Self-Publish My Textbook? Notes from a Published and Self-Published Author gives a glimpse into the expectations, challenges, rewards, and surprises that an author experiences when publishing and/or self-publishing a textbook. This is not a guide on self-publishing, but rather notes about the process of publishing a textbook with a big publisher vs. self-publishing and what to expect.

To celebrate the launch, the eBook is FREE for 72 hours. After the promotion it will still be cheaper than a cappuccino.

You can read the book (and any other Kindle book) on many devices -- no need for a Kindle device. You can use the Kindle Cloud Reader for online reading, or else download the free Kindle reading app for PC, iPad, Android, etc.


Wednesday, April 03, 2013

Analytics magazines: Please lead the way for effective data presentation

Professional "analytics" associations such as INFORMS, the American Statistical Association, and the Royal Statistical Society have been launching new magazines intended for broader, non-academic audiences that are involved or interested in data analytics. Several of these magazines are aesthetically beautiful, with plenty of interesting articles about applications of data analysis and their impact on daily life, society, and more. Significance magazine and Analytics magazine are two examples.

The next step is for these magazines to implement what we preach regarding data presentation: use effective visualizations. In particular, the online versions can include interactive dashboards! If the New York Times and Washington Post can have interactive dashboards on their websites, so can magazines of statistics and operations research societies.

For example, the OR/MS Today magazine reports the results of an annual "statistical software survey" in the form of multi-page tables in the hardcopy and PDF versions of the magazine. These tables are not user friendly in the sense that it is difficult to explore and compare the products and tools. Surprisingly, the online implementation is even worse: a bunch of HTML pages, each with one static table.
Presenting the survey results in multi-page tables is not the most user-friendly approach (from the Feb 2013 issue of OR/MS Today magazine)
To illustrate the point, I have converted the 2013 Statistical Software Survey results into an interactive dashboard. The user can examine and compare particular products or tools of interest using filters, sort the products by different attributes, and get a quick idea about pricing. Maybe not the most fascinating data, especially given the many missing values, yet I hope the dashboard is more effective and engaging.

Interactive dashboard. Click on the image to go to the dashboard


Tuesday, January 22, 2013

Business analytics student projects a valuable ground for industry-academia ties

Since October 2012, I have taught multiple courses on data mining and on forecasting. Teams of students worked on projects spanning various industries, from retail to eCommerce to telecom. Each project presents a business problem or opportunity that is translated into a data mining or forecasting problem. Using real data, the team then executes the analytics solution, evaluates it and presents recommendations. A select set of project reports and presentations is available on my website (search for 2012 Nov and 2012 Dec projects).

For projects this year, we used three datasets from regional sources (thanks to our industry partners Hansa Cequity and TheBargain.in). One is a huge dataset from an Indian retail chain of hypermarkets. Another is data on electronic gadgets sold on online shopping sites in India. A third is a large survey on mobile usage conducted in India. These datasets were also used in several data mining contests that we set up during the course through CrowdANALYTIX.com and through Kaggle.com. The contests were open to the public, and indeed submissions came from around the world.

Business analytics courses are an excellent ground for industry-academia partnerships. Unlike one-way interactions such as guest lectures from industry, internships, or student site visits, a business analytics project conducted by student teams (with faculty guidance) creates value for both the industry partner who shares the data and the students. Students who have gained a basic understanding of data analytics can be creative about new uses that companies have not considered (this can be achieved through "ideation contests"). Companies can also use this ground for piloting or testing the use of their data for addressing goals of interest, with little investment. Students get first-hand experience with regional data and problems, and can showcase their project as they interview for positions that require such expertise.

So what is the catch? Building a strong relationship requires good, open-minded industry partners and a faculty member who can lead such efforts. It is a new role for most faculty teaching traditional statistics or data mining courses. Managing data confidentiality, creating data mining contests, and initiating and maintaining open communication channels with all stakeholders are nontrivial tasks, but they are well worth the effort.


Wednesday, January 16, 2013

Predictive modeling and interventions (why you need post-intervention data)

In the last few months I've been involved in nearly 20 data mining projects done by student teams at ISB, as part of the MBA-level course and an executive education program. All projects relied on real data. One of the data sources was transactional data from a large regional hypermarket. While the topics of the projects ranged across a large spectrum of business goals and opportunities for retail, one point in particular struck me as repeating across many projects and in many face-to-face discussions: the use of secondary data (data that were already collected for some other purpose) for making decisions and deriving insights regarding future interventions.

By intervention I mean any action. In a marketing context, we can think of personalized coupons, advertising, customer care, etc.

In particular, many teams defined a data mining problem that would help them determine appropriate target marketing. For example, predict whether the next shopping trip of a customer will include dairy products and then use this for offering appropriate promotions. Another example: predict whether a relatively new customer will be a high-value customer at the end of a year (as defined by some metric related to the customer's spending or shopping behavior), and use it to target the customer for a "white glove" service. In other words, building a predictive model for deciding who, when and what to offer. While this approach seemed natural to many students and professionals, there are two major sticking points:

  1. We cannot properly evaluate the performance of the model in terms of actual business impact without post-intervention data. The reason is that without historical data on a similar intervention, we cannot evaluate how the targeted intervention will perform. For instance, while we can predict who is most likely to purchase dairy products from a large existing transactional database, we cannot tell whether they would redeem a coupon that is targeted to them unless we have some data from after a similar coupon campaign.
  2. We cannot build a predictive model that is optimized for the intervention goal unless we have post-intervention data. For example, if coupon redemption is the intervention performance metric, we cannot build a predictive model optimizing coupon redemption unless we have data on coupon redemption.

A predictive model is trained on past data. We therefore need some post-intervention data both to build a model that aims at optimizing the intervention goal and to evaluate model performance in light of that goal. A pilot study/period is a good way to start: deploy the intervention either randomly or to the sample that a predictive model indicates to be optimal in some way (it is best to do both: deploy to a sample that includes both a random choice and a model-indicated choice). Once you have the post-intervention data on the intervention results, you can build a predictive model to optimize results for a future, larger-scale intervention.
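To make the "do both" pilot design concrete, here is a minimal Python sketch of choosing a pilot sample that combines a model-indicated group with a random group. The function name choose_pilot_sample, the customer ids, and the scores are all hypothetical, and drawing the random group from the customers not already picked by the model is just one possible design choice (an assumption of this sketch, not something the projects above did).

```python
import random


def choose_pilot_sample(scores, n_model, n_random, seed=0):
    """Pick pilot customers: the top-scored ones plus a random draw from the rest.

    scores: dict mapping customer id -> predicted score
            (hypothetical output of some existing predictive model).
    Returns (model_group, random_group) as two lists of customer ids.
    """
    # Rank customers from highest to lowest predicted score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    model_group = ranked[:n_model]

    # Assumption: the random group is drawn only from customers
    # not already selected by the model, so the groups do not overlap.
    remaining = ranked[n_model:]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    random_group = rng.sample(remaining, min(n_random, len(remaining)))
    return model_group, random_group


if __name__ == "__main__":
    # Made-up scores for illustration only.
    scores = {"c1": 0.91, "c2": 0.15, "c3": 0.78, "c4": 0.40, "c5": 0.05, "c6": 0.66}
    model_group, random_group = choose_pilot_sample(scores, n_model=2, n_random=2)
    print("model-indicated:", model_group)
    print("random:", random_group)
```

After the pilot, the outcomes recorded for both groups (e.g., coupon redemption) are the post-intervention data that can be used to train and evaluate a model for the larger-scale intervention.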

Tuesday, January 15, 2013

What does "business analytics" mean in academia?

But what exactly does this mean?
At the recent ISIS conference, I organized and moderated a panel called "Business Analytics and Big Data: How it affects Business School Research and Teaching". The goal was to tackle the ambiguity in the terms "Business Analytics" and "Big Data" in the context of business school research and teaching. I opened with a few points:

  1. Some research b-schools are posting job ads for tenure-track faculty in "Business Analytics" (e.g., University of Maryland; Google "professor business analytics position" for plenty more). What does this mean? What is supposed to be the background of these candidates and where are they supposed to publish to get promoted? ("The Journal of Business Analytics"?)
  2. A recent special issue of the top scholarly journal Management Science was devoted to "Business Analytics". What types of submissions fall under this label? What types do not?
  3. Many new "Business Analytics" programs have been springing up in business schools worldwide. What is new about their offerings? 

Panelists Anitesh, Ram and David - photo courtesy of Ravi Bapna
The panelists were a mix of academics (Prof Anitesh Barua from UT Austin and Prof Ram Chellapah from Emory University) and industry (Dr. David Hardoon, SAS Singapore). The audience was also a mixed crowd of academics, mostly from MIS departments (in business schools), and industry experts from companies such as IBM and Deloitte.

The discussion took various twists and turns with heavy audience discussion. Here are several issues that emerged from the discussion:

  • Is there any meaning to BA in academia or is it just the use of analytics (=data tools) within a business context? Some industry folks said that BA is only meaningful within a business context, not research-wise.
  • Is BA just a fancier name for statisticians in a business school or does it convey a different type of statistician? (similar to the adoption of "operations management" (OM) by many operations research (OR) academics)
  • The academics on the panel made the point that BA has been changing the flavor of research in terms of adding a discovery/exploratory dimension that does not typically exist in social science and IS research. Rather than only theorize-then-test-with-data, data are now explored in further detail using tools such as visualization and micro-level models. The main concern, however, was that it is still very difficult to publish such research in top journals.
  • With respect to "what constitutes a BA research article", Prof. Ravi Bapna said "it's difficult to specify what papers are BA, but it is easy to spot what is not BA".
  • While machine learning and data mining have been around for quite some time, and the methods have not really changed, the application of both within a business context has become more popular due to friendlier software and stronger computing power. These new practices are therefore now an important core in MBA and other business programs.
  • One type of b-school program that seems to lag behind on the BA front is the PhD program. Are we equipping our PhD students with abilities to deal with and take advantage of large datasets for developing theory? Are PhD programs revising their curriculum to include big data technologies and machine learning capabilities as required core courses?
Some participants claimed that BA is just another buzzword that will go away after some time, so we need not worry about defining it or demystifying it. After all, the software vendors coin such terms, create a buzz, and finally the buzz moves on. Whether this is the case with BA or with Big Data is yet to be seen. In the meantime, we should ponder whether we are really doing something new in our research, and if so, pinpoint what exactly it is and how to formulate it as requirements for a new era of researchers.