Wednesday, September 19, 2012

Self-publishing to the rescue

The new Coursera course by Princeton Professor Mung Chiang was so popular that Amazon and the publisher ran out of copies of the textbook before the course even started (see the "new website features" announcement; requires login). I experienced a stockout of my own textbook ("Data Mining for Business Intelligence") a couple of years ago, which caused grief and slight panic among both students and instructors.

With stockouts in mind, and recognizing how hard it can be to obtain textbooks outside of North America (unavailable, too expensive, or long/costly shipping), I decided to take things into my own hands and self-publish a "Practical Analytics" series of textbooks. Currently, the series has three books, all available in soft-cover and Kindle editions. I used CreateSpace.com, an Amazon company, to publish the soft-cover editions; its print-on-demand model all but eliminates the stockout problem. I used Amazon KDP to publish the Kindle editions, so there are definitely no stockouts there. Amazon lists the books on its global websites, making them reachable in many places worldwide (they are also available on India's Flipkart). Finally, since I got to set the prices, I made sure to keep them affordable (for example, in India the e-books are even cheaper than in the USA).

How has this endeavor fared? Well, more than 1,000 copies have been sold since March 2011. Several instructors have adopted the books for their courses. And judging from reader emails and ratings on Amazon, it looks like I'm on the right track.

To celebrate the power and joy of self-publishing as well as accessible and affordable knowledge, I am running a "free e-book promotion" next week. The following e-books will be available for free:

Both promotions will commence a little after midnight, Pacific Standard Time, and will last for 24 hours. To download each e-book, just go to the Amazon website during the promotion period, search for the title, and download the book at no charge.

Enjoy, and feel free to share!

Saturday, September 01, 2012

Trees in pivot table terminology

Recently, several non-data-mining colleagues have asked me to explain how Classification and Regression Trees work. While a detailed explanation with examples exists in my co-authored textbook Data Mining for Business Intelligence, I found that the following explanation works well with people who are familiar with Excel's Pivot Tables:

[Figure: classification tree for predicting vulnerability to famine]
Suppose the goal is to generate predictions for some variable, numerical or categorical, given a set of predictors. The idea behind trees is to create groups of records with similar profiles in terms of their predictors, and then average the outcome variable of interest to generate a prediction.
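
In code, the "group records, then average the outcome" idea is a one-liner. Here is a minimal sketch in Python with pandas (the data and column names are fabricated for illustration):

```python
import pandas as pd

# Fabricated survey data: two binary predictors and a numerical outcome
df = pd.DataFrame({
    'question_1': [0, 0, 1, 1, 0, 1],
    'question_2': [0, 1, 0, 1, 1, 0],
    'outcome':    [0.9, 0.4, 0.7, 0.2, 0.5, 0.6],
})

# Group records with identical predictor profiles and average the outcome;
# these group averages serve as the predictions
predictions = df.groupby(['question_1', 'question_2'])['outcome'].mean()
print(predictions)
```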

Here's an interesting example from the paper "Identifying Indicators of Vulnerability to Famine and Chronic Food Insecurity" by Yohannes and Webb, showing predictors of vulnerability to famine based on a survey of households. The image shows all the predictors identified by the tree; each appears below a circle. Each predictor is a binary variable, and you go right or left depending on its value. It is easiest to start reading from the top, with a particular household in mind.

Our goal is to generate groups of households with similar profiles, where a profile is a combination of answers to different survey questions. Using the language of pivot tables, our predictions will be in the Values field, and we can use the Row (or Column) Labels to break down the records by predictor. What does the tree do? Here's a "pivot table" description:

  1. Drag the outcome of interest into the Values area.
  2. Find the predictor that best splits the records and drag it into the Row Labels field*.
  3. Given the first split, find the next predictor that further splits the records, and drag it into the Row Labels field**.
  4. Given the first two splits, find the next predictor that further splits the records (it could also be one of the earlier variables) and drag it into the Row Labels field***.
  5. Continue this process until some stopping criterion, designed to avoid over-fitting, is reached (see the code sketch after this list).
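
For readers who prefer code to drag-and-drop, here is a bare-bones sketch of this greedy process in Python. It is a simplification under assumed conventions (numerical predictors in a NumPy array X, a numerical outcome y, squared error as the splitting criterion, and a minimum group size as the stopping rule); real implementations are considerably more refined:

```python
import numpy as np

def best_split(X, y):
    """Find the (column, threshold) pair that most reduces the sum of
    squared errors of the outcome y (regression-tree flavor)."""
    best = None
    for col in range(X.shape[1]):
        for threshold in np.unique(X[:, col]):
            left = X[:, col] <= threshold
            if left.all() or not left.any():
                continue  # a useful split leaves records on both sides
            sse = (((y[left] - y[left].mean()) ** 2).sum() +
                   ((y[~left] - y[~left].mean()) ** 2).sum())
            if best is None or sse < best[0]:
                best = (sse, col, threshold)
    return None if best is None else best[1:]

def grow_tree(X, y, min_records=10):
    """Steps 2-5: split greedily, recurse into each branch, and stop
    when a group gets too small (a crude anti-over-fitting rule)."""
    split = best_split(X, y)
    if len(y) < min_records or split is None:
        return y.mean()  # leaf: the prediction is the group's average
    col, threshold = split
    left = X[:, col] <= threshold
    return {'predictor': col, 'threshold': threshold,
            'left': grow_tree(X[left], y[left], min_records),
            'right': grow_tree(X[~left], y[~left], min_records)}
```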
You might imagine the final result as a really crowded Pivot Table, with multiple predictors in the Row Labels field. This is indeed quite close, except for three slight differences:

* Each time a predictor is dragged into the Row or Column Labels field, it is converted into a binary variable, creating only two classes (a one-line code equivalent appears after this list). For example,
  • Gender would not change (Female/Male).
  • Country could be turned into "India/Other".
  • Noncereal yield was discretized into "Above/Below 4.7".
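
As a code aside, the binarization itself is a single comparison. A sketch, using a hypothetical pandas column for the noncereal yield predictor:

```python
import pandas as pd

# Hypothetical data; 4.7 is the cutoff the tree chose in the paper
df = pd.DataFrame({'noncereal_yield': [3.2, 5.1, 4.9, 1.8]})
df['noncereal_above_4.7'] = df['noncereal_yield'] > 4.7
print(df)
```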

** After a predictor is dragged, the next predictor is actually dragged into only one of the two branches created by the first predictor. In our example, after dragging noncereal yield (Above/Below 4.7), the predictor oxen owned (Above/Below 1.5) applies only to households with noncereal yield Below 4.7.

*** We also note that a tree can "drag" a predictor into the Row Labels field more than once. For example, TLU/capita appears twice in the tree, so theoretically in the pivot table we'd drag TLU/capita after oxen owned and again after crop diversity.

So where is the "intelligence" of a tree over an elaborate pivot table? First, it automatically determines which predictor is the best one to use at each stage. Second, it automatically determines the value on which to split. Third, it knows when to stop, to avoid over-fitting the data. In a pivot table, the user would have to determine which predictors to include, their order, and the critical values to split on. And finally, the result of this complex behind-the-scenes process is easily communicated and interpreted through a tree chart.
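
In practice, you rarely code this logic yourself. As a closing sketch, here is how a library such as scikit-learn wraps all three decisions (best predictor, best split value, and when to stop) behind a single call; the data below is fabricated for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Fabricated household data: the columns might be, say,
# noncereal yield and oxen owned
X = np.array([[3.1, 2.0], [5.2, 1.0], [4.9, 0.0], [1.8, 3.0]])
y = np.array([1, 0, 0, 1])  # 1 = vulnerable to famine

# The library picks the predictors and split values automatically;
# max_depth is one way to tell it when to stop (to limit over-fitting)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[4.0, 1.0]]))
```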