Friday, April 20, 2007

Statistics are not always to blame!

My current MBA student Brenda Martineau showed me a March 15, 2007 article in the Wall Street Journal entitled Stupid Cancer Statistics. At first glance it looks like yet another case of someone abusing statistics -- but wait! A closer look reveals that the real culprit is not the "mathematical models", but rather the variable that is being measured and analyzed!

According to the article, the main fault is in measuring (and modeling) mortality rate in order to determine the usefulness of early cancer screening. Patients who get diagnosed early (before the cancer escapes the lung) do not necessarily live longer than those who do not get diagnosed, but their quality of life is much improved. Therefore, the author explains, the real measure should be quality of life. If I understand this correctly, this really has nothing to do with "faulty statistics", but rather with the choice of measurement to analyze!

In short, blaming the statistical models may be a popular habit, but it isn't always justified...

Classification Trees: CART vs. CHAID

When it comes to classification trees, three major algorithms are used in practice: CART ("Classification and Regression Trees"), C4.5, and CHAID.

All three algorithms create classification rules by constructing a tree-like structure of the data. However, they are different in a few important ways.

The main difference is in the tree construction process. To avoid over-fitting the data, all methods try to limit the size of the resulting tree. CHAID (and variants of CHAID) achieve this by using a statistical stopping rule that discontinues tree growth. In contrast, both CART and C4.5 first grow the full tree and then prune it back. In CART, the pruning is done by examining the performance of the tree on a holdout dataset and comparing it to the performance on the training set; the tree is pruned until the performance is similar on both datasets (thereby indicating that there is no over-fitting of the training set). This highlights another difference between the methods: CHAID and C4.5 use a single dataset to arrive at the final tree, whereas CART uses a training set to build the tree and a holdout set to prune it.
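The grow-then-prune idea can be sketched with scikit-learn's decision trees (not mentioned in the post, and its cost-complexity pruning path is a stand-in for the holdout-based pruning described above -- the spirit is the same: grow the full tree, then select the pruned subtree that performs best on held-out data):

```python
# Sketch: grow a full CART-style tree, then prune it by checking
# candidate pruned subtrees against a holdout set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# 1. Grow the full tree on the training set only.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 2. Compute the cost-complexity pruning path, then keep the pruned
#    tree whose accuracy on the holdout set is best.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    score = pruned.score(X_hold, y_hold)
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
pruned_tree.fit(X_train, y_train)
print(full_tree.get_n_leaves(), "leaves before pruning,",
      pruned_tree.get_n_leaves(), "after")
```

The pruned tree is never larger than the full tree, and by construction it is the candidate that generalizes best to the holdout data.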

A difference between CART and the other two is that the CART splitting rule allows only binary splits (e.g., "if Income < $50K then X, else Y"), whereas C4.5 and CHAID allow multiple splits. With the latter, trees sometimes look more like bushes. CHAID has been especially popular in marketing research, in the context of market segmentation. In other areas, CART and C4.5 tend to be more popular.

One important difference that came to my mind is in the goal that CHAID is most useful for, compared to the goal of CART. To clarify my point, let me first explain the CHAID mechanism in a bit more detail. At each split, the algorithm looks for the predictor variable that, if split, most "explains" the categorical response variable. In order to decide whether to create a particular split based on this variable, the CHAID algorithm tests the hypothesis of dependence between the split variable and the categorical response (using the chi-squared test for independence). Using a pre-specified significance level, if the test shows that the split variable and the response are independent, the algorithm stops the tree growth; otherwise the split is created, and the next best split is searched for. In contrast, the CART algorithm decides on a split based on the amount of within-class homogeneity that is achieved by the split, and the split is later reconsidered based on considerations of over-fitting.

Now I get to my point: it appears to me that CHAID is most useful for analysis, whereas CART is more suitable for prediction. In other words, CHAID should be used when the goal is to describe or understand the relationship between a response variable and a set of explanatory variables, whereas CART is better suited for creating a model that has high prediction accuracy for new cases.
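CHAID's stopping rule boils down to an ordinary chi-squared test of independence. Here is a minimal sketch using scipy (the function name, significance level, and the little contingency table are my own illustration, not from any CHAID implementation):

```python
# Sketch of CHAID's stopping rule: split on a candidate predictor only
# if the chi-squared test rejects independence between that predictor
# and the categorical response at a pre-specified significance level.
from scipy.stats import chi2_contingency

ALPHA = 0.05  # pre-specified significance level

def should_split(contingency_table):
    """True if the test rejects independence between the candidate
    split variable (rows) and the categorical response (columns)."""
    _, p_value, _, _ = chi2_contingency(contingency_table)
    return p_value < ALPHA

# Rows: categories of a candidate predictor (e.g., three income bands);
# columns: counts of a categorical response (e.g., buy / no-buy).
table = [[30, 10],
         [12, 28],
         [ 8, 32]]
print("split" if should_split(table) else "stop growing")
```

With this table the response clearly depends on the predictor, so the test rejects independence and the split is created; a table with identical rows would instead stop the growth.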

In the book Statistics Methods and Applications by Hill and Lewicki, the authors mention another related difference, stemming from CART's binary splits vs. CHAID's multiple-category splits: "CHAID often yields many terminal nodes connected to a single branch, which can be conveniently summarized in a simple two-way table with multiple categories for each variable of dimension of the table. This type of display matches well the requirements for research on market segmentation... CART will always yield binary trees, which sometimes can not be summarized as efficiently for interpretation and/or presentation". In other words, if the goal is explanatory, CHAID is better suited for the task.

There are additional differences between the algorithms, which I will not mention here. Some can be found in the excellent Statistics Methods and Applications by Hill and Lewicki.

Tuesday, April 10, 2007

Another Treemap in NYT!

While we're at it, this Saturday's Business section of the New York Times featured the article Sifting Data to Uncover Travel Deals. One of the websites mentioned (PointMaven.com) actually uses a Treemap to display hotel points promotions.



OK -- full disclosure: this is my husband's website and yes, I was involved... But hey -- that's the whole point of having an in-house statistician!

Monday, April 02, 2007

Visualizing hierarchical data

Today much data is gathered from the web. Data from websites often tend to be hierarchical in nature: for example, on Amazon we have categories (music, books, etc.), then within a category there are sub-categories (e.g., within Books: Business & Technology, Children's books, etc.), and sometimes there are even additional layers. Other examples are eBay, epinions, and almost any e-tailer. Even travel sites usually include some level of hierarchy.

Standard plots and graphs such as bar charts, histograms, and boxplots might be useful for visualizing a particular level of the hierarchy, but not the "big picture". The method of trellising, where a particular graph is "broken down" by one or more variables, helps, but you still do not directly see the hierarchy.

An ingenious method for visualizing hierarchical data is the Treemap, designed by Professor Ben Shneiderman from the Human-Computer Interaction Lab at the University of Maryland. The treemap is basically a rectangular region broken down into sub-rectangles (and then possibly into further sub-sub-rectangles), where each smallest rectangle represents the unit of interest. Color and/or size can then be used to describe measures of interest.
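The subdivision idea is simple enough to sketch in a few lines of plain Python. This is a simplified "slice-and-dice" layout in the spirit of the original treemap (the function name and the example sizes are my own illustration): each item gets a strip of the enclosing rectangle proportional to its size, and recursing into sub-rectangles with the direction flipped would handle deeper levels of the hierarchy.

```python
# A minimal slice-and-dice treemap layout: cut the enclosing rectangle
# into strips proportional to each item's size.  Alternating the
# `vertical` flag at each level of recursion handles the hierarchy.
def slice_and_dice(sizes, x, y, w, h, vertical=True):
    """Return one (x, y, width, height) rectangle per size,
    tiling the rectangle (x, y, w, h) with no gaps."""
    total = float(sum(sizes))
    rects, offset = [], 0.0
    for s in sizes:
        frac = s / total
        if vertical:   # cut vertical strips along the x-axis
            rects.append((x + offset, y, w * frac, h))
            offset += w * frac
        else:          # cut horizontal strips along the y-axis
            rects.append((x, y + offset, w, h * frac))
            offset += h * frac
    return rects

# Example: four files of sizes 50, 30, 15, 5 inside a 100 x 60 window.
rects = slice_and_dice([50, 30, 15, 5], 0, 0, 100, 60)
for r in rects:
    print(r)
```

Each rectangle's area is proportional to its item's size, which is exactly the "size represents a measure of interest" property described above; color would be added on top of this layout.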

The treemap's original goal was to visualize one's hard drive (with all its directories and sub-directories) for detecting phenomena such as duplication. There, a file was a single entity, and its size, for instance, could be represented by the rectangle's size. Since its development in the 1990s the treemap has spread widely across almost every possible discipline. Probably the most popular application is SmartMoney's Map of the Market, where you can visualize the current state of the entire stock market. The strength of the treemap lies both in its ability to include multiple levels of hierarchy (you can drill in and out of different levels) and in its interactive nature, where users can choose to manipulate color, size, and order to represent measures of interest.

Microsoft Research posts a free Excel add-on called Treemapper, but after trying it out I think it is too limited: it allows only one level of hierarchy, has no interactivity, and accepts only numerical information.

Last month the business section of the New York Times featured an article on DaimlerChrysler, This Time, No Roadside Assistance, which included a neat Treemap. Since it is no longer available online (the NYT does not include graphics in its archives...) here it is -- courtesy of Amanda Cox of the NYT, known as their "statistics wiz".


You can find many more neat examples of using Treemap on the HCIL website.