As the Fall semester came to a close, another cohort of 36 MBAs completed the data mining course at the Smith School of Business. Students worked throughout the semester on real business problems with real data, from data collection, through exploration, to modeling.

Projects ranged from the more socially aware (profiling the medically-insured vs. uninsured in the USA; reverse-engineering student loan deferment algorithms) to the more $-aware (customer retention at an online fitness company; drivers of dividend decreases). A few projects were on real estate and travel (determinants of flight delays from Washington to Honolulu; factors leading to a quick sale of condos in Arlington) and one on healthcare (predicting delays in the operating room), which won the class vote for "best project".

To see short presentations and reports on these projects, please see the course webpage.

Finally, I am happy to announce that the course's new official name is "Data Mining for Business Intelligence". On some university webpages "Intelligence" was dropped. But I won't crack any jokes!

## Thursday, December 13, 2007

## Friday, November 30, 2007

### Insights from the Netflix contest

The neat recent Wall Street Journal article Netflix Aims to Refine Art of Picking Films (Nov 20, 2007) was sent to me by Moshe Cohen, one of my dedicated ex-data-mining-course students. In the article, a spokesman from Netflix demystifies some of the winning techniques in the Netflix $1 million contest. OK, not really demystifying, but revealing two interesting insights:

1) Some teams joined forces by combining their predictions to obtain improved predictions (without disclosing their actual algorithms to each other). Today, for instance, the third best team on the Netflix Leaderboard is "When Gravity and Dinosaurs Unite", which is the result of two teams combining their predictions (Gravity from Hungary and Dinosaur Planet from the US). This is an example of the "portfolio approach", which says that combining predictions from a variety of methods (and sometimes a variety of datasets) can lead to higher performance, just like stock portfolios.

2) AT&T, who is currently in the lead, takes an approach that includes 107 different techniques (blended in different ways). You can get a glimpse of these methods in their publicly available document written by Robert Bell, Yehuda Koren, and Chris Volinsky (kudos for the "open-source"!). They use regression models, k-nearest-neighbor methods, collaborative filtering, "portfolios" of the different methods, etc. Again, this shows that "looking" at data from multiple views is usually very beneficial. Like painkillers, a variety is useful because sometimes one works but other times another works better.

Please note that this does NOT suggest that a portfolio approach with painkillers is recommended!
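The gain from blending is easy to demonstrate. Below is a minimal sketch (entirely hypothetical data and models, not the actual Netflix contest algorithms): two models with independent errors are combined by simple averaging, and the blend's RMSE beats either model alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# True ratings and two noisy "models" whose errors are independent
true = rng.uniform(1, 5, size=10_000)
pred_a = true + rng.normal(0, 0.9, size=true.size)  # model A's predictions
pred_b = true + rng.normal(0, 0.9, size=true.size)  # model B's predictions

def rmse(pred, actual):
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

# The simplest "portfolio": an equal-weight average of the two predictions
blend = (pred_a + pred_b) / 2

print(rmse(pred_a, true))  # each model alone: RMSE near 0.9
print(rmse(pred_b, true))
print(rmse(blend, true))   # the blend: noticeably lower, near 0.9/sqrt(2)
```

With independent errors of equal variance, averaging two predictions shrinks the error standard deviation by a factor of about 1/√2, which is exactly the diversification effect in a two-stock portfolio.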


## Thursday, November 08, 2007

### Good and bad of classification/regression trees

Classification and Regression Trees are great for both explanatory and predictive modeling. Although data-driven, they provide transparency about the resulting classifier and are far from being a black box. For this reason, trees are often used in applications that require transparency, such as insurance or credit approvals.

Trees are also used during the exploratory phase for the purpose of variable selection: variables that show up at the top layers of the tree are good candidates as "key players".

Trees do not make any distributional assumptions and are also quite robust to outliers. They can nicely capture local pockets of behavior that would require complicated interaction terms in regression-type models. Although this sounds like the perfect tool, there is no free lunch. First, a tree usually requires **lots of data**: the tree is built on a training set; then, in CART trees, a validation set is used to prune the tree to avoid over-fitting; finally, a test set is needed for evaluating the actual performance of the tree on new data. Second, a tree can be pretty **computationally expensive** to create, as a function of the number of variables: building a tree requires evaluating a huge number of splits on all possible variables and their values (especially if they are numeric). The good news is that once the tree is built, scoring new data is cheap (unlike k-nearest-neighbor algorithms, which are also very costly in scoring new data).

As in any prediction task, the greatest danger is that of over-fitting. In trees this is avoided either by stopping tree growth (e.g., in CHAID-type trees that are popular in marketing), or by growing the entire tree and then pruning it. In the latter case, when comparing the full and pruned trees there will usually be a huge difference in tree size. However, there can be cases where the two trees have similar out-of-sample performance: this happens when the data contain very little noise, so that over-fitting is not substantial. You can find such an example in our book Data Mining for Business Intelligence ("Acceptance of Personal Loan", chap. 7, pp. 120-129).
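The grow-then-prune idea can be sketched in a few lines. The example below uses scikit-learn's cost-complexity pruning on synthetic data (an assumption on my part; it is a modern stand-in for, not a reproduction of, classic CART holdout pruning): grow the full tree, then pick the pruning level that performs best on a validation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for, e.g., loan-acceptance records
X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)

# Grow the full tree, then list the candidate pruning levels (ccp_alpha)
full = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Choose the alpha whose pruned tree scores best on the validation set
best = max(alphas, key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=1)
           .fit(X_train, y_train).score(X_valid, y_valid))

pruned = DecisionTreeClassifier(ccp_alpha=best, random_state=1).fit(X_train, y_train)
print(full.tree_.node_count, pruned.tree_.node_count)  # pruned tree is typically much smaller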

## Wednesday, September 19, 2007

### Webcast on Analytics in the Classroom

Tomorrow at 11:00 EST I will be giving a webcast describing several term projects by MBAs in my data mining class. Students have been working on real business projects in my class for 4 years now, with many of the projects leading to important insights for the companies that provided the data (in most cases the students' workplaces).

For each of several cases I will describe the business objective; we'll look at the data via interactive visualization using Spotfire, and then examine some of the analyses and findings.

The webcast is organized by Spotfire (now a division of TIBCO). We have been using their interactive visualization software in classes via their b-school education outreach program.

To join the webcast tomorrow, please register: Analytics in the Classroom- Giving today's MBA's a Competitive Advantage, by Dr. Galit Shmueli, Univ. of MD


## Thursday, September 06, 2007

### Data mining = Evil?

Some get a chill when they hear "data mining" because they associate it with "big brother". Well, here's one more major incident that sheds darkness on smart algorithms: The Department of Homeland Security declared the end of a data mining program called ADVISE (Analysis, Dissemination, Visualization, Insight and Semantic Enhancement). Why? Because it turns out that they were testing it for two years on live data on real people "without meeting privacy requirements" (Yahoo! News: DHS ends criticized data-mining program).

There is nothing wrong or evil about data mining. It's like any other tool: you can use it or abuse it. Issues of privacy and confidentiality in data usage have always been there and will continue to be a major concern as more and more of our private data gets stored in commercial, government, and other databases.

Many students in my data mining class use data from their workplace for their term project. The projects almost always turn out to be insightful and useful beyond the class exercise. But we do always make sure to obtain permission, de-identify, and protect and restrict access to the data as needed. Good practice is the key to keeping "data mining" a positive term!


## Wednesday, September 05, 2007

### Shaking up the statistics community

A new book is drawing emotional reactions from the normally calm statistics community (no pun intended): The Black Swan: The Impact of the Highly Improbable by Nassim Taleb uses blunt language to critique the field of statistics, statisticians, and users of statistics. I have not yet read the book, but given the many reviews and coverage I am running to get a copy.

The widely read ASA journal The American Statistician decided to devote a special section to reviewing the book, and even obtained a (somewhat bland) response from the author. Four reputable statisticians (Robert Lund, Peter Westfall, Joseph Hilbe, and Aaron Brown) reviewed the book, some trying to confront its arguments and criticizing the author for making some unscientific claims. A few reviews even include formulas and derivations. All four agree that this is an important read for statisticians, and that it raises some interesting points for us to ponder.

The author's experiences come from the world of finance, where he worked for investment banks, a hedge fund, and finally made a fortune at his own hedge fund. His main claim (as I understand from the reviews and coverage) is that analytics should focus more on the tails, or the unusual, and not as much on the "average". That's true in many applications (e.g., in my own research in biosurveillance, for early detection of disease outbreak, or in anomaly detection as a whole). Before I make any other claims, though, I must rush to read the book!


## Thursday, July 19, 2007

### Handling outliers with a smile

Here's one of the funniest statistics cartoons that I've seen (thanks, Adi Gadwale!). First you laugh, then you cry.

It also reminds me of the claim by the famous industrial statistician George Box: "**All models are wrong, but some are useful**".

## Wednesday, July 11, 2007

### The Riverplot: Visualizing distributions over time

The boxplot is one of the neatest visualizations for examining the distribution of values, or for comparing distributions. It is more compact than a histogram in that it presents only the median, the two quartiles, the range of the data, and outliers. It also requires less user input than a histogram (where the user usually has to determine the number of bins). I view the boxplot and histogram as complements, and examining both is good practice.

But how can you visualize a distribution of values over time? Well, a series of boxplots often does the trick. But if the frequency is very high (e.g., ticker data) and the time scale of interest can be considered continuous, then an alternative is the River Plot. This is a visualization that we developed together with our colleagues at the Human-Computer Interaction Lab on campus. It is essentially a "continuous boxplot" that displays the median and quartiles (and potentially the range or other statistics). It is suitable when you have multiple time series that can be considered replicates (e.g., bids in multiple eBay auctions for an iPhone). We implemented it in the interactive time series visualization tool Time Searcher, which allows users to visualize and interactively explore a large set of time series with attributes.
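Conceptually, the river plot just traces cross-sectional quantiles of the replicate series at each point in time. Here is a minimal sketch of that computation on simulated auction-like data (an illustration only, not the actual Time Searcher implementation):

```python
import numpy as np

# 50 replicate time series (e.g., price curves from 50 similar auctions),
# each observed at 100 common time points
rng = np.random.default_rng(7)
t = np.linspace(0, 7, 100)                       # auction time in days
series = 10 * np.exp(0.4 * t) + rng.normal(0, 5, size=(50, 100))

# At each time point, the "continuous boxplot" shows these cross-sectional statistics
median = np.median(series, axis=0)
q1, q3 = np.percentile(series, [25, 75], axis=0)
lo, hi = series.min(axis=0), series.max(axis=0)

# Drawing the river is then one filled band per statistic pair, e.g. with matplotlib:
#   plt.fill_between(t, q1, q3, alpha=0.4); plt.plot(t, median)
```

The inner band (q1 to q3) plays the role of the box, the median line the role of the box's midline, and the min/max envelope the role of the whiskers, all varying continuously over time.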

Time Searcher is a powerful tool that allows the user to search for patterns, filter, and also forecast an ongoing time series from its past and a historic database of similar time series. But then the Starbucks effect of *too many choices* kicks in. Together with our colleague Paolo Buono from Università di Bari, Italy, we added the feature of "simultaneous previews": the user can choose multiple different parameter settings and view the resulting forecasts simultaneously. This was presented at the most recent InfoVis conference (Similarity-Based Forecasting with Simultaneous Previews: A River Plot Interface for Time Series Forecasting).

## Friday, May 18, 2007

### The good, bad and ugly graphs

In his May 2007 newsletter, Stephen Few, a data visualization guru with expertise in business data, created The Graph Design I.Q. Test: "This brief I.Q. test leads you through a series of 10 questions that ask you to choose which of two graphs presents the data more effectively".

I took it myself (and Stephen is probably tracking my answers!) -- it's very cool and quickly teaches a good lesson in good vs. bad graphics and tables. After taking it, you will be strongly discouraged from abusing color, 3D, etc.

If you got hooked, Stephen has a big bag of goodies for those who want to learn about creating good graphs and tables. He wrote a beautiful book called "Show Me the Numbers".

His website's Library also includes an abundance of useful articles.


## Friday, May 11, 2007

### NYT to mine their own data

You might ask yourself how on earth I have time for an entry during the last day of classes. Well, I don't. That's why I am doing it.

The New York Times recently announced to their stockholders that they are going to be revolutionary by mining their own data. As quoted in the Village Voice:

Data mining, [The company CEO Janet Robinson] told the crowd, would be used "to determine hidden patterns of uses to our website." This was just one of the many futuristic projects in the works by the newspaper company's research and development program.

The article focuses on the alarm that this causes in terms of "what happens when the government comes in and subpoenas it?"

My question is: since every company and organization is mining (or can potentially mine) their own data anyway, what is the purpose of announcing it publicly? Clearly data mining is not such a "futuristic" act. What kind of "hidden patterns" are they looking for? The paths that readers take when they move between articles? What precedes their clicking an ad? Or maybe there is a futuristic goal?

## Friday, April 20, 2007

### Statistics are not always to blame!

My current MBA student Brenda Martineau showed me a March 15, 2007 article in the Wall Street Journal entitled Stupid Cancer Statistics. Makes you almost think that once again someone is abusing statistics -- but wait! A closer look reveals that the real culprit is not the "mathematical models", but rather the variable that is being measured and analyzed!

According to the article, the main fault is in measuring (and modeling) *mortality rate* in order to determine the usefulness of breast cancer early screening. Women who get diagnosed early (before the cancer escapes the lung) do not necessarily live longer than those who do not get diagnosed. But their quality of life is much improved. Therefore, the author explains, the real measure should be quality of life. If I understand this correctly, this really has nothing to do with "faulty statistics", but rather with the choice of measurement to analyze!

In short, although it is a popular habit, you can't always blame the statistical models all the time...

### Classification Trees: CART vs. CHAID

When it comes to classification trees, there are three major algorithms used in practice: CART ("Classification and Regression Trees"), C4.5, and CHAID.

All three algorithms create classification rules by constructing a tree-like structure of the data. However, they are different in a few important ways.

The main difference is in the tree construction process. In order to avoid over-fitting the data, all methods try to limit the size of the resulting tree. CHAID (and variants of CHAID) achieve this by using a statistical stopping rule that discontinues tree growth. In contrast, both CART and C4.5 first grow the full tree and then prune it back. The tree pruning is done by examining the performance of the tree on a holdout dataset, and comparing it to the performance on the training set. The tree is pruned until the performance is similar on both datasets (thereby indicating that there is no over-fitting of the training set). This highlights another difference between the methods: CHAID and C4.5 use a single dataset to arrive at the final tree, whereas CART uses a training set to build the tree and a holdout set to prune it.

A difference between CART and the other two is that the CART splitting rule allows only binary splits (e.g., "if Income<$50K then X, else Y"), whereas C4.5 and CHAID allow multiple splits. In the latter, trees sometimes look more like bushes. CHAID has been especially popular in marketing research, in the context of market segmentation. In other areas, CART and C4.5 tend to be more popular. One important difference that came to my mind is in the goal that CHAID is most useful for, compared to the goal of CART. To clarify my point, let me first explain the CHAID mechanism in a bit more detail. At each split, the algorithm looks for the predictor variable that, if split, most "explains" the categorical response variable. In order to decide whether to create a particular split based on this variable, the CHAID algorithm tests a hypothesis regarding dependence between the split variable and the categorical response (using the chi-squared test for independence). Using a pre-specified significance level, if the test shows that the split variable and the response are independent, the algorithm stops the tree growth. Otherwise the split is created, and the next best split is searched for. In contrast, the CART algorithm decides on a split based on the amount of within-class homogeneity that is achieved by the split, and later on, the split is reconsidered based on considerations of over-fitting. Now I get to my point:

**It appears to me that CHAID is most useful for *analysis*, whereas CART is more suitable for *prediction*.** In other words, CHAID should be used when the goal is to describe or understand the relationship between a response variable and a set of explanatory variables, whereas CART is better suited for creating a model that has high prediction accuracy for new cases.

In the book Statistics Methods and Applications by Hill and Lewicki, the authors mention another related difference, related to CART's binary splits vs. CHAID's multiple-category splits: "CHAID often yields many terminal nodes connected to a single branch, which can be conveniently summarized in a simple two-way table with multiple categories for each variable of dimension of the table. This type of display matches well the requirements for research on market segmentation... CART will always yield binary trees, which sometimes can not be summarized as efficiently for interpretation and/or presentation". In other words, if the goal is explanatory, CHAID is better suited for the task.

There are additional differences between the algorithms, which I will not mention here. Some can be found in the excellent Statistics Methods and Applications by Hill and Lewicki.
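The two splitting criteria can be made concrete side by side. The toy sketch below (hypothetical counts, not taken from any of the books cited) computes the chi-squared independence test that CHAID would apply to a candidate split, and the Gini impurity reduction that CART would credit the same split:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Toy candidate split: does Income < $50K relate to a 0/1 response?
# Rows: Income<50K / Income>=50K; columns: response = 0 / response = 1
table = np.array([[80, 20],
                  [40, 60]])

# CHAID-style decision: chi-squared test of independence against a significance level
chi2, pvalue, dof, _ = chi2_contingency(table)
print(f"p-value = {pvalue:.2g}")  # create the split only if pvalue < alpha (e.g., 0.05)

# CART-style decision: the reduction in Gini impurity achieved by the split
def gini(counts):
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

n = table.sum()
parent = gini(table.sum(axis=0))                            # impurity before the split
children = sum(row.sum() / n * gini(row) for row in table)  # weighted child impurity
print(f"Gini reduction = {parent - children:.3f}")  # CART prefers splits maximizing this
```

For this table the split is both statistically significant (tiny p-value, so CHAID would split) and impurity-reducing (0.48 down to 0.40, so CART would consider it a good split); the two criteria can, of course, disagree on less clear-cut tables.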

## Tuesday, April 10, 2007

### Another Treemap in NYT!

While we're at it, this Saturday's Business section of the New York Times featured the article Sifting data to Uncover Travel Deals. One of the websites mentioned (PointMaven.com) actually uses a Treemap to display hotel points promotions.

OK -- full disclosure: this is my husband's website and yes, I was involved... But hey -- that's the whole point of having an in-house statistician!


## Monday, April 02, 2007

### Visualizing hierarchical data

Today much data is gathered from the web. Data from websites often tend to be hierarchical in nature: for example, on Amazon we have categories (music, books, etc.), then within a category there are sub-categories (e.g., within Books: Business & Technology, Children's books, etc.), and sometimes there are even additional layers. Other examples are eBay, Epinions, and almost any e-tailer. Even travel sites usually include some level of hierarchy.

The standard plots and graphs such as bar charts, histograms, boxplots might be useful for visualizing a particular level of hierarchy, but not the "big picture". The method of trellising is useful, where a particular graph is "broken down" by one or more variables. However, you still do not directly see the hierarchy.

An ingenious method for visualizing hierarchical data is the Treemap, designed by Professor Ben Shneiderman from the Human-Computer Interaction Lab at the University of Maryland. The treemap is basically a rectangular region broken down into sub-rectangles (and then possibly into further sub-sub-rectangles), where each smallest rectangle represents the unit of interest. Color and/or size can then be used to describe measures of interest.

Treemap's original goal was to visualize one's hard drive (with all its directories and sub-directories) for detecting phenomena such as duplications. There, a file was a single entity, and its size, for instance, could be represented by the rectangle's size. Since its development in the 1990s it has spread widely across almost every possible discipline. Probably the most popular application is SmartMoney's Map of the Market, where you can visualize the current state of the entire stock market. The strength of the treemap lies both in its ability to include multiple levels of hierarchy (you can drill in and out of different levels) and in its interactive nature, where users can choose to manipulate color, size, and order to represent measures of interest.
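At its core, a treemap layout recursively carves a rectangle into sub-rectangles proportional to each item's size. Here is a minimal "slice-and-dice" sketch of that computation (a deliberately simplified layout with a hypothetical store hierarchy; Treemap itself uses more refined algorithms such as squarified layouts):

```python
def slice_and_dice(node, x=0.0, y=0.0, w=1.0, h=1.0, depth=0):
    """Return (name, x, y, w, h) rectangles for every leaf of a nested dict.

    A value is either a number (leaf size) or a dict of children."""
    rects = []
    total = _total(node)
    offset = 0.0
    for name, child in node.items():
        frac = _total(child) / total
        # alternate the slicing direction at each level of the hierarchy
        if depth % 2 == 0:
            cx, cy, cw, ch = x + offset * w, y, frac * w, h
        else:
            cx, cy, cw, ch = x, y + offset * h, w, frac * h
        if isinstance(child, dict):
            rects += slice_and_dice(child, cx, cy, cw, ch, depth + 1)
        else:
            rects.append((name, cx, cy, cw, ch))
        offset += frac
    return rects

def _total(node):
    return sum(_total(v) for v in node.values()) if isinstance(node, dict) else node

# Hypothetical store hierarchy: category -> subcategory -> sales
store = {"Books": {"Business": 40, "Children": 20}, "Music": {"Rock": 40}}
for name, x, y, w, h in slice_and_dice(store):
    print(f"{name:10s} area={w * h:.2f}")  # each area equals that leaf's share of sales
```

Each leaf's rectangle area equals its share of the total, so within the unit square "Business" gets 40% of the area, "Children" 20%, and "Rock" 40%; color would then be a free channel for a second measure.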

Microsoft Research posts a free Excel add-on called Treemapper, but after trying it out I think it is too limited: it allows only one level of hierarchy and does not have any interactivity (it also accepts only numerical information).

Last month the business section of the New York Times featured an article This time, no roadside assistance on DaimlerChrysler, which included a neat Treemap. Since it is no longer available online (NYT does not include graphics in its archives...) here it is -- courtesy of Amanda Cox from the NYT, known as their "statistics wiz".

You can find many more neat examples of using Treemap on the HCIL website.

The standard plots and graphs such as bar charts, histograms, boxplots might be useful for visualizing a particular level of hierarchy, but not the "big picture". The method of trellising is useful, where a particular graph is "broken down" by one or more variables. However, you still do not directly see the hierarchy.

An ingenious method for visualizing hierarchical data is the Treemap, designed by Professor Ben Shneiderman from the Human-Computer Interaction Lab at the University of Maryland. The treemap is basically a rectangular region broken down into sub-rectangles (and then possibly into further sub-sub-rectangles), where each smallest rectangle represents the unit of interest. Color and/or size can then be used to describe measures of interest.

Treemap's original goal was to visualize one's hard drive (with all its directories and sub-directories) for detecting phenomena such as duplications. There, a file was a single entity, and its size, for instance, could be represented by the rectangle's size. Since its development in the 1990s the treemap has spread widely across almost every possible discipline. Probably the most popular application is SmartMoney's Map of the Market, where you can visualize the current state of the entire stock market. The strength of the treemap lies both in its ability to include multiple levels of hierarchy (you can drill in and out of different levels) and in its interactive nature, where users can choose to manipulate color, size, and order to represent measures of interest.
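The layout idea can be sketched in a few lines. Below is a minimal "slice-and-dice" layout -- a simplification of the treemap algorithm, using hypothetical file data: each rectangle is split among its children in proportion to their sizes, alternating the split direction at each level of the hierarchy.

```python
def slice_and_dice(node, x, y, w, h, depth=0):
    """Recursively assign rectangles (x, y, w, h) to a hierarchy.
    A node is either a (name, size) leaf or a (name, [children]) internal node.
    Splits alternate between left-to-right and top-to-bottom by depth."""
    name, content = node
    if isinstance(content, (int, float)):          # leaf: emit its rectangle
        return [(name, x, y, w, h)]
    total = sum(size_of(child) for child in content)
    rects, offset = [], 0.0
    for child in content:
        frac = size_of(child) / total              # child's share of the area
        if depth % 2 == 0:                         # split left-to-right
            rects += slice_and_dice(child, x + offset * w, y, frac * w, h, depth + 1)
        else:                                      # split top-to-bottom
            rects += slice_and_dice(child, x, y + offset * h, w, frac * h, depth + 1)
        offset += frac
    return rects

def size_of(node):
    """Total size of a leaf or subtree."""
    name, content = node
    if isinstance(content, (int, float)):
        return content
    return sum(size_of(child) for child in content)

# A toy "hard drive": two directories holding files whose sizes drive the areas.
tree = ("root", [("docs", [("a.txt", 30), ("b.txt", 10)]), ("media", [("c.mp4", 60)])])
layout = slice_and_dice(tree, 0, 0, 1.0, 1.0)
for name, x, y, w, h in layout:
    print(name, round(w * h, 2))   # each leaf's area equals its share of total size
```

Real treemap software adds the interactivity (drill-down, color, ordering) on top of a layout like this; the "squarified" variant also avoids the long thin rectangles that slice-and-dice can produce.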

Microsoft Research posts a free Excel add-on called Treemapper, but after trying it out I think it is too limited: it allows only one level of hierarchy, has no interactivity, and accepts only numerical information.

Last month the business section of the New York Times featured an article on DaimlerChrysler, "This time, no roadside assistance", which included a neat Treemap. Since it is no longer available online (the NYT does not include graphics in its archives...), here it is -- courtesy of Amanda Cox of the NYT, known as their "statistics wiz".

You can find many more neat examples of using Treemap on the HCIL website.

## Thursday, March 29, 2007

### Stock performance and CEO house size study

The *BusinessWeek* article "The CEO Mega-Mansion Factor" (April 2, 2007) definitely caught my attention: two finance professors (Liu and Yermack) collected data on house sizes of CEOs of the S&P 500 companies in 2004. Their theory is "If home purchases represent a signal of commitment by the CEO, subsequent stock performance of the company should at least remain unchanged and possibly improve. Conversely, if home purchases represent a signal of entrenchment, we would expect stock performance to decline after the time of purchase." The article summarizes the results: "[they] found that 12% of [CEOs] lived in homes of at least 10,000 square feet, or a minimum of 10 acres. And their companies' stocks? In 2005 they lagged behind those of S&P 500 CEOs living in smaller houses by 7%, on average".

At this point I had to find out more details! I tracked down the research article, "Where are the shareholder's mansions? CEOs' home purchases, stock sales, and subsequent company performance", which contains further details about the data and the analysis. The authors describe the tedious job of assembling the house information from multiple databases, dealing with missing values and differences in information from different sources. A few questions come to mind:

- A plot of value of CEO residence vs. CEO tenure in office (both in log scale) has a suspicious fan shape, indicating that the variability in residence value increases with CEO tenure. If this is true, it would mean that the fitted regression line (with slope 0.15) is not an adequate model, and therefore its interpretation is not valid. A simple look at the residuals would give the answer.
- The exploratory step indicates a gap between the performance of CEOs with below-median house sizes and those with above-median houses. The question is whether the difference is random or reflects a true difference. In order to test the statistical significance of these differences the researchers had to define "house size". Due to missing values, they decided on the following: "We adopt a simple scheme for classifying a CEO's residence as "large" if it has either 10,000 square feet of floor area or at least 10 acres of land. While this rule is somewhat ad hoc, it fits our data nicely by identifying about 15% of the sample residences as extremely large." Since this is an arbitrary cutoff, it is important to evaluate its effect on the results: what happens if other cutoffs are used? Is there a better way to combine the information that is not missing in order to obtain a better metric?
- The main statistical tests, which compare the stock performances of different types of houses (above- vs. below-median market values; "large" vs. not-"large" homes), are a series of t-tests for comparing means and Wilcoxon tests for comparing medians. Of all 8 tests performed, only one ended up with a p-value below 5%: a difference between the median stock performance of "large-home" CEOs and "not-large-home" CEOs. Recall that this is based on the arbitrary definition of a "large" home. In other words, the differences in stock performances do not appear to be strongly statistically significant. This might improve as the sample sizes are increased -- a large number of observations was dropped due to missing values.
- Finally, another interesting point is how the model can be used. *BusinessWeek* quotes Yermack: "If [the CEO] buys a big mansion, sell the stock". Such a claim means that house size is predictive of stock performance. However, the model (as described in the research paper) was not constructed as a predictive model: there is no holdout set to evaluate predictive accuracy, and no predictive measures are mentioned. Finding a statistically significant relationship between house size and subsequent stock performance is not necessarily indicative of predictive power.
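The sensitivity to the cutoff can be probed with a quick simulation. The sketch below uses purely synthetic data (house sizes and returns generated with no built-in relationship -- not the authors' dataset) to show how the estimated gap drifts as the arbitrary cutoff moves:

```python
import random

random.seed(1)
# Synthetic illustration only: house sizes (sq ft, lognormal) and subsequent
# stock returns (normal), generated independently of each other.
houses = [(random.lognormvariate(8.5, 0.5), random.gauss(0.05, 0.2))
          for _ in range(300)]

def mean_gap(cutoff):
    """Difference in mean return: 'large'-home CEOs minus the rest."""
    large = [r for s, r in houses if s >= cutoff]
    rest = [r for s, r in houses if s < cutoff]
    if not large or not rest:
        return None
    return sum(large) / len(large) - sum(rest) / len(rest)

# Even with no true relationship, the estimated gap (and the share of homes
# classified "large") shifts as the cutoff shifts -- a cheap sensitivity check.
for cutoff in (6000, 8000, 10000, 12000):
    n_large = sum(s >= cutoff for s, _ in houses)
    gap = mean_gap(cutoff)
    print(cutoff, n_large, None if gap is None else round(gap, 4))
```

Repeating the paper's actual tests across a grid of cutoffs would show whether the single significant result survives the arbitrariness of the "large home" definition.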

## Tuesday, March 20, 2007

### Google purchases data visualization tool

Once again, some hot news from my ex-student Adi Gadwale: Google recently purchased a data visualization tool from Professor Hans Rosling at Stockholm's Karolinska Institute (read the story). Adi also sent me the link to Gapminder, the tool that Google has put out: http://tools.google.com/gapminder. For those of us who've become addicts of the interactive visualization tool Spotfire, this looks pretty familiar!

## Wednesday, March 07, 2007

### Multiple Testing

My colleague Ralph Russo often comes up with memorable examples for teaching complicated concepts. He recently sent me an Economist article called "Signs of the Times" that shows the absurd results that can be obtained if multiple testing is not taken into account.

Multiple testing arises when the same data are used simultaneously for testing many hypotheses. The problem is a huge inflation in the type I error (i.e., rejecting the null hypothesis in error). Even if each single hypothesis is tested at a low significance level (e.g., the infamous 5% level), the aggregate type I error grows very fast. In fact, when testing k hypotheses that are independent of each other, each at significance level alpha, the overall type I error is 1-(1-alpha)^k. That's right -- it approaches 1 exponentially fast. For example, if we test 7 independent hypotheses at a 10% significance level each, the overall type I error is 52%. In other words, even if none of the hypotheses is true, there is a better-than-even chance that at least one p-value falls below 10%.
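This inflation is easy to verify numerically. A quick sketch of the familywise type I error, together with the effect of the classic Bonferroni correction (testing each hypothesis at alpha/k):

```python
def familywise_error(alpha, k):
    """P(at least one false rejection) among k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

# 7 independent tests at the 10% level: the overall type I error exceeds 50%.
print(round(familywise_error(0.10, 7), 3))        # 0.522

# Bonferroni tests each hypothesis at alpha/k, capping the familywise
# error at roughly alpha.
print(round(familywise_error(0.10 / 7, 7), 3))    # 0.096
```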

In the Economist article, Dr. Austin tests a set of multiple absurd "medical" hypotheses (such as "people born under the astrological sign of Leo are 15% more likely to be admitted to hospital with gastric bleeding than those born under the other 11 signs"). He shows that some of these hypotheses are "supported by the data", if we ignore multiple testing.

There is a variety of solutions for multiple testing, some older (such as the classic Bonferroni correction) and some more recent (such as the False Discovery Rate). But most importantly, this issue should be recognized.

### Source for data

Adi Gadwale, a student in my 2004 MBA Data Mining class, still remembers my fetish with business data and data visualization. He just sent me a link to an IBM Research website called Many Eyes, which includes user-submitted datasets as well as Java-applet visualizations.

The datasets include quite a few "junk" datasets, lots with no description. But there are a few interesting ones: FDIC is a "scrubbed list of FDIC institutions removing inactive entities and stripping all columns apart from Assets, ROE, ROA, Offices (Branches), and State". It includes 8711 observations. Another is Absorption Coefficients of Common Materials - I can just see the clustering exercise! Or the 2006 Top 100 Video Games by Sales. There are social-network data, time series, and cross-sectional data. But again, it's like shopping at a second-hand store -- you really have to go through a lot of junk in order to find the treasures.

Happy hunting! (and thanks to Adi)

## Tuesday, March 06, 2007

### Accuracy measures

There is a host of metrics for evaluating predictive performance. They are all based on aggregating the forecast errors in some form. The two most famous metrics are RMSE (Root-mean-squared-error) and MAPE (Mean-Absolute-Percentage-Error). In an earlier posting (Feb-23-2006) I disclosed a secret deciphering method for computing these metrics.

Although these two have been the most popular in software, competitions, and published papers, they have their shortcomings. One serious flaw of the MAPE is that zero counts contribute an infinite value to the MAPE (because of the division by zero). One solution is to leave the zero counts out of the computation, but then these counts and their prediction error must be reported separately.

I found a very good survey paper that lists many different metrics along with their advantages and weaknesses. The paper, "Another look at measures of forecast accuracy" (International Journal of Forecasting, 2006), by Hyndman and Koehler, concludes that the best metric to use is the Mean Absolute Scaled Error, which has the mean acronym MASE.
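The definitions are simple enough to state directly in code. A small sketch with made-up numbers; following Hyndman and Koehler, MASE scales the forecast errors by the in-sample mean absolute error of the naive (one-step-ahead) forecast, so it stays defined even when actual values are zero:

```python
def rmse(actual, forecast):
    """Root mean squared error."""
    return (sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)) ** 0.5

def mape(actual, forecast):
    """Mean absolute percentage error -- blows up when any actual value is 0."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def mase(actual, forecast, training):
    """Mean absolute scaled error: forecast MAE divided by the in-sample MAE
    of the naive forecast (each value predicted by its predecessor)."""
    naive_mae = sum(abs(training[t] - training[t - 1])
                    for t in range(1, len(training))) / (len(training) - 1)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / naive_mae

training = [10, 12, 11, 13, 12, 14]          # series used to fit the model
actual, forecast = [13, 15], [14, 14]        # holdout values and forecasts
print(round(rmse(actual, forecast), 3))      # 1.0
print(round(mape(actual, forecast), 3))      # 7.179
print(round(mase(actual, forecast, training), 3))   # 0.625 (beats the naive forecast)
```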

## Thursday, March 01, 2007

### Lots of real time series data!

I love data-mining and statistics competitions -- they always provide great real data! However, the big difference between a gold mine and "just some data" is whether the data description and context are complete. This reflects, in my opinion, the difference between "data mining for the purpose of data mining" and "data mining for business analytics" (or any other field of interest, such as engineering or biology).

Last year, the BICUP2006 posted an interesting dataset on bus ridership in Santiago de Chile. Although there was a reasonable description of the data (number of passengers at a bus station in 15-minute intervals), there was no information on the actual context of the problem. The goal of the competition was to accurately forecast 3 days beyond the given data. Although this has its challenges, the main question is whether a method that accurately predicts these 3 days would be useful to the Santiago Transportation Bureau, or anyone else outside of the competition. For instance, the training data included 3 weeks, with a pronounced weekday/weekend effect, yet the prediction set included only 3 weekdays. A method that predicts weekdays accurately might suffer on weekends. It is therefore imperative to include the final goal of the analysis. Will this forecaster be used to assist in bus scheduling on weekdays only? During rush hours only? How accurate do the forecasts need to be for practical use? Maybe a really simple model predicts accurately enough for the purpose at hand.

Another such instance is the upcoming NN3 Forecasting Competition (part of the 2007 International Symposium on Forecasting). The dataset includes 111 time series of varying length (about 40-140 time points). However, I have not found any description of either the data or the context. In reality we would always know at least the time frequency: are these measurements every second? Minute? Day? Month? Year? This information is obviously important for determining factors like seasonality and which methods are appropriate.

To download the data and see a few examples, you will need to register your email.

An example of a gold mine is the T-competition, which concentrates on forecasting transportation data. In addition to the large number of series (ranging in length and at various frequencies from daily to yearly), there is a solid description of what each series is, and the actual dates of measurement. They even include a set of seasonal indexes for each series. The data come from an array of transportation measurements in both Europe and North America.

## Tuesday, February 13, 2007

### The magical sample size in polls

Now that political polls are a hot item, it is time to unveil the mysterious sentence that accompanies many public opinion polls (not only political ones). It typically reads: "the poll included 1,033 adults and has a sampling error of plus or minus three percentage points".

No matter what population is being sampled, the sample size is typically around 1,000 and the precision is almost always "±3%" (this is called the margin of error).

If you type "poll" in Google you will find plenty of examples. One example is the Jan 2, 2007 NYT Business section article "Investors Greet New Year With Ambivalence". It concludes that

"Having enjoyed a year that was better than average in the stock market and a much weaker one in housing, home owners and investors appear neither exuberant nor glum about 2007." This result is based on: "The telephone survey was conducted from Dec. 8 to 10 and included 922 adults nationwide and has a sampling error of plus or minus three percentage points."

Discover card also runs its own survey, measuring the economic confidence of small business owners. The survey goes to "approximately 1,000 small business owners" and explains that "The margin of error for the sample involving small business owners is approximately +/- 3.2 percentage points with a 95 percent level of confidence".

So how does this work? To specify a sample size one must consider:

- the population size from which the sample is taken (if it is small, then a correction factor needs to be taken into account),
- the precision of the estimator (how much variability from sample to sample we tolerate), and
- the statistical confidence level or, equivalently, the significance level (denoted alpha) with the corresponding normal distribution percentile Z_alpha/2.

These combine into the margin of error:

3% = estimator standard deviation * correction factor * Z_alpha/2

In polls, the parameter of interest is a population proportion, *p*; e.g., the proportion of Democratic voters in the US. The sample estimator is simply the *sample proportion* of interest (e.g., the proportion of Democratic voters in the sample). This estimator has a standard deviation equal to *√(p(1-p)/n)*, where *n* is the sample size.

You will notice that the standard deviation is largest when *p = 0.5*. This gives a conservative bound on the estimator precision (#2 above). So now we have

3% = √(0.25/n) * correction factor * Z_alpha/2

Regarding population size: in polls it is typically assumed to be very large, so there is no need for a correction factor. Finally, the popular significance level is 5%, which corresponds to approximately Z_0.025 = 2.

AND NOW, LADIES AND GENTLEMEN, WE GET:

3% = √(0.25/n) * 2 = 1/√n
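The whole calculation fits in a few lines; a minimal sketch using the conservative p = 0.5 and Z ≈ 2:

```python
def margin_of_error(n, p=0.5, z=2):
    """Approximate poll margin of error: z * sqrt(p(1-p)/n)."""
    return z * (p * (1 - p) / n) ** 0.5

# n = 1,000 gives the famous ~3 percentage points, regardless of population size.
print(round(100 * margin_of_error(1000), 2))   # 3.16

# Halving the margin of error requires quadrupling the sample size.
print(round(100 * margin_of_error(4000), 2))   # 1.58
```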

If you plug in *n = 1,000* on the right-hand side, you will get approximately 3%, which is the relationship used between sample size and margin of error in most public opinion polls. The part that people often find surprising is that no matter how large the population, be it 10,000 or 300,000,000, the same sample size is required for obtaining this level of precision and this level of statistical significance.

## Friday, February 02, 2007

### The legendary threshold of 5% for p-values

Almost every introductory course in statistics gets to a point where the concept of the **p-value** is introduced. This is a tough concept that takes time to absorb, and it is usually one of the hardest for students to internalize. An interesting paper by Hubbard and Armstrong discusses the resulting confusion in marketing research textbooks and journal articles.

Another "fact" that usually accompanies the p-value concept is the 5% threshold. One typically learns to compare the p-value (computed from the data) to a 5% threshold; if it falls below that threshold, the effect is statistically significant.

Where does the 5% come from? I pondered that at some point. Since a p-value can be thought of as a measure of risk, 5% is pretty arbitrary. Obviously some applications warrant lower risk levels, while others might tolerate higher ones. According to Jerry Dallal's webpage, the reason is historical: before the age of computers, tables were used for computing p-values, and in Fisher's original tables the levels computed were 5% and a few others. The rest, as they say, is history.

## Friday, January 19, 2007

### How good data can lead to wrong decisions

Consumer Reports has just withdrawn a report on infant car seat test results. Apparently, the testing wrecked most of the car seats. This, in turn, has been reported to cause many parents to start doubting the usefulness of infant car seats! So what happened?

The press release describes the aim of the study: "*The original study, published in the February issue of Consumer Reports, was aimed at discovering how infant seats performed in tests at speeds that match those used in the government's New Car Assessment Program (NCAP).*" In particular, "*Our tests were intended to simulate side crashes at the NCAP speed of 38 mph.*"

What apparently happened was that the actual test was performed at a much higher speed... Consumer Reports decided to withdraw the report and do some more testing.

Was the study design faulty? Probably not. Were the data faulty? Probably not. It looks more like a failed link between the study's originators and its executors. The moral is that even a solid study design and a reliable execution can lead to disastrous results if communication breaks down.

## Wednesday, January 17, 2007

### Havoc in the land of freakonomics

I just stumbled upon a short article in the *Economist's Voice* called "Freak-Freakonomics". The author, Prof. Ariel Rubinstein, attacks Levitt's book Freakonomics by claiming that it really has nothing to do with economics. The article is very critical of the book and spares no arrows. It almost looks like the "war of economists who want to get published". Although there have been some "wars of statisticians" over the centuries (e.g., that between classic and Bayesian statisticians), here the war is not over the data or the modeling; it is over the importance of the questions asked. I bet that the outcome of the publication of this article will actually be an increase in the sales of Freakonomics...