Thursday, March 29, 2007

Stock performance and CEO house size study

The BusinessWeek article "The CEO Mega-Mansion Factor" (April 2, 2007) definitely caught my attention: two finance professors (Liu and Yermack) collected data on the house sizes of CEOs of the S&P 500 companies in 2004. Their theory is "If home purchases represent a signal of commitment by the CEO, subsequent stock performance of the company should at least remain unchanged and possibly improve. Conversely, if home purchases represent a signal of entrenchment, we would expect stock performance to decline after the time of purchase." The article summarizes the results: "[they] found that 12% of [CEOs] lived in homes of at least 10,000 square feet, or a minimum of 10 acres. And their companies' stocks? In 2005 they lagged behind those of S&P 500 CEOs living in smaller houses by 7%, on average".

At this point I had to find out more details! I tracked down the research article, "Where are the shareholders' mansions? CEOs' home purchases, stock sales, and subsequent company performance", which contains further details about the data and the analysis. The authors describe the tedious job of assembling the house information from multiple databases, dealing with missing values and differences in information across sources. A few questions come to mind:
  1. A plot of the value of the CEO's residence vs. CEO tenure in office (both on a log scale) has a suspicious fan shape, indicating that the variability in residence value increases with CEO tenure. If this is true, it would mean that the fitted regression line (with slope .15) is not an adequate model and therefore its interpretation is not valid. A simple look at the residuals would give the answer.
  2. The exploratory step indicates a gap between the performance of below-median CEO house sizes and above-median houses. Now the question is whether the difference is random or reflects a true difference. In order to test the statistical significance of these differences the researchers had to define "house size". They decided to do the following (due to missing values):
    "We adopt a simple scheme for classifying a CEO’s residence as “large” if it has either 10,000 square feet of floor area or at least 10 acres of land." While this rule is somewhat ad hoc, it fits our data nicely by identifying about 15% of the sample residences as extremely large.Since this is an arbitrary cutoff, it is important to evaluate its effect on the results: what happens if other cutoffs are used? Is there a better way to combine the information that is not missing in order to obtain a better metric?
  3. The main statistical tests, which compare the stock performance associated with different types of houses (above- vs. below-median market value; "large" vs. not-"large" homes), are a series of t-tests for comparing means and Wilcoxon tests for comparing medians. Of the 8 tests performed, only one yielded a p-value below 5%: the difference between the median stock performance of "large-home" CEOs and "not-large-home" CEOs. Recall that this is based on the arbitrary definition of a "large" home. In other words, the differences in stock performance do not appear to be strongly statistically significant. This might change as sample sizes increase -- a large number of observations was dropped due to missing values.
  4. Finally, another interesting point is how the model can be used. BusinessWeek quotes Yermack: "If [the CEO] buys a big mansion, sell the stock". Such a claim means that house size is predictive of stock performance. However, the model (as described in the research paper) was not constructed as a predictive model: there is no holdout set to evaluate predictive accuracy, and no predictive measures are mentioned. Finding a statistically significant relationship between house size and subsequent stock performance is not necessarily indicative of predictive power.
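As a side note, the t-test mentioned in point 3 is simple enough to sketch. The following is a minimal pure-Python illustration of the pooled two-sample t statistic for comparing the means of two groups (a textbook sketch with function names of my own choosing, not the exact procedure run in the paper):

```python
import math
from statistics import mean, stdev

def two_sample_t(x, y):
    # Pooled two-sample t statistic for comparing the means of two groups,
    # e.g., stock returns of "large-home" vs. "not-large-home" CEOs.
    nx, ny = len(x), len(y)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))
```

The Wilcoxon (rank-based) alternative compares medians and is less sensitive to outliers, which matters here given the skewness of house values.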

Tuesday, March 20, 2007

Google purchases data visualization tool

Once again, some hot news from my ex-student Adi Gadwale: Google recently purchased a data visualization tool from Professor Hans Rosling of Stockholm's Karolinska Institute (read the story). Adi also sent me the link to Gapminder, the tool that Google has put out. For those of us who have become addicted to the interactive visualization tool Spotfire, this looks pretty familiar!

Wednesday, March 07, 2007

Multiple Testing

My colleague Ralph Russo often comes up with memorable examples for teaching complicated concepts. He recently sent me an Economist article called "Signs of the Times" that shows the absurd results that can be obtained if multiple testing is not taken into account.

Multiple testing arises when the same data are used simultaneously for testing many hypotheses. The problem is a huge inflation of the type I error (i.e., rejecting the null hypothesis in error). Even if each single hypothesis is tested at a low significance level (e.g., the infamous 5% level), the aggregate type I error grows very fast. In fact, when testing k hypotheses that are independent of each other, each at significance level alpha, the overall type I error is 1-(1-alpha)^k, which approaches 100% quickly as k grows. For example, if we test 7 independent hypotheses at a 10% significance level, the overall type I error is 52%. In other words, even if none of the hypotheses is true, there is a 52% chance that at least one of the 7 tests will produce a p-value below 10%.
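The formula is easy to verify numerically; here is a quick sketch in Python (the function name is my own):

```python
def aggregate_type1_error(alpha, k):
    # Probability of at least one false rejection among k independent
    # tests, each run at significance level alpha.
    return 1 - (1 - alpha) ** k

# 7 independent tests at the 10% level: about a 52% chance of at
# least one spurious "significant" result.
print(round(aggregate_type1_error(0.10, 7), 2))  # 0.52
```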

In the Economist article, Dr. Austin tests a set of multiple absurd "medical" hypotheses (such as "people born under the astrological sign of Leo are 15% more likely to be admitted to hospital with gastric bleeding than those born under the other 11 signs"). He shows that some of these hypotheses are "supported by the data", if we ignore multiple testing.

There is a variety of solutions for multiple testing, some older (such as the classic Bonferroni correction) and some more recent (such as the False Discovery Rate). But most importantly, this issue should be recognized.
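For the curious, both corrections are simple enough to sketch in a few lines of Python (a minimal illustration, with function names of my own choosing):

```python
def bonferroni_reject(p_values, alpha=0.05):
    # Bonferroni: test each hypothesis at alpha / k, which caps the
    # family-wise type I error at alpha.
    k = len(p_values)
    return [p <= alpha / k for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    # False Discovery Rate (Benjamini-Hochberg): sort the p-values, find
    # the largest rank i with p_(i) <= q * i / k, and reject the i
    # hypotheses with the smallest p-values.
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / k:
            max_rank = rank
    reject = [False] * k
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            reject[i] = True
    return reject
```

Bonferroni controls the chance of making any false rejection and is quite conservative; the FDR instead controls the expected proportion of false rejections among the rejected hypotheses, which makes it more powerful when many hypotheses are tested.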

Source for data

Adi Gadwale, a student in my 2004 MBA Data Mining class, still remembers my fetish with business data and data visualization. He just sent me a link to an IBM Research website called Many Eyes, which includes user-submitted datasets as well as Java-applet visualizations.

The datasets include quite a few "junk" datasets, lots with no description. But there are a few interesting ones: FDIC is a "scrubbed list of FDIC institutions removing inactive entities and stripping all columns apart from Assets, ROE, ROA, Offices (Branches), and State". It includes 8711 observations. Another is Absorption Coefficients of Common Materials - I can just see the clustering exercise! Or the 2006 Top 100 Video Games by Sales. There are social-network data, time series, and cross-sectional data. But again, it's like shopping at a second-hand store -- you really have to go through a lot of junk in order to find the treasures.

Happy hunting! (and thanks to Adi)

Tuesday, March 06, 2007

Accuracy measures

There is a host of metrics for evaluating predictive performance. They are all based on aggregating the forecast errors in some form. The two most famous metrics are RMSE (Root-mean-squared-error) and MAPE (Mean-Absolute-Percentage-Error). In an earlier posting (Feb-23-2006) I disclosed a secret deciphering method for computing these metrics.

Although these two have been the most popular in software, competitions, and published papers, they have their shortcomings. One serious flaw of the MAPE is that zero counts make it infinite (because of the division by zero). One solution is to leave the zero counts out of the computation, but then these counts and their forecast errors must be reported separately.
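In code, the two metrics look like this - a minimal Python sketch, using the zero-skipping convention just described:

```python
import math

def rmse(actual, forecast):
    # Root-mean-squared-error: square the errors, average, take the root.
    n = len(actual)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n)

def mape(actual, forecast):
    # Mean-absolute-percentage-error (in percent). Zero actual values
    # would make the MAPE infinite, so they are skipped here and should
    # be reported separately.
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    return 100 * sum(abs((a - f) / a) for a, f in pairs) / len(pairs)
```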

I found a very good survey paper, Another look at measures of forecast accuracy (International Journal of Forecasting, 2006), by Hyndman and Koehler, which lists many different metrics and their advantages and weaknesses. It concludes that the best metric to use is the Mean Absolute Scaled Error, which has the mean acronym MASE.
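For reference, here is a sketch of the MASE in Python (the paper defines the scaling benchmark as the in-sample MAE of the naive lag-1 forecast on the training data; for simplicity this sketch computes it on the same series):

```python
def mase(actual, forecast):
    # Mean Absolute Scaled Error: the mean absolute forecast error,
    # scaled by the mean absolute error of the naive (lag-1) forecast.
    # Values below 1 mean the forecasts beat the naive benchmark.
    n = len(actual)
    naive_mae = sum(abs(actual[t] - actual[t - 1]) for t in range(1, n)) / (n - 1)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / n
    return mae / naive_mae
```

Because the scaling is by absolute errors rather than by the actual values, the MASE remains well defined even when the series contains zeros - exactly the case that breaks the MAPE.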

Thursday, March 01, 2007

Lots of real time series data!

I love data-mining and statistics competitions - they always provide great real data! However, the big difference between a gold mine and "just some data" is whether the data description and context are complete. This reflects, in my opinion, the difference between "data mining for the purpose of data mining" and "data mining for business analytics" (or any other field of interest, such as engineering or biology).

Last year, BICUP2006 posted an interesting dataset on bus ridership in Santiago de Chile. Although there was a reasonable description of the data (the number of passengers at a bus station in 15-minute intervals), there was no information on the actual context of the problem. The goal of the competition was to accurately forecast 3 days beyond the given data. Although this has its challenges, the main question is whether a method that accurately predicts these 3 days would be useful to the Santiago Transportation Bureau, or anyone else outside of the competition. For instance, the training data covered 3 weeks and showed a pronounced weekday/weekend effect, yet the prediction set included only 3 weekdays. A method that predicts weekdays accurately might suffer on weekends. It is therefore imperative to include the final goal of the analysis. Will this forecaster be used to assist in bus scheduling on weekdays only? During rush hours only? How accurate do the forecasts need to be for practical use? Maybe a really simple model predicts accurately enough for the purpose at hand.

Another such instance is the upcoming NN3 Forecasting Competition (part of the 2007 International Symposium on Forecasting). The dataset includes 111 time series, varying in length (about 40-140 time points). However, I have not found any description of either the data or their context. In reality we would always know at least the time frequency: are these measurements taken every second? Minute? Day? Month? Year? This information is obviously important for determining factors like seasonality and which methods are appropriate.
To download the data and see a few examples, you will need to register your email.

An example of a gold mine is the T-competition, which concentrates on forecasting transportation data. In addition to the large number of series (ranging in length and at various frequencies from daily to yearly), there is a solid description of what each series is, and the actual dates of measurement. They even include a set of seasonal indexes for each series. The data come from an array of transportation measurements in both Europe and North America.