Thursday, March 26, 2009

Principal Components Analysis vs. Factor Analysis

Here is an interesting example of how similar mechanics lead to two very different statistical tools. Principal Components Analysis (PCA) is a powerful method for data compression, in the sense of capturing the information contained in a large set of variables by a smaller set of linear combinations of those variables. As such, it is widely used in applications that require data compression, such as visualization of high-dimensional data and prediction.
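For readers who want to see the mechanics, here is a minimal sketch of this kind of compression, written in Python with numpy and scikit-learn purely for illustration (the simulated data, variable names, and library choice are mine, not part of the original discussion):

```python
# A minimal PCA sketch: compress many correlated variables into a few
# linear combinations (principal components). Illustrative only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Simulate 200 observations on 10 variables driven by 2 underlying sources
sources = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 10))
X = sources @ loadings + 0.1 * rng.normal(size=(200, 10))

pca = PCA(n_components=2)
scores = pca.fit_transform(X)          # the compressed data: 200 x 2

print(scores.shape)
print(pca.explained_variance_ratio_)   # share of variance captured by each component
```

The two component scores stand in for the original ten variables, and the explained-variance ratios indicate how much of the information the compression retains.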

Factor Analysis (FA), technically considered a close cousin of PCA, is popular in the social sciences, and is used for the purpose of discovering a small number of 'underlying factors' from a larger set of observable variables. Although PCA and FA are both based on orthogonal linear combinations of the original variables, they are very different conceptually: FA tries to relate the measured variables to underlying theoretical concepts, while PCA operates only at the measurement level. The former is useful for explaining; the latter for data reduction (and therefore prediction).

Richard Darlington, a Professor Emeritus of Psychology at Cornell, has a nice webpage describing the two. He tries to address the confusion between PCA and FA by first introducing FA and only then PCA, which is the opposite of what you'll find in textbooks. Darlington comments:
I have introduced principal component analysis (PCA) so late in this chapter primarily for pedagogical reasons. It solves a problem similar to the problem of common factor analysis, but different enough to lead to confusion. It is no accident that common factor analysis was invented by a scientist (differential psychologist Charles Spearman) while PCA was invented by a statistician. PCA states and then solves a well-defined statistical problem, and except for special cases always gives a unique solution with some very nice mathematical properties. One can even describe some very artificial practical problems for which PCA provides the exact solution. The difficulty comes in trying to relate PCA to real-life scientific problems; the match is simply not very good.
Machine learners are very familiar with PCA as well as other compression-type algorithms such as Singular Value Decomposition (the most heavily used compression technique in the Netflix Prize competition). Such compression methods are also used as alternatives to variable selection algorithms, such as forward selection and backward elimination. Rather than retain or remove "complete" variables, combinations of them are used.
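As a rough illustration of the compression-instead-of-selection idea, here is a small numpy sketch (a toy example of my own, not anything from the Netflix competition): rather than keeping or dropping whole variables, we keep the top few singular components, which are linear combinations of all of them.

```python
# Low-rank compression via SVD: keep the top k singular values/vectors
# instead of retaining or removing whole variables. Illustrative sketch.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))          # 100 records, 20 variables

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 5                                   # keep 5 components
scores = U[:, :k] * s[:k]               # compressed representation: 100 x 5,
                                        # usable as inputs to a predictive model
X_k = scores @ Vt[:k, :]                # rank-k approximation of X

# Relative reconstruction error from the compressed representation
print(np.linalg.norm(X - X_k) / np.linalg.norm(X))
```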

I recently learned of Independent Components Analysis (ICA) from Scott Nestler, a former PhD student in our department. He used ICA in his dissertation on portfolio optimization. The idea is similar to PCA, except that the resulting components are not only uncorrelated, but actually independent.
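A toy sketch of the distinction, assuming scikit-learn's FastICA (the signals and mixing setup are invented for illustration): ICA can unmix independent, non-Gaussian sources, while PCA only delivers uncorrelated combinations.

```python
# ICA vs PCA on mixed signals: ICA seeks components that are statistically
# independent, not merely uncorrelated. Illustrative sketch with FastICA.
import numpy as np
from sklearn.decomposition import FastICA, PCA

rng = np.random.default_rng(2)
t = np.linspace(0, 8, 2000)

# Two independent, non-Gaussian sources
s1 = np.sign(np.sin(3 * t))             # square wave
s2 = rng.uniform(-1, 1, size=t.size)    # uniform noise
S = np.column_stack([s1, s2])

A = np.array([[1.0, 0.5], [0.7, 1.0]])  # mixing matrix
X = S @ A.T                             # observed mixtures

ica_est = FastICA(n_components=2, random_state=0).fit_transform(X)
pca_est = PCA(n_components=2).fit_transform(X)
# ica_est recovers the original sources (up to sign/scale/order);
# pca_est gives uncorrelated combinations that need not match the sources.
```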

Wednesday, March 25, 2009

Are experiments always better?

This continues my "To Explain or To Predict?" argument (in brief: statistical models aimed at causal explanation will not necessarily be good predictors). And now, I move to a very early stage in the study design: how should we collect data?

A well-known notion is that experiments are preferable to observational studies. The main difference between experimental studies and observational studies is an issue of control. In experiments, the researcher can deliberately choose the "treatments", control the assignment of subjects to those "treatments", and then measure the outcome. In observational studies, by contrast, the researcher can only observe the subjects and measure variables of interest.

An experimental setting is therefore considered "cleaner": you manipulate what you can, and randomize what you can't (like the famous saying, "Block what you can and randomize what you can’t"). In his book Observational Studies, Paul Rosenbaum writes "Experiments are better than observational studies because there are fewer grounds for doubt." (p. 11).

Better for what purpose?

I claim that sometimes observational data are preferable. Why is that? Well, it all depends on the goal. If the goal is to infer causality, then indeed an experimental setting wins hands down (if feasible, of course). However, what if the goal is to accurately predict some measure for new subjects? Say, to predict which statisticians will write blogs.

Because prediction does not rely on causal arguments but rather on associations (e.g., "statistician blog writers attend more international conferences"), the choice between an experimental and observational setting should be guided by additional considerations beyond the usual ethical, economic, and feasibility constraints. For instance, for prediction we care about the closeness of the study environment and the reality in which we will be predicting; we care about measurement quality and its availability at the time of prediction.

An experimental setting might be too clean compared to the reality in which prediction will take place, thereby eliminating the ability of a predictive model to capture authentic “noisy” behavior. Hence, if the “dirtier” observational context contains association-type information that benefits prediction, it might be preferable to an experiment.

There are additional benefits of observational data for building predictive models:
  • Predictive reality: Not only can the predictive model benefit from the "dirty" environment, but the assessment of how well the model performs (in terms of predictive accuracy) will be more realistic if tested in the "dirty" environment.
  • Effect magnitude: Even if an input is shown to cause an output in an experiment, the magnitude of the effect within the experiment might not generalize to the "dirtier" reality.
  • The unknown: even scientists don't know everything! Predictive models can discover previously unknown relationships (associative or even causal). Hence, limiting ourselves to an experimental setting that is designed around, and limited to, our current knowledge can keep our knowledge stagnant and predictive accuracy low.
The Netflix Prize competition is a good example: if the goal were to find the causal underpinnings of movie ratings by users, then an experiment might well be useful. But if the goal is to predict user ratings of movies, then observational data like those released to the public are perhaps better than an experiment.

Tuesday, March 10, 2009

What R-squared is (and is not)

R-squared (aka "coefficient of determination", or for short, R2) is a popular measure used in linear regression to assess the strength of the linear relationship between the inputs and the output. In a model with a single input, R2 is simply the squared correlation coefficient between the input and output.
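A quick numerical check of this equivalence (a numpy sketch with simulated data of my own making, not from any particular study):

```python
# With a single input, R-squared equals the squared correlation between X and Y.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

slope, intercept = np.polyfit(x, y, 1)        # simple linear regression of y on x
resid = y - (intercept + slope * x)
r2 = 1 - resid.var() / y.var()                # 1 - SSE/SST

print(round(r2, 4), round(np.corrcoef(x, y)[0, 1] ** 2, 4))  # identical
```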

If you examine a few textbooks in statistics or econometrics, you will find several definitions of R2. The most common definition is "the percent of variation in the output (Y) explained by the inputs (X's)". Another definition is "a measure of predictive power" (check out Wikipedia!). And finally, R2 is often called a goodness-of-fit measure. Try a quick Google search of "R-squared" and "goodness of fit". I even discovered this Journal of Econometrics article entitled An R-squared measure of goodness of fit for some common nonlinear regression models.

The first definition is correct, although it might sound overly complicated to a non-statistical ear. Nevertheless, it is correct.

As to R2 being a predictive measure, this is an unfortunately popular misconception. There are several problems with R2 that make it a poor predictive accuracy measure:
  1. R2 always increases as you add inputs, whether they contain useful information or not. This technical inflation in R2 is usually overcome by using an alternative metric (R2-adjusted), which penalizes R2 for the number of inputs included.

  2. R2 is computed from a given sample of data that was used for fitting the linear regression model. Hence, it is "biased" towards those data and is therefore likely to be over-optimistic in measuring the predictive power of the model on new data. This is part of a larger issue related to performance evaluation metrics: the best way to assess the predictive power of a model is to test it on new data. To see more about this, check out my recent working paper "To Explain or To Predict?" Both issues are illustrated in the short simulation sketch below.
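Here is that sketch (in Python with numpy; the data and setup are invented for illustration): adding pure-noise inputs inflates the in-sample R2, while R2 computed on fresh holdout data tells a less flattering, more honest story.

```python
# Two problems with R-squared as a predictive measure, in one sketch:
# (1) in-sample R2 never decreases as pure-noise inputs are added;
# (2) R2 on new (holdout) data is typically lower than in-sample R2.
import numpy as np

rng = np.random.default_rng(4)

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

n = 60
x = rng.normal(size=(n, 1))
y = 1 + 2 * x[:, 0] + rng.normal(size=n)            # training data
x_new = rng.normal(size=(n, 1))
y_new = 1 + 2 * x_new[:, 0] + rng.normal(size=n)    # holdout data

for p in [0, 5, 20]:                                 # number of added pure-noise inputs
    noise_train = rng.normal(size=(n, p))
    noise_new = rng.normal(size=(n, p))
    X = np.column_stack([np.ones(n), x, noise_train])
    X_new = np.column_stack([np.ones(n), x_new, noise_new])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # fit on training data only
    print(p, round(r2(y, X @ beta), 3), round(r2(y_new, X_new @ beta), 3))
```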
Finally, the popular labeling of R2 as a goodness-of-fit measure is, in fact, incorrect. This was pointed out to me by Prof. Jeff Simonoff from NYU. R2 does not measure how well the data fit the linear model but rather how strong the linear relationship is. Jeff calls it a strength-of-fit measure.

Here's a cool example (thanks Jeff!): If you simulate two columns of uncorrelated normal variables and then fit a regression to the resulting pairs (call them X and Y), you will get a very low R2 (practically zero). This indicates that there is no linear relationship between X and Y. However, the model being fitted is actually a regression of Y on X with a slope of zero. In that sense, the data do fit the zero-slope model very well, yet R2 tells us nothing of this good fit.
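A sketch of Jeff's simulation, in Python for concreteness (the sample size and seed are arbitrary):

```python
# Jeff's example: two independent normal columns give R2 near zero,
# even though the data fit the (zero-slope) model quite well.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=500)
y = rng.normal(size=500)                # unrelated to x

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
r2 = 1 - resid.var() / y.var()

print(round(slope, 3))                  # estimated slope: close to 0
print(round(r2, 4))                     # R2: practically 0
print(round(resid.std(), 3))            # yet the residual spread is about 1, as the zero-slope model implies
```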

Monday, March 09, 2009

Start the Revolution

Variability is a key concept in statistics. The Greek letter Sigma has such importance that it is probably associated more closely with statistics than with Greek. Yet, if you have a chance to examine the bookshelf of introductory statistics textbooks in a bookstore or the library, you will notice that the variability between the zillions of textbooks, whether in engineering, business, or the social sciences, is nearly zero. And I am not only referring to price. I can close my eyes and place a bet on the topics that will show up in the table of contents of any textbook (summaries and graphs; basic probability; random variables; expected value and variance; conditional probability; the central limit theorem and sampling distributions; confidence intervals for the mean, proportion, two groups, etc.; hypothesis tests for one mean, comparing groups, etc.; linear regression). I can also predict the order of those topics quite accurately, although there might be a tiny bit of diversity in terms of introducing regression up front and then returning to it at the end.

You may say: if it works, then why break it? Well, my answer is: no, it doesn't work. What is the goal of an introductory statistics course taken by non-statistics majors? Is it to familiarize them with buzzwords in statistics? If so, then maybe this textbook approach works. But in my eyes the goal is very different: give them a taste of how statistics can really be useful! Teach 2-3 major concepts that will stick in their minds; give them a coherent picture of when the statistics toolkit (or "technology", as David Hand calls it) can be useful.

I was recently asked by a company to develop for their managers a module on modeling input-output relationships. I chose to focus on using linear/logistic regression, with an emphasis on how it can be used for predicting new records or for explaining input-output relationships (in a different way, of course); on defining the analysis goal clearly; on the use of quantitative and qualitative inputs and output; on how to use standard errors to quantify sampling variability in the coefficients; on how to interpret the coefficients and relate them to the problem (for explanatory purposes); on how to troubleshoot; on how to report results effectively. The reaction was "oh, we don't need all that, just teach them R-squares and p-values".

We've created monsters: the one-time students of statistics courses remember just buzzwords such as R-squared and p-values, yet they have no real clue what those measures are or how limited they are.

I keep checking on the latest in intro statistics textbooks and see excerpts from the publishers. New books have this bell or that whistle (some new software, others nicer examples), but they almost always revolve around the same mishmash of topics with no clear big story to remember.

A few textbooks have tried going the case-study avenue. One nice example is A Casebook for a First Course in Statistics and Data Analysis (by Chatterjee, Handcock, and Simonoff). It presents multiple "stories" with data, and how statistical methods are used to derive some insight. However, the authors suggest using this book as an addendum to the ordinary teaching method: "The most effective way to use these cases is to study them concurrently with the statistical methodology being learned".

I've taught a "core" statistics course to audiences of engineers of different sorts and to MBAs. I had to work very hard to make the sequence of seemingly unrelated topics appear coherent, which in retrospect I do not think is possible in a single statistics course. Yes, you can show how cool and useful the concepts of expected value and variance are in the context of risk and portfolio management, or how the distribution of the mean is used effectively in control charts for monitoring industrial processes, but then you must move on to the next chapter (usually sampling variance and the normal distribution), thereby erasing the point by piling totally different information on top of it. A first taste of statistics should be more pointed, more coherent, and more useful. Forget the details, focus on the big picture.

Bring on the revolution!