BzST | Business Analytics, Statistics, Teaching: August 2010

Tuesday, August 03, 2010

The PCA Debate

Recently, a posting on the Research Methods LinkedIn group asked what Principal Components Analysis (PCA) is in layman's terms and what it is useful for. The answers clearly reflected the two "camps": social science researchers and data miners. For data miners, PCA is a popular and useful data reduction method for reducing the dimension of a dataset with many variables. For social scientists, PCA is a type of factor analysis without a rotation step. The last sentence might sound cryptic to a non-social-scientist, so a brief explanation is in order: the goal of rotation is to simplify and clarify the interpretation of the principal components relative to each of the original variables. This is achieved by optimizing some criterion (see http://en.wikipedia.org/wiki/Factor_analysis#Rotation_methods for details).
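To make the two views concrete, here is a minimal sketch in Python with NumPy. It is only an illustration under stated assumptions: the data are simulated (six hypothetical variables driven by two hypothetical latent traits), the helper names pca_loadings and varimax are made up for this example, and varimax is just one of several rotation criteria. The first function is the data miner's PCA (reduce six variables to two components); the second adds the rotation step that social scientists use to make the loadings easier to interpret.

```python
import numpy as np

def pca_loadings(X, n_components):
    """PCA on the correlation matrix: return loadings and variance shares."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)              # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # indices of the largest ones
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    share = eigvals[order] / eigvals.sum()            # share of total variance
    return loadings, share

def varimax(L, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a loading matrix (illustrative helper)."""
    p, k = L.shape
    T = np.eye(k)                                     # start with no rotation
    prev = 0.0
    for _ in range(max_iter):
        LR = L @ T
        grad = L.T @ (LR**3 - LR @ np.diag((LR**2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(grad)
        T = U @ Vt                                    # nearest orthogonal matrix
        if s.sum() < prev * (1 + tol):                # criterion stopped improving
            break
        prev = s.sum()
    return L @ T, T

# Simulated data (an assumption for illustration, not real survey data):
# variables 1-3 are driven by one latent trait, variables 4-6 by another.
rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 2))
W = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
              [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
X = factors @ W.T + 0.5 * rng.normal(size=(500, 6))

loadings, share = pca_loadings(X, n_components=2)
rotated, T = varimax(loadings)

print("variance share of the two components:", np.round(share, 2))
print("unrotated loadings:\n", np.round(loadings, 2))
print("varimax-rotated loadings:\n", np.round(rotated, 2))
```

The rotation does not change how much of each variable's variance the two components capture together (the row sums of squared loadings stay the same). It only redistributes that variance across the components so that each variable tends to load mainly on one of them.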

Now here comes the explain vs. predict divide:

PCA and factor analysis often produce practically similar results in terms of "rearranging" the total variance of the data. Hence, PCA is far more common in data mining than factor analysis. In contrast, social scientists consider PCA inferior to factor analysis because their goal is to uncover underlying theoretical constructs. Costello & Osborne (in the 2005 issue of the online journal Practical Assessment, Research & Evaluation) give an overview of PCA and factor analysis, discuss the debate between the two, and summarize:

We suggest that factor analysis is preferable to principal components analysis. Components analysis is only a data reduction method. It became common decades ago when computers were slow and expensive to use; it was a quicker, cheaper alternative to factor analysis... However, researchers rarely collect and analyze data without an a priori idea about how the variables are related (Floyd & Widaman, 1995). The aim of factor analysis is to reveal any latent variables that cause the manifest variables to covary.

Moreover, the choice of rotation method can lead to either correlated or uncorrelated factors. While data miners would tend to opt for uncorrelated factors (and therefore would stick to the uncorrelated principal components with no rotation at all), social scientists often choose a rotation that leads to correlated factors! Why? Costello & Osborne explain: "In the social sciences we generally expect some correlation among factors, since behavior is rarely partitioned into neatly packaged units that function independently of one another."
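The correlated-versus-uncorrelated point can be checked numerically. The sketch below is an illustration only: the data are simulated, and the two transformation matrices are hand-picked stand-ins for an orthogonal rotation and an oblique one, not the output of an actual rotation procedure such as varimax, promax or oblimin. It shows that principal component scores are uncorrelated even when the variables themselves are correlated, that an orthogonal rotation of the standardized scores keeps them uncorrelated, and that an oblique transformation yields correlated factors.

```python
import numpy as np

# Simulated data (an assumption for illustration): six variables driven by
# two latent traits that are themselves correlated (r = 0.5).
rng = np.random.default_rng(1)
n = 1000
latent = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)
W = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
              [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
X = latent @ W.T + 0.5 * rng.normal(size=(n, 6))

# PCA on standardized variables: keep the two leading components.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
top2 = np.argsort(eigvals)[::-1][:2]
scores = Z @ eigvecs[:, top2]
S = scores / scores.std(axis=0, ddof=1)        # unit-variance component scores

# Hand-picked orthogonal rotation (30 degrees): axes stay perpendicular.
c, s = np.cos(np.pi / 6), np.sin(np.pi / 6)
orthogonal = np.array([[c, -s], [s, c]])

# Hand-picked oblique transformation: the two axes are 60 degrees apart.
oblique = np.array([[1.0, np.cos(np.pi / 3)],
                    [0.0, np.sin(np.pi / 3)]])

corr_pc = np.corrcoef(S, rowvar=False)[0, 1]
corr_orth = np.corrcoef(S @ orthogonal, rowvar=False)[0, 1]
corr_oblique = np.corrcoef(S @ oblique, rowvar=False)[0, 1]

print("correlation between principal component scores:", round(corr_pc, 6))
print("after an orthogonal rotation:", round(corr_orth, 6))
print("after an oblique transformation:", round(corr_oblique, 6))
```

In this toy setup the correlation after the oblique transformation equals the cosine of the angle between the two new axes, which is why a data miner who wants uncorrelated inputs stays with the unrotated components, while a social scientist who expects related constructs is willing to let the axes tilt toward each other.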

At the end of the day, it comes down to the different positions that causal-explanatory scientists and data miners occupy on the data-theory continuum. In the social sciences, researchers assume an underlying causal theory before considering any data or analysis. The "manifest world" is only useful for uncovering the "latent world". Hence, data and analysis methods are viewed only through the lens of theory. In contrast, in data mining the focus is at the data level, or the "manifest world", because often there is no underlying theory, or because the goal is to predict new (manifest) data or to capture an association at the measurable data level.