Saturday, June 13, 2009

Histograms in Excel

Histograms are very useful charts for displaying the distribution of a numerical measurement. The idea is to bucket the numerical measurement into intervals, and then to display the frequency (or percentage) of records in each interval.

Two ways to generate a histogram in Excel are:
  1. Create a pivot table, with the measurement of interest in the Column area and Count of that measurement (or of any measurement) in the Data area. Then right-click the Column area and choose "Group and Show Detail > Group" to create the intervals. Now simply run the chart wizard to create the matching chart. You will still need to do some fixing to get a legal histogram (explanation below).
  2. Using the Data Analysis add-in (usually available in an ordinary installation; it only requires enabling in the Tools > Add-Ins menu): the Histogram function here creates only the frequency table (the name "Histogram" is misleading!). You will then need to create a bar chart that reads from this table and fix it to get a legal histogram (explanation below).
Needed Fix: Histogram vs. Bar Chart
Background: Histograms and bar charts might appear similar, because in both the bar heights denote frequency (or percentage). However, they differ in a fundamental way: bar charts are meant for displaying categorical measurements, while histograms are meant for displaying numerical measurements. This is reflected in the x-axis: in bar charts it conveys categories (e.g., "red", "blue", "green"), whereas in histograms it conveys the numerical intervals. Hence, in bar charts the order of the bars is unimportant and we can swap the "red" bar with the "green" bar. In contrast, in histograms the interval order cannot be changed: the interval 20-30 can only be located between the interval 10-20 and the interval 30-40.

To convey this difference, a defining feature of a histogram is that there are no gaps between the bars (the neighboring intervals "glue" to each other). The entire shape formed by the touching bars conveys important information, not only the individual bars. Hence, the default chart that Excel creates using either of the two methods above will not be a legal and useful histogram unless you remove those gaps. To do that, double-click on any of the bars and, in the Options tab, reduce the Gap width to 0.

Method comparison:
The pivot-table method is much faster, yields a chart that is linked to the pivot table and is interactive, and does not require the Data Analysis add-in. However, there is a serious flaw with the pivot-table method: if some of the intervals contain 0 records, those intervals will be completely absent from the pivot table, which means that the chart will be missing "bars" of height zero for those intervals. The resulting histogram will therefore be wrong!
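
For readers working outside Excel, here is a minimal Python sketch (with made-up data and hypothetical bin edges) of the same logic: build the frequency table over a fixed set of intervals so that zero-count intervals are kept, and draw the bars with no gaps between them.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up numerical measurement (e.g., order amounts).
rng = np.random.default_rng(1)
values = rng.gamma(shape=2.0, scale=30.0, size=200)

# Fixed interval edges: np.histogram returns a count for EVERY interval,
# including intervals with 0 records (unlike the pivot-table approach).
edges = np.arange(0, 425, 25)
counts, _ = np.histogram(values, bins=edges)

# Touching bars: each bar's width equals the interval width, so there are no gaps.
plt.bar(edges[:-1], counts, width=np.diff(edges), align="edge", edgecolor="black")
plt.xlabel("Measurement interval")
plt.ylabel("Frequency")
plt.title("Histogram: zero-count intervals kept, no gaps between bars")
plt.show()
```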


Thursday, April 23, 2009

Fragmented

In the process of planning the syllabus for my next PhD course on "Scientific Data-Collection", to be offered for the third time in Spring 2009, I have realized how fragmented the education of statisticians is, especially in the applied courses. The applied part of a typical degree in statistics will include a course in Design of Experiments, one on Surveys and Sampling, one on Computing with the latest statistical software (currently R), perhaps a Quality Control course, and of course the usual Regression, Multivariate Analysis, and other modeling courses.

Because there is usually very little overlap between the courses (perhaps the terms "sigma" and "p-value" are repeated across them), or sometimes extreme overlap (we learned ANOVA both in a "Regression and ANOVA" course and again in "Design of Experiments"), I initially compartmentalized the courses into separate conceptual entities. Each had its own terminology and point of view. It took me a while to even grasp the difference between the "probability" part and the "statistics" part of the "Intro to Probability and Statistics" course. I can be cynical and attribute the mishmash to the diverse backgrounds of the faculty, but I suppose it is more due to "historical reasons".

After a while, to make better sense of my overall profession, I was able to cluster the courses into broader groups such as "statistical models", "math and probability background", "computing", etc.
But truthfully, this too is very unsatisfactory. It is a very limited view of the "statistical purpose" of the tools, one that borrows its labels from the textbooks used in each subject.

For my Scientific Data-Collection course, where students come from a wide range of business disciplines, I cover three main data collection methods (all within an Internet environment): Web collection (crawling, APIs, etc.), online surveys, and online/lab experiments. In a statistics curriculum you would never find such a combo. You won't even find a statistics textbook that covers all three topics. So why bundle them together? Because these are the main tools that researchers use today to gather data!

For each of the three topics we discuss how to design effective data collection schemes and tools. In addition to statistical considerations (guaranteeing that the collected data will be adequate for answering the research question of interest) and resource constraints (time, money, etc.), there are two additional aspects: ethical and technological. These are extremely important and are must-knows for any hands-on researcher.

Thinking about the non-statistical aspects of data collection has led me to a broader and more conceptual view of the statistics profession. I like David Hand's definition of statistics as a technology (rather than a science). It means that we should think about our different methods as technologies within a context. Rather than thinking of our knowledge as a toolkit (with a hammer, a screwdriver, and a few other tools), we should generalize across the different methods in terms of their use by non-statisticians. How and when do psychologists use surveys? Experiments? Regression models? t-tests? [Or are they compartmentalizing those according to the courses that they took from Statistics faculty?] How do chemical engineers collect, analyze, and evaluate their data?

Ethical considerations are rarely discussed in statistics courses, although they usually are discussed in "research methods" graduate courses in the social sciences. Yet ethical considerations are very closely related to the statistical design. Limitations on sample size can arise due to copyright law (web crawling), due to patient safety (clinical trials), or due to non-response rates (surveys). Not to mention that every academic involved in human-subjects research should be educated about Institutional Review Boards and the study approval process. Similarly, technological issues are closely related to sample size and to the quality of the generated data. Servers that go down during a web crawl (or because of the web crawl!), email surveys that do not display properly in Firefox or get caught by spam filters, and overly sophisticated technological experiments are all issues that statistics students should also be educated about.

I propose to re-design the statistics curriculum around two coherent themes: "Modeling and Data Analysis Technologies", and "Data Collection / Study Design Technologies". An intro course should present these two different components, their role, and their links so that students will have context.

And, of course, the "Modeling and Data Analysis" theme should be clearly broken down into "explaining", "predicting", and "describing".

Saturday, April 18, 2009

Collecting online data (for research)

In the new era of large amounts of publicly available data, an issue that is sometimes overlooked is ethical data collection. Whereas for experimental studies involving humans we have clear guidelines and an organizational process for assessing and approving data collection (in the US, via the IRB), collecting observational data is much more ambiguous. For instance, if I want to collect data on 50,000 book titles on Amazon, including their ratings, reviews, and cover images, is it ethical to collect this information by web crawling? A first thought might be "why not? The information is there and I am not taking anything from anyone." However, there are hidden costs and risks here that must be considered. First, in the above example, the web crawler mimics manual browsing and thereby accesses Amazon's server; that is one cost to Amazon. Second, Amazon posts this information for buyers, for the purpose of generating revenue; accessing it with no intention to purchase is a misuse of the public information. Finally, one must ask whether there is any risk to the data provider (for instance, heavy access can slow down the provider's server, thereby slowing down or even denying access to actual potential buyers).

When the goal of the data collection is research, then another factor to consider is the benefits of the research study to society, to scientific research or "general knowledge", and perhaps even to the company.

Good practice involves weighing the costs, risks, and benefits to the data provider, designing your collection accordingly, and letting the data provider know about your intention. Careful consideration of the actual sample size is therefore still important even in this new environment. An interesting paper by Allen, Burk, and Davis (Academic Data Collection in Electronic Environments: Defining Acceptable Use of Internet Resources) discusses these issues and offers guidelines for "acceptable use" of internet resources.
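
To make "designing your collection accordingly" concrete, here is a minimal, purely illustrative Python sketch of a considerate collection loop; the provider URL, page list, delay, and contact address are all hypothetical, and the point is only to show capping the sample size, honoring robots.txt, identifying yourself, and rate-limiting requests.

```python
import time
import urllib.robotparser

import requests

BASE = "https://www.example.com"                  # hypothetical data provider
PAGES = [f"{BASE}/item/{i}" for i in range(500)]  # cap the sample size up front

# Check the provider's stated crawling rules before fetching anything.
robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{BASE}/robots.txt")
robots.read()

# Identify yourself and your purpose so the provider can contact you.
headers = {"User-Agent": "academic-research-bot (contact: researcher@university.edu)"}

collected = []
for url in PAGES:
    if not robots.can_fetch(headers["User-Agent"], url):
        continue                                  # respect the provider's limits
    response = requests.get(url, headers=headers, timeout=10)
    if response.ok:
        collected.append(response.text)
    time.sleep(2)                                 # rate-limit to avoid burdening the server
```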

These days more and more companies (e.g., eBay and Amazon) are moving to "push" technology, making their data available for collection via APIs and RSS feeds. Obtaining data in this way sidesteps many of the ethical and legal concerns, but one is then limited to the data that the source has chosen to provide. Moreover, the amount of data is usually limited. Hence, I believe that web crawling will continue to be used, but in combination with APIs and RSS the extent of crawling can be reduced.

Thursday, March 26, 2009

Principal Components Analysis vs. Factor Analysis

Here is an interesting example of how similar mechanics lead to two very different statistical tools. Principal Components Analysis (PCA) is a powerful method for data compression, in the sense of capturing the information contained in a large set of variables by a smaller set of linear combinations of those variables. As such, it is widely used in applications that require data compression, such as visualization of high-dimensional data and prediction.

Factor Analysis (FA), technically considered a close cousin of PCA, is popular in the social sciences, and is used for the purpose of discovering a small number of 'underlying factors' from a larger set of observable variables. Although PCA and FA are both based on orthogonal linear combinations of the original variables, they are very different conceptually: FA tries to relate the measured variables to underlying theoretical concepts, while PCA operates only at the measurement level. The former is useful for explaining; the latter for data reduction (and therefore prediction).

Richard Darlington, a Professor Emeritus of Psychology at Cornell, has a nice webpage describing the two. He tries to address the confusion between PCA and FA by first introducing FA and only then PCA, which is the opposite of what you'll find in textbooks. Darlington comments:
I have introduced principal component analysis (PCA) so late in this chapter primarily for pedagogical reasons. It solves a problem similar to the problem of common factor analysis, but different enough to lead to confusion. It is no accident that common factor analysis was invented by a scientist (differential psychologist Charles Spearman) while PCA was invented by a statistician. PCA states and then solves a well-defined statistical problem, and except for special cases always gives a unique solution with some very nice mathematical properties. One can even describe some very artificial practical problems for which PCA provides the exact solution. The difficulty comes in trying to relate PCA to real-life scientific problems; the match is simply not very good.
Machine learners are very familiar with PCA as well as other compression-type algorithms such as Singular Value Decomposition (the most heavily used compression technique in the Netflix Prize competition). Such compression methods are also used as alternatives to variable selection algorithms, such as forward selection and backward elimination. Rather than retain or remove "complete" variables, combinations of them are used.
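
As a rough illustration (not tied to any particular application), here is a minimal Python/NumPy sketch of PCA-style compression via the SVD, with made-up data: instead of retaining or removing whole variables, it keeps the top k linear combinations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10)) @ rng.normal(size=(10, 10))  # 100 records, 10 correlated variables

Xc = X - X.mean(axis=0)                    # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 3                                      # number of principal components to keep
scores = Xc @ Vt[:k].T                     # compressed representation (100 x k)
explained = (s[:k] ** 2).sum() / (s ** 2).sum()
print(f"Top {k} components capture {explained:.0%} of the total variance")
```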

I recently learned of Independent Components Analysis (ICA) from Scott Nestler, a former PhD student in our department. He used ICA in his dissertation on portfolio optimization. The idea is similar to PCA, except that the resulting components are not only uncorrelated, but actually independent.
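
A tiny illustration of that distinction, assuming scikit-learn is available (and using made-up signals rather than Scott's portfolio data): PCA returns uncorrelated components, while FastICA tries to recover components that are statistically independent.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.column_stack([np.sin(3 * t), np.sign(np.sin(5 * t))])  # two independent signals
X = sources @ np.array([[1.0, 0.5], [0.5, 1.0]])                    # observed mixtures

pca_components = PCA(n_components=2).fit_transform(X)               # uncorrelated, variance-ordered
ica_components = FastICA(n_components=2, random_state=0).fit_transform(X)  # approximately independent
```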

Wednesday, March 25, 2009

Are experiments always better?

This continues my "To Explain or To Predict?" argument (in brief: statistical models aimed at causal explanation will not necessarily be good predictors). And now, I move to a very early stage in the study design: how should we collect data?

A well-known notion is that experiments are preferable to observational studies. The main difference between experimental and observational studies is one of control: in experiments, the researcher can deliberately choose the "treatments", control the assignment of subjects to those "treatments", and then measure the outcome, whereas in observational studies the researcher can only observe the subjects and measure variables of interest.

An experimental setting is therefore considered "cleaner": you manipulate what you can and randomize what you can't (like the famous saying "Block what you can and randomize what you can't"). In his book Observational Studies, Paul Rosenbaum writes, "Experiments are better than observational studies because there are fewer grounds for doubt" (p. 11).

Better for what purpose?

I claim that sometimes observational data are preferable. Why is that? Well, it all depends on the goal. If the goal is to infer causality, then indeed an experimental setting wins hands down (if feasible, of course). However, what if the goal is to accurately predict some measure for new subjects? Say, to predict which statisticians will write blogs.

Because prediction relies not on causal arguments but on associations (e.g., "statistician blog writers attend more international conferences"), the choice between an experimental and an observational setting should be guided by additional considerations beyond the usual ethical, economic, and feasibility constraints. For instance, for prediction we care about the closeness of the study environment to the reality in which we will be predicting; we care about measurement quality and its availability at the time of prediction.

An experimental setting might be too clean compared to the reality in which prediction will take place, thereby eliminating the ability of a predictive model to capture authentic “noisy” behavior. Hence, if the “dirtier” observational context contains association-type information that benefits prediction, it might be preferable to an experiment.

There are additional benefits of observational data for building predictive models:
  • Predictive reality: Not only can the predictive model benefit from the "dirty" environment, but the assessment of how well the model performs (in terms of predictive accuracy) will be more realistic if tested in the "dirty" environment.
  • Effect magnitude: Even if an input is shown to cause an output in an experiment, the magnitude of the effect within the experiment might not generalize to the "dirtier" reality.
  • The unknown: even scientists don't know everything! Predictive models can discover previously unknown relationships (associative or even causal). Hence, limiting ourselves to an experimental setting that is designed around, and limited by, our current knowledge can keep our knowledge stagnant and predictive accuracy low.
The Netflix prize competition is a good example: if the goal were to find the causal underpinnings of movie ratings by users, then an experiment may have been useful. But if the goal is to predict user ratings of movies, then observational data like those released to the public are perhaps better than an experiment.

Tuesday, March 10, 2009

What R-squared is (and is not)

R-squared (aka "coefficient of determination", or for short, R2) is a popular measure used in linear regression to assess the strength of the linear relationship between the inputs and the output. In a model with a single input, R2 is simply the squared correlation coefficient between the input and output.

If you examine a few textbooks in statistics or econometrics, you will find several definitions of R2. The most common definition is "the percent of variation in the output (Y) explained by the inputs (X's)". Another definition is "a measure of predictive power" (check out Wikipedia!). And finally, R2 is often called a goodness-of-fit measure; try a quick Google search of "R-squared" and "goodness of fit". I even discovered a Journal of Econometrics article entitled "An R-squared measure of goodness of fit for some common nonlinear regression models".

The first definition is correct, although it might sound overly complicated to a non-statistical ear. Nevertheless, it is correct.

As to R2 being a predictive measure, this is an unfortunately popular misconception. There are several problems with R2 that make it a poor predictive accuracy measure:
  1. R2 always increases as you add inputs, whether or not they contain useful information. This technical inflation in R2 is usually overcome by using an alternative metric (adjusted R2), which penalizes R2 for the number of inputs included (see the short sketch after this list).

  2. R2 is computed from a given sample of data that was used for fitting the linear regression model. Hence, it is "biased" towards those data and is therefore likely to be over-optimistic in measuring the predictive power of the model on new data. This is part of a larger issue related to performance evaluation metrics: the best way to assess the predictive power of a model is to test it on new data. To see more about this, check out my recent working paper "To Explain or To Predict?"
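
As a quick reference for item 1 above, here is the standard adjustment written out as a small Python sketch (n is the number of observations, p the number of inputs); the numbers are made up.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Penalize R-squared for the number of inputs p, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Plain R2 can only go up when inputs are added, but the adjusted version
# drops when the extra inputs do not pull their weight:
print(adjusted_r2(0.500, n=100, p=5))    # about 0.473
print(adjusted_r2(0.505, n=100, p=10))   # about 0.449
```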
Finally, the popular labeling of R2 as a goodness-of-fit measure is, in fact, incorrect. This was pointed out to me by Prof. Jeff Simonoff from NYU. R2 does not measure how well the data fit the linear model but rather how strong the linear relationship is. Jeff calls it a strength-of-fit measure.

Here's a cool example (thanks Jeff!): if you simulate two columns of uncorrelated normal variables and then fit a regression to the resulting pairs (call them X and Y), you will get a very low R2 (practically zero). This indicates that there is no linear relationship between X and Y. However, the model that generated the data is in fact a regression of Y on X with a slope of zero. In that sense, the data do fit the zero-slope model very well, yet R2 tells us nothing about this good fit.
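
Here is a minimal simulation of Jeff's example in Python (any statistical package would do):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=500)
y = rng.normal(size=500)        # generated independently of x: the true slope is 0

fit = stats.linregress(x, y)
print(f"R-squared: {fit.rvalue ** 2:.4f}")   # practically zero
print(f"slope:     {fit.slope:.4f}")         # close to the true slope of 0

# The low R2 reflects a weak linear relationship, not a poorly fitting model:
# the data were generated from (and fit well) a zero-slope model.
```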

Monday, March 09, 2009

Start the Revolution

Variability is a key concept in statistics. The Greek letter Sigma has such importance that it is probably associated more closely with statistics than with Greek. Yet, if you have a chance to examine the shelf of introductory statistics textbooks in a bookstore or the library, you will notice that the variability among the zillions of textbooks, whether in engineering, business, or the social sciences, is nearly zero. And I am not only referring to price. I can close my eyes and place a bet on the topics that will show up in the table of contents of any textbook (summaries and graphs; basic probability; random variables; expected value and variance; conditional probability; the central limit theorem and sampling distributions; confidence intervals for the mean, a proportion, two groups, etc.; hypothesis tests for one mean, comparing groups, etc.; linear regression). I can also predict the order of those topics quite accurately, although there might be a tiny bit of diversity in terms of introducing regression up front and then returning to it at the end.

You may say: if it works, then why break it? Well, my answer is: no, it doesn't work. What is the goal of an introductory statistics course taken by non-statistics majors? Is it to familiarize them with buzzwords in statistics? If so, then maybe this textbook approach works. But in my eyes the goal is very different: give them a taste of how statistics can really be useful! Teach 2-3 major concepts that will stick in their minds; give them a coherent picture of when the statistics toolkit (or "technology", as David Hand calls it) can be useful.

I was recently asked by a company to develop, for their managers, a module on modeling input-output relationships. I chose to focus on using linear/logistic regression, with an emphasis on how it can be used for predicting new records or for explaining input-output relationships (in a different way, of course); on defining the analysis goal clearly; on the use of quantitative and qualitative inputs and outputs; on how to use standard errors to quantify sampling variability in the coefficients; on how to interpret the coefficients and relate them to the problem (for explanatory purposes); on how to troubleshoot; and on how to report results effectively. The reaction was "oh, we don't need all that, just teach them R-squares and p-values".

We've created monsters: the one-time students of statistics courses remember just buzzwords such as R-squared and p-values, yet they have no real clue what those are or how limited they are in almost any sense.

I keep checking on the latest introductory statistics textbooks and see excerpts from the publishers. New books have this bell or that whistle (some new software, others nicer examples), but they almost always revolve around the same mishmash of topics with no clear big story to remember.

A few textbooks have tried the case-study avenue. One nice example is A Casebook for a First Course in Statistics and Data Analysis (by Chatterjee, Handcock, and Simonoff). It presents multiple "stories" with data and shows how statistical methods are used to derive insight. However, the authors suggest using the book as an addendum to the ordinary teaching method: "The most effective way to use these cases is to study them concurrently with the statistical methodology being learned".

I've taught a "core" statistics course to audiences of engineers of different sorts and to MBAs. I had to work very hard to make the sequence of seemingly unrelated topics appear coherent, which in retrospect I do not think is possible in a single statistics course. Yes, you can show how cool and useful the concepts of expected value and variance are in the context of risk and portfolio management, or how the distribution of the mean is used effectively in control charts for monitoring industrial processes, but then you must move on to the next chapter (usually sampling variance and the normal distribution), thereby erasing the point by piling totally different information on top of it. A first taste of statistics should be more pointed, more coherent, and more useful. Forget the details; focus on the big picture.

Bring on the revolution!