Monday, April 25, 2011

Google Spreadsheets for teaching probability?

In business schools it is common to teach statistics courses using Microsoft Excel, due to its wide accessibility and the familiarity of business students with the software. There is a large debate regarding this practice, but at this point the reality is clear: the figure that I am familiar with is about 50% of basic stat courses in b-schools use Excel and 50% use statistical software such as Minitab or JMP.

Another trend is moving from offline software to "cloud computing" -- Software such as www.statcrunch.com offer basic stat functions in an online, collaborative, social-networky style.

Following the popularity of spreadsheet software and the cloud trend, I asked myself whether the free Google Spreadsheets can actually do the job. This is part of my endeavors to find free (or at least widely accessible) software for teaching basic concepts. While Google Spreadsheets does have quite an extensive function list, I discovered that its current computing is very limited. For example, computing binomial probabilities using the function BINOMDIST is limited to a sample size of about 130 (I did report this problem). Similarly, HYPGEOMDIST results in overflow errors for reasonably small sample and population sizes.


From the old days when we used to compute binomial probabilities manually, I am guessing that whoever programmed these functions forgot to use the tricks that avoid computing high factorials in n-choose-k type calculations...


Saturday, April 16, 2011

Moving Average chart in Excel: what is plotted?

In my recent book Practical Time Series Forecasting: A Practical Guide, I included an example of using Microsoft Excel's moving average plot to suppress monthly seasonality. This is done by creating a line plot of the series over time and then Add Trendline > Moving Average (see my post about suppressing seasonality). The purpose of adding the moving average trendline to a time plot is to better see a trend in the data, by suppressing seasonality.

A moving average with window width w means averaging across each set of w consecutive values. For visualizing a time series, we typically use a centered moving average with w = season.  In a centered moving average, the value of the moving average at time t (MAt) is computed by centering the window around time t and averaging across the w values within the window. For example, if we have daily data and we suspect a day-of-week effect, we can suppress it by a centered moving average with w=7, and then plotting the MA line.

An observant participant in my online course Forecasting discovered that Excel's moving average does not produce what we'd expect: Instead of averaging over a window that is centered around a time period of interest, it simply takes the average of the last w months (called a "trailing moving average"). While trailing moving averages are useful for forecasting, they are inferior for visualization, especially when the series has a trend. The reason is that the trailing moving average "lags behind". Look at the figure below, and you can see the difference between Excel's trailing moving average (black) and a centered moving average (red).



The fact that Excel produces a trailing moving average in the Trendline menu is quite disturbing and misleading. Even more disturbing is the documentation, which incorrectly describes the trailing MA that is produced:
"If Period is set to 2, for example, then the average of the first two data points is used as the first point in the moving average trendline. The average of the second and third data points is used as the second point in the trendline, and so on."
For more on moving averages, see here:

Saturday, April 09, 2011

Visualizing time series: suppressing one pattern to enhance another pattern

Visualizing a time series is an essential step in exploring its behavior. Statisticians think of a time series as a combination of four components: trend, seasonality, level and noise. All real-world series contain a level and noise, but not necessarily a trend and/or seasonality. It is important to determine whether trend and/or seasonality exist in a series in order to choose appropriate models and methods for descriptive or forecasting purposes. Hence, looking at a time plot,  typical questions include:
  • is there a trend? if so, what type of function can approximate it? (linear, exponential, etc.) is the trend fixed throughout the period or does it change over time? 
  • is there seasonal behavior? if so, is seasonality additive or multiplicative? does seasonal behavior change over time?
Exploring such questions using time plots (line plots of the series over time) is enhanced by suppressing one type of pattern for better visualizing other patterns. For example, suppressing seasonality can make a trend more visible. Similarly, suppressing a trend can help see seasonal behavior. How do we suppress seasonality? Suppose that we have monthly data and there is apparent annual seasonality. To suppress seasonality (also called seasonal adjustment), we can
  1. Plot annual data (either annual averages or sums)
  2. Plot a moving average (an average over a window of 12 months centered around each particular month)
  3. Plot 12 separate series, one for each month (e.g., one series for January, another for February and so on)
  4. Fit a model that captures monthly seasonality (e.g., a regression model with 11 monthly dummies) and look at the residual series
An example is shown in the Figure. The top left plot is the original series (showing monthly ridership on Amtrak trains). The bottom left panel shown a moving average line, suppressing seasonality and showing the trend. The top right panel shows a model that captures the seasonality. The lower left panel shows the residuals from the model, again enhancing the trend.

For further details and examples, see my recently published book Practical Time Series Forecasting: A Hands On Guide (available in soft-cover and as an eBook).

Thursday, December 23, 2010

No correlation -> no causation?

Applied Multiple Regression/Correlation Analysis for the Behavioral SciencesI found an interesting variation on the "correlation does not imply causation" mantra in the book Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences by Cohen et al. (apparently one of the statistics bibles in behavioral sciences). The quote (p.7) looks like this:
Correlation does not prove causation; however, the absence of correlation implies the absence of the existence of a causal relationship
Let's let the first part rest in peace. At first glance, the second part seems logical: you find no correlation, then how can there be causation? However, after further pondering I reached the conclusion that this logic is flawed, and that one might observe no correlation when in fact there exists underlying causation. The reason is that causality is typically discussed at the conceptual level while correlation is computed at the measurable data level.

Where is Waldo?
Consider an example where causality is hypothesized at an unmeasurable conceptual level, such as "higher creativity leads to more satisfaction in life". Computing the correlation between "creativity" and "satisfaction" requires operationalizing these concepts into measurable variables, that is, identifying measurable variables that adequately represent these underlying concepts. For example, answers to survey questions regarding satisfaction in life might be used to operationalize "satisfaction", while a Rorschach test might be used to measure "creativity". This process of operationalization obviously does not lead to perfect measures, not to mention that data quality can be sufficiently low to produce no correlation even if there exists an underlying causal relationship.

In short, the absence of correlation can also imply that the underlying concepts are hard to measure, are inadequately measured, or that the quality of the measured data is too low (i.e., too noisy) for discovering a causal underlying relationship.

Monday, December 13, 2010

Discovering moderated relationship in the era of large samples

I am currently visiting the Indian School of Business (ISB) and enjoying their excellent library. As in my student days, I roam the bookshelves and discover books on topics that I know little, some, or a lot. Reading and leafing through a variety of books, especially across different disciplines, gives some serious points for thought.

As a statistician I have the urge to see how statistics is taught and used in other disciplines. I discovered an interesting book coming from the psychology literature by Herman Aguinas called Regression Analysis for Categorical Moderators. "Moderators" in statistician language is "interactions". However, when social scientists talk about moderated relationships or moderator variables, there is no symmetry between the two variables that create the interaction. For example if X1=education level, X2=Gender, and Y=Satisfaction at work, then an inclusion of the moderator X1*X2 would follow a direct hypothesis such as "education level affects satisfaction at work differently for women and for men."

Now to the interesting point: Aguinis stresses the scientific importance of discovering moderated relationships and opens the book with the quote:
"If we want to know how well we are doing in the biological, psychological, and social sciences, an index that will serve us well is how far we have advanced in our understanding of the moderator variables of our field."        --Hall & Rosenthal, 1991
Discovering moderators is important for understanding the bounds of generalizability as well as for leading to adequate policy recommendations. Yet, it turns out that "Moderator variables are difficult to detect even when the moderator test is the focal issue in a research study and a researcher has designed the study specifically with the moderator test in mind."

One main factor limiting the ability to detect moderated relationships (which tend to have small effects) is statistical power. Aguinas describes simulation studies showing this:
a small effect size was typically undetected when sample size was as large as 120, and ...unless a sample size of at least 120 was used, even ... medium and large moderating effects were, in general, also undetected.
This is bad news. But here is the good news: today, even researchers in the social sciences have access to much larger datasets! Clearly n=120 is in the past. Since this book has come out in 2004, have there been large-sample studies of moderated relationships in the social sciences?

I guess that's where searching electronic journals is the way to go...

Tuesday, November 16, 2010

November Analytics magazine on BI

click to read the latest issue
A bunch of interesting articles about business analytics and predictive analytics from a managerial point of view, in the November issue of INFORMS Analytics magazine.

Sunday, November 14, 2010

Data visualization in the media: Interesting video

A colleague who knows my fascination with data visualization pointed me to a recent interesting video created by Geoff McGhee on Journalism in the Age of Data. In this 8-part video, he interviews media people who create visualizations for their websites at the New York Times, Washington Post, CNBC, and more. It is interesting to see their view of why interactive visualization might be useful to their audience, and how it is linked to "good journalism".

Also interviewed are a few visualization interface developers (e.g., IBM's Many Eyes designers) as well as Infographics experts and participants at the major Inforgraphics conference in Pamplona, Spain. The line between beautiful visualizations (art) and effective ones is discussed in Part IV ("too sexy for its own good" - Gert Nielsen) - see also John Grimwade's article.


Journalism in the Age of Data from Geoff McGhee on Vimeo.

The videos can be downloaded as a series of 8 podcasts, for those with narrower bandwidth.