Thursday, July 19, 2007

Handling outliers with a smile

Here's one of the funniest statistics cartoons that I've seen (thanks, Adi Gadwale!). First you laugh, then you cry.


It also reminds me of the claim by the famous industrial statistician George Box: "All models are wrong, but some are useful."

Wednesday, July 11, 2007

The Riverplot: Visualizing distributions over time

The boxplot is one of the neatest visualizations for examining the distribution of values, or for comparing distributions. It is more compact than a histogram in that it presents only the median, the two quartiles, the range of the data, and any outliers. It also requires less user input than a histogram (where the user usually has to determine the number of bins). I view the boxplot and histogram as complements, and examining both is good practice.
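
To make the boxplot's ingredients concrete, here is a minimal Python sketch that computes them from a list of values. The function name `boxplot_summary` is made up for illustration, and the rule for flagging outliers (anything more than 1.5 interquartile ranges beyond the quartiles) is the common Tukey convention, which is an assumption on my part; software packages differ both in that rule and in how they interpolate quartiles.

```python
from statistics import quantiles


def boxplot_summary(values, whisker=1.5):
    """Return the statistics a boxplot displays (hypothetical helper).

    Outliers are flagged with the 1.5 * IQR convention, which is an
    assumption here, not the only possible rule.
    """
    data = sorted(values)
    # Quartiles with the "inclusive" interpolation method; other methods
    # can give slightly different values on small samples.
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],    # smallest non-outlying value
        "whisker_high": inside[-1],  # largest non-outlying value
        "outliers": outliers,
    }


summary = boxplot_summary([2, 3, 4, 5, 6, 7, 8, 9, 40])
print(summary)
```

Note that nothing like a bin count has to be chosen; the only tuning knob is the whisker multiplier, and most people leave it at its default.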

But how can you visualize a distribution of values over time? Well, a series of boxplots often does the trick. But if the frequency is very high (e.g., ticker data) and the time scale of interest can be considered continuous, then an alternative is the River Plot. This is a visualization that we developed together with our colleagues at the Human-Computer Interaction Lab on campus. It is essentially a "continuous boxplot" that displays the median and quartiles (and potentially the range or other statistics) at each point in time. It is suitable when you have multiple time series that can be considered replicates (e.g., bids in multiple eBay auctions for an iPhone). We implemented it in the interactive time series visualization tool called Time Searcher, which lets you visualize and interactively explore a large set of time series with attributes.
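
The numbers behind a River Plot can be sketched in a few lines of Python. This is only an illustration of the "continuous boxplot" idea and not Time Searcher's actual code: the function name `river_bands` and the toy data are made up, and I assume the replicate series have already been aligned on a common time grid (real bid data would need resampling first).

```python
from statistics import quantiles


def river_bands(series_list):
    """Compute per-time-point summary bands for replicate series.

    series_list: equal-length lists, one per replicate series, all
    sampled at the same time points (an assumption of this sketch).
    """
    bands = []
    # zip(*...) regroups the data so each tuple holds the values of
    # all series at one time point.
    for values_at_t in zip(*series_list):
        q1, med, q3 = quantiles(values_at_t, n=4, method="inclusive")
        bands.append({
            "min": min(values_at_t),
            "q1": q1,
            "median": med,
            "q3": q3,
            "max": max(values_at_t),
        })
    return bands


# Toy data: five replicate series observed at three time points.
replicates = [
    [1, 5, 10],
    [2, 6, 12],
    [3, 7, 14],
    [4, 8, 16],
    [5, 9, 30],
]
bands = river_bands(replicates)
for t, band in enumerate(bands):
    print(t, band)
```

Plotting the median as a line and shading the region between the quartiles (and, more lightly, between the minimum and maximum) gives the river-like shape.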

Time Searcher is a powerful tool: it allows the user to search for patterns, to filter, and also to forecast an ongoing time series based on its past values and a historical database of similar time series. But then the Starbucks effect of too many choices kicks in. Together with our colleague Paolo Buono from Università di Bari, Italy, we added the feature of "simultaneous previews": the user can choose multiple different parameter settings and view the resulting forecasts simultaneously. This was presented at the most recent InfoVis conference (Similarity-Based Forecasting with Simultaneous Previews: A River Plot Interface for Time Series Forecasting).
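
To give a flavor of what forecasting from similar series under several parameter settings could look like, here is a rough Python sketch. It is not the algorithm used in Time Searcher. As a stand-in, I assume a simple nearest-neighbor scheme: find the `k` historical series closest (in Euclidean distance) to the observed part of the ongoing series, and forecast with the pointwise median of their continuations. The names `forecast` and `simultaneous_previews`, the choice of `k` as the parameter being varied, and the toy data are all hypothetical.

```python
from statistics import median


def forecast(ongoing, history, k):
    """Forecast the rest of `ongoing` from its k most similar
    historical series (a nearest-neighbor stand-in, assumed here)."""
    n = len(ongoing)

    def distance(series):
        # Euclidean distance over the part observed so far.
        return sum((a - b) ** 2 for a, b in zip(ongoing, series[:n])) ** 0.5

    neighbors = sorted(history, key=distance)[:k]
    horizon = len(history[0]) - n
    # Pointwise median of the neighbors' continuations.
    return [median(s[n + h] for s in neighbors) for h in range(horizon)]


def simultaneous_previews(ongoing, history, k_values):
    """Run the forecast once per parameter setting so the results
    can be compared side by side."""
    return {k: forecast(ongoing, history, k) for k in k_values}


history = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 5, 7],
    [2, 4, 6, 8, 10],
    [5, 5, 5, 5, 5],
]
previews = simultaneous_previews([1, 2, 3], history, k_values=[1, 2, 3])
for k, values in previews.items():
    print(k, values)
```

Seeing the forecasts for several settings next to each other shows right away how sensitive the result is to the choice, which is the point of previewing them simultaneously.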