Wednesday, July 27, 2011

Analytics: You want to be in Asia

Business Intelligence and Data Mining have become hot buzzwords in the West. Using Google Insights for Search to "see what the world is searching for" (see image below), we can see that the popularity of these two terms seems to have stabilized (if you expand the search to 2007 or earlier, you will see the earlier peak and also that Data Mining was hotter for a while). Click on the image to get to the actual result, with which you can interact directly. There are two very interesting insights from this search result:
  1. Looking at the "Regional Interest" for these terms, we see that the #1 country searching for these terms is India! Hong Kong and Singapore are also in the top 5. A surge of interest in Asia!
  2. Adding two similar terms that have the term Analytics, namely Business Analytics and Data Analytics, unveils a growing interest in Analytics (whereas the two non-analytics terms have stabilized after their peak).
What to make of this? First, it means Analytics is hot. Business Analytics and Data Analytics encompass methods for analyzing data that add value to a business or any other organization. Analytics includes a wide range of data analysis methods, from visual analytics to descriptive and explanatory modeling, and predictive analytics. From statistical modeling, to interactive visualization (like the one shown here!), to machine-learning algorithms and more. Companies and organizations are hungry for methods that can turn their huge and growing amounts of data into actionable knowledge. And the hunger is most pressing in Asia.
Click on the image to refresh the Google Insights for Search result (in a new window).

Thursday, July 14, 2011

Designing an experiment on a spatial network: To Explain or To Predict?

Image from http://www.slews.de
Spatial data are inherently important in environmental applications. An example is collecting data from air or water quality sensors. Such data collection mechanisms introduce dependence in the collected data due to their spatial proximity/distance. This dependence must be taken into account not only in the data analysis stage (and there is a good statistical literature on spatial data analysis methods), but also in the design of experiments stage. Two examples of design questions are where to locate the sensors and how many sensors are needed.

Where does explain vs. predict come into the picture? An interesting 2006 article by Dale Zimmerman called "Optimal network design for spatial prediction, covariance parameter estimation, and empirical prediction" tells the following story:
"...criteria for network design that emphasize the utility of the network for prediction (kriging) of unobserved responses assuming known spatial covariance parameters are contrasted with criteria that emphasize the estimation of the covariance parameters themselves. It is shown, via a series of related examples, that these two main design objectives are largely antithetical and thus lead to quite different “optimal” designs" 
(Here is the freely available technical report).

Monday, June 20, 2011

Got Data?!

The American Statistical Association's store used to sell cool T-shirts with the old-time beggar-statistician question "Got Data?" Today it is much easier to find data, thanks to the Internet. Dozens of student teams taking my data mining course have been able to find data from various sources on the Internet for their team projects. Yet, I often receive queries from colleagues in search of data for their students' projects. This is especially true for short courses, where students don't have sufficient time to search and gather data (which is highly educational in itself!).

One solution that I often offer is data from data mining competitions. KDD Cup is a classic, but there are lots of other data mining competitions that make huge amounts of real or realistic data available: past INFORMS Data Mining Contests (2008, 2009, 2010), ENBIS Challenges, and more. Here's one new competition to add to the list:

The European Network for Business and Industrial Statistics (ENBIS) announced the 2011 Challenge (in collaboration with SAS JMP). The title is "Maximising Click Through Rates on Banner Adverts: Predictive Modeling in the On Line World". It's a bit complicated to find the full problem description and data on the ENBIS website (you'll find yourself clicking-through endless "more" buttons - hopefully these are not data collected for the challenge!), so I linked them up.

It's time for T-shirts saying "Got Data! Want Knowledge?"

Friday, June 17, 2011

Scatter plots for large samples

While huge datasets have become ubiquitous in fields such as genomics, large datasets are now also beginning to infiltrate research in the social sciences. Data from eCommerce sites, online dating sites, etc. are now collected as part of research in information systems, marketing and related fields. We can now find social science research papers with hundreds of thousands of observations and more.

A common type of research question in such studies is about the relationship between two variables. For example, how does the final price of an online auction relate to the seller's feedback rating? A classic exploratory tool for examining such questions (before delving into formal data analysis) is the scatter plot. In small sample studies, scatter plots are used for exploring relationships and detecting outliers.

Image from http://prsdstudio.com/ 
With large samples, however, the scatter plot runs into a few problems. With lots of observations, there is likely to be too much overlap between markers on the scatter plot, even to the point of insufficient pixels to display all the points.

Here are some large-sample strategies to make scatter plots useful:

  1. Aggregation: display groups of observations in a certain area on the plot as a single marker. Size or color can denote the number of aggregated observations.
  2. Small-multiples: split the data into multiple scatter plots by breaking down the data into (meaningful) subsets. Breaking down the data by geographical location is one example. Make sure to use the same axis scales on all plots - this will be done automatically if your software allows "trellising".
  3. Sample: draw smaller random samples from the large dataset and plot them in multiple scatter plots (again, keep the axis scales identical on all plots).
  4. Zoom-in: examine particular areas of the scatter plot by zooming in.
Finally, with large datasets it is useful to consider charts that are based on aggregation such as histograms and box plots. For more on visualization, see the Visualization chapter in Data Mining for Business Intelligence.
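
The aggregation and sampling strategies above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names, the bin sizes, and the simulated auction data (seller rating vs. final price) are all made up for the example, and a real plotting package would then draw one marker per cell.

```python
import random
from collections import Counter


def aggregate_points(xs, ys, x_bin, y_bin):
    """Strategy 1: count observations in each x_bin-by-y_bin rectangle.

    Returns a dict mapping the centre (x, y) of each non-empty cell to
    the number of observations that fall in it. The count can then be
    shown as marker size or color.
    """
    counts = Counter(
        (int(x // x_bin), int(y // y_bin)) for x, y in zip(xs, ys)
    )
    return {
        ((i + 0.5) * x_bin, (j + 0.5) * y_bin): n
        for (i, j), n in counts.items()
    }


def sample_points(xs, ys, k, seed=0):
    """Strategy 3: draw a simple random sample of k observations."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(xs)), k)
    return [xs[i] for i in idx], [ys[i] for i in idx]


# Hypothetical data: 100,000 simulated auctions (not real data).
rng = random.Random(1)
ratings = [rng.uniform(0, 1000) for _ in range(100_000)]
prices = [50 + 0.1 * r + rng.gauss(0, 20) for r in ratings]

cells = aggregate_points(ratings, prices, x_bin=50, y_bin=10)
sub_x, sub_y = sample_points(ratings, prices, k=1000)
print(len(cells), "aggregated markers instead of", len(ratings), "points")
```

Calling sample_points with several different seeds gives the multiple small-sample plots mentioned in strategy 3; remember to fix the axis scales across them.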

Friday, May 20, 2011

Nice April Fool's Day prank

The recent issue of the Journal of Computational and Graphical Statistics published a short article by Columbia Univ Prof Andrew Gelman (I believe he is the most active statistician-blogger) called "Why tables are really much better than graphs" based on his April 1, 2009 blog post (note the difference in publishing speed using blogs and refereed journals!). The last parts made me laugh hysterically - so let me share them:

About creating and reporting "good" tables:
It's also helpful in a table to have a minimum of four significant digits. A good choice is often to use the default provided by whatever software you have used to fit the model. Software designers have chosen their defaults for a good reason, and I'd go with that. Unnecessary rounding is risky; who knows what information might be lost in the foolish pursuit of a "clean"-looking table?
About creating and reporting "good" graphs:
If you must make a graph, try only to graph unadorned raw data, so that you are not implying you have anything you do not. And I recommend using Excel, which has some really nice defaults as well as options such as those 3-D colored bar charts. If you are going to have a graph, you might as well make it pretty. I recommend a separate color for each bar—and if you want to throw in a line as well, use a separate y-axis on the right side of the graph.
Note: please do not follow these instructions for creating tables and graphs! Remember, this is an April Fool's Day prank!
From Stephen Few's examples of bad visualizations (http://perceptualedge.com/examples.php)

Monday, April 25, 2011

Google Spreadsheets for teaching probability?

In business schools it is common to teach statistics courses using Microsoft Excel, due to its wide accessibility and the familiarity of business students with the software. There is a large debate regarding this practice, but at this point the reality is clear: the figure that I am familiar with is about 50% of basic stat courses in b-schools use Excel and 50% use statistical software such as Minitab or JMP.

Another trend is moving from offline software to "cloud computing" -- software such as www.statcrunch.com offers basic stat functions in an online, collaborative, social-networky style.

Following the popularity of spreadsheet software and the cloud trend, I asked myself whether the free Google Spreadsheets can actually do the job. This is part of my endeavors to find free (or at least widely accessible) software for teaching basic concepts. While Google Spreadsheets does have quite an extensive function list, I discovered that its current computing capability is very limited. For example, computing binomial probabilities using the function BINOMDIST is limited to a sample size of about 130 (I did report this problem). Similarly, HYPGEOMDIST results in overflow errors for reasonably small sample and population sizes.


From the old days when we used to compute binomial probabilities manually, I am guessing that whoever programmed these functions forgot to use the tricks that avoid computing high factorials in n-choose-k type calculations...
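
One such trick, sketched here in Python, is to work on the log scale: compute the log of n-choose-k through the log-gamma function, add the log-probability terms, and exponentiate only at the end. I have no idea how Google Spreadsheets implements these functions internally, so this is simply an illustration of the idea, and the function names are my own.

```python
import math


def log_choose(n, k):
    """log of n-choose-k via log-gamma, so no large factorial is formed."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed on the log scale."""
    if k < 0 or k > n:
        return 0.0
    if p == 0.0:
        return 1.0 if k == 0 else 0.0
    if p == 1.0:
        return 1.0 if k == n else 0.0
    log_prob = log_choose(n, k) + k * math.log(p) + (n - k) * math.log1p(-p)
    return math.exp(log_prob)


def hypergeom_pmf(k, population, successes, draws):
    """P(X = k) successes in `draws` draws without replacement."""
    lowest = max(0, draws - (population - successes))
    highest = min(draws, successes)
    if k < lowest or k > highest:
        return 0.0
    log_prob = (
        log_choose(successes, k)
        + log_choose(population - successes, draws - k)
        - log_choose(population, draws)
    )
    return math.exp(log_prob)


# A sample size far beyond the limit of about 130 mentioned above.
print(binom_pmf(300, 1000, 0.3))
print(hypergeom_pmf(40, 5000, 400, 500))
```

Because only logarithms are added and subtracted, the intermediate values stay small even when n! itself would overflow.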


Saturday, April 16, 2011

Moving Average chart in Excel: what is plotted?

In my recent book Practical Time Series Forecasting: A Hands-On Guide, I included an example of using Microsoft Excel's moving average plot to suppress monthly seasonality. This is done by creating a line plot of the series over time and then Add Trendline > Moving Average (see my post about suppressing seasonality). The purpose of adding the moving average trendline to a time plot is to better see a trend in the data, by suppressing seasonality.

A moving average with window width w means averaging across each set of w consecutive values. For visualizing a time series, we typically use a centered moving average with w = season. In a centered moving average, the value of the moving average at time t (MA_t) is computed by centering the window around time t and averaging across the w values within the window. For example, if we have daily data and we suspect a day-of-week effect, we can suppress it by computing a centered moving average with w=7 and then plotting the MA line.

An observant participant in my online course Forecasting discovered that Excel's moving average does not produce what we'd expect: Instead of averaging over a window that is centered around a time period of interest, it simply takes the average of the last w months (called a "trailing moving average"). While trailing moving averages are useful for forecasting, they are inferior for visualization, especially when the series has a trend. The reason is that the trailing moving average "lags behind". Look at the figure below, and you can see the difference between Excel's trailing moving average (black) and a centered moving average (red).
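
To make the difference concrete, here is a small Python sketch of the two definitions. It is my own illustration, not Excel's code: the function names and the toy daily series (a linear trend plus a day-of-week effect) are made up, and the centered version handles only odd window widths to keep things short.

```python
def trailing_ma(series, w):
    """Trailing MA: average of the value at time t and the w-1 before it."""
    out = [None] * len(series)
    for t in range(w - 1, len(series)):
        out[t] = sum(series[t - w + 1 : t + 1]) / w
    return out


def centered_ma(series, w):
    """Centered MA: average of the w values centered on time t (odd w only)."""
    if w % 2 == 0:
        raise ValueError("this sketch handles odd window widths only")
    half = w // 2
    out = [None] * len(series)
    for t in range(half, len(series) - half):
        out[t] = sum(series[t - half : t + half + 1]) / w
    return out


# Hypothetical daily series: trend of 2 units per day plus a
# day-of-week effect that sums to zero over each week.
day_effect = [3, -1, 0, 2, -4, 1, -1]
series = [10 + 2 * t + day_effect[t % 7] for t in range(28)]

centered = centered_ma(series, 7)
trailing = trailing_ma(series, 7)

for t in range(6, 12):
    print(t, series[t], centered[t], trailing[t])
```

In this toy series the centered moving average recovers the trend line 10 + 2t, while the trailing moving average at time t equals the centered value from three days earlier, which is the "lagging behind" described above.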



The fact that Excel produces a trailing moving average in the Trendline menu is quite disturbing and misleading. Even more disturbing is the documentation, which incorrectly describes the trailing MA that is produced:
"If Period is set to 2, for example, then the average of the first two data points is used as the first point in the moving average trendline. The average of the second and third data points is used as the second point in the trendline, and so on."
For more on moving averages, see here: