Tuesday, February 07, 2006

The "G" word

I use "G Shmueli" in my slides and in my email signature. This is not about that "G".

It usually surprises students when I say that most of the time in a data analysis should be spent on data exploration rather than modeling. Whether it is for the sake of statistical testing, prediction of new records, or finding a model that helps understand the data structure, the most useful tools are GRAPHS and summaries. Data visualization is so important that, in a sense, the models that follow will usually only confirm what we see.

A few points:
1. Good visualization tools are those that have high-quality graphics, are interactive and user-friendly, and can integrate many pieces of information. Excel is an example of a very low-level tool. Its graphs are usually very bad and require a lot of formatting (who needs a graph with a gray background and horizontal lines???). A terrific tool which I discovered a few years ago is Spotfire. It is an interactive visualization tool that allows the user to browse the data from multiple points of view, using color, shape, size and more to visualize multidimensional data. When I show this tool, the class usually hisses "wowwwwwww".

2. Even when we're talking about huge datasets, visualization is still very useful. Of course, if you try to create a scatterplot of income vs. age for a 1,000,000-customer database, your screen will be black and perhaps your computer will freeze. The way to go is to sample from the database. A good random sample will give an adequate picture. You can also take a few other samples to verify that what you are seeing is consistent.

3. When deciding which plots to create, think about the goal of the analysis. For example, if we are trying to classify customers as buyers/non-buyers, we'd be interested in plots that compare the buyers to the non-buyers.
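
To make points 2 and 3 concrete, here is a minimal Python sketch of the sampling idea. Everything in it is an assumption for illustration: the customer table is synthetic (made-up ages, incomes and buyer flags), and the function names draw_sample and summarize_by_group are hypothetical. It draws a few random samples from a 1,000,000-record table and compares buyers to non-buyers in each one, so you can see whether the picture is consistent from sample to sample.

```python
import random
import statistics


def draw_sample(records, n, seed):
    """Return a simple random sample of n records, without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, n)


def summarize_by_group(sample):
    """Mean income of buyers and of non-buyers in one sample."""
    incomes = {"buyer": [], "non-buyer": []}
    for age, income, is_buyer in sample:
        incomes["buyer" if is_buyer else "non-buyer"].append(income)
    return {group: statistics.mean(vals) for group, vals in incomes.items() if vals}


# Synthetic stand-in for a 1,000,000-customer database.
# The numbers are invented; only the workflow matters.
rng = random.Random(0)
customers = []
for _ in range(1_000_000):
    age = rng.randint(18, 80)
    is_buyer = rng.random() < 0.3
    base = 40_000 + 500 * age + (10_000 if is_buyer else 0)
    customers.append((age, rng.gauss(base, 8_000), is_buyer))

# Take a few independent samples and check that they tell the same story.
samples = [draw_sample(customers, 2_000, seed) for seed in (1, 2, 3)]
for i, sample in enumerate(samples, start=1):
    print(f"sample {i}:", summarize_by_group(sample))
```

Each 2,000-record sample is small enough to hand to whatever plotting tool you like for an income vs. age scatterplot, colored by buyer/non-buyer. If the group summaries (or the plots) differ noticeably between samples, the sample size is probably too small.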

1 comment:

Curious Cat said...

Tufte has excellent books on graphic display, including his latest, Beautiful Evidence.