Wednesday, January 27, 2010

Drag-and-drop data mining software for the classroom

The drag-and-drop (D&D) concept in data mining tools is very neat. You "drag" icons (aka "nodes") that do different operations, and "connect" them to create a data mining process. This is also called "graphical programming". What I especially like about it is that it keeps the big picture in your mind rather than getting blinded by analysis details. The end product is also much easier to present and document.

There has been quite a bonanza lately with a few of the major D&D data mining software tools. Clementine (by SPSS - now IBM) is now called "IBM SPSS Modeler". Insightful Miner (by Insightful - now TIBCO) is now TIBCO Spotfire Miner. SAS Enterprise Miner remains SAS EM. And STATISTICA Data Miner by StatSoft also remains in the same hands.

There's a good comparison of these four tools (and two more non-d&d, menu driven tools: KXEN and XLMiner) on The 2006 article by Nisbet compares performance, pricing, and more.

Let me look at the choice of a D&D package from the perspective of a professor teaching a data mining course in a business school. My own considerations are: (1) easy and fast to learn, (2) easy for my students to access, (3) cheap enough for our school to purchase, and (4) reasonably priced for students after they graduate. It's also nice to have good support (when things break down or when you just can't figure something out). And some instructors also like additional teaching materials.

I've had the longest experience with SAS EM, but it has been a struggle. At first we had individual student licenses, where each student had to download the software from a set of CDs that I had to circulate between them. The size of the software choked too many computers. So we moved to the server version (that allows students to use the software through our portal), but that has been excruciatingly slow. The server version is also quite expensive to the school. The potential solution was to move to using the "SAS on demand" product, where the software is accessed online and sits on the SAS servers. SAS offers this through the SAS on demand for Academics (SODA) program and it is faster. However, as I ranted in another post, SODA currently can only load SAS datasets. And finally, SAS EM is extremely expensive outside of academia. The likelihood that my students would have access to it in their post-graduation job was therefore low.

I recently discovered Spotfire Miner (by TIBCO) and played around with it. Very fast and easy to learn, runs fast, and happily accepts a wide range of data file types. Cost for industry is currently $349/month. For use in the classroom it is free to both instructor and students! (as part of TIBCO's University Program).

I can't say much about IBM SPSS Modeler (previously known as Clementine) or StatSoft's STATISTICA Data Miner, except that after looking thoroughly through their websites I couldn't find any mention of pricing for academia or for industry. And I usually don't like the "request a quote" which tends to leave my mailbox full of promotional materials forever (probably the result of a data mining algorithm used for direct marketing!). Is the academic version identical to the full-blown version? is it a standalone installation or do you install it on a server?

For instructors who like extra materials: SAS offers a wealth of data mining teaching materials (you must contact them to receive the materials). StatSoft has a nice series of YouTube videos on different data mining topics and a brief PDF tutorial on data mining (they also have the awesome free Electronic Statistics Textbook which is a bit like an encyclopedia). I don't know of data mining teaching materials for the other packages (and couldn't find any on their websites).

It would be great to hear from other instructors and MBA students about their classroom (and post-graduation) experience with D&D software.

Wednesday, January 06, 2010

Creating map charts

With the growing amount of available geographical data, it is useful to be able to visualize one's data on top of a map. Visualizing numeric and/or categorical information on top of a map is called a map chart.

Two student teams in my Fall data mining class explored and displayed their data on map charts: one team compared economic, political, and well-being measures across different countries in the world. By linking a world map to their data, they could use color (hue and shading) to compare countries and geographical areas on those measures. Here's an example of two maps that they used. The top map uses shading to denote the average "well-being" score of a country (according to a 2004 Gallup poll), and the bottom map uses shading to denote the country's GDP. In both maps darker means higher.

Another team used a map to compare nursing homes in the US, in terms of quality of care scores. Their map below show the average quality of nursing home in each US State (darker means higher quality).These two sets of maps were created using TIBCO Spotfire. Following many requests, here is an explanation of how to create a map chart in Spotfire. Once you have your ordinary data file open, there are 3 steps to add the map component:
  1. Obtain the threesome of "shapefiles" needed to plot the map of interest: .shp file, .dbf file, and .shx file (see Wikipedia for an explanation of each)
  2. Open the shapefile in Spotfire (Open>New Visualization> Map Chart, then upload the shp file in Map Chart Properties> Data tab> Map data table)
  3. Link the map table to your data table using the Map Chart Properties> Data tab > Related data table for coloring (you will need a unique identifier linking your data table with the map table)
The tricky part is obtaining shapefiles. One good source with free files is Blue Marble Geographics (thanks to Dan Curtis for this tip!). For US state and county data, shapefiles can be obtained from the US Census Bureau website (thanks to Ben Meadema for this one!) I'm still in search for more sources (for Europe and Asia, for instance).

I thank Smith MBA students Dan Curtis, Erica Eisenhart, John Geraghty and Ben Meadema for their contributions to this post.