Wednesday, January 27, 2010

Drag-and-drop data mining software for the classroom

The drag-and-drop (D&D) concept in data mining tools is very neat. You "drag" icons (aka "nodes") that do different operations, and "connect" them to create a data mining process. This is also called "graphical programming". What I especially like about it is that it keeps the big picture in your mind rather than getting blinded by analysis details. The end product is also much easier to present and document.

There has been quite a bonanza lately with a few of the major D&D data mining software tools. Clementine (by SPSS - now IBM) is now called "IBM SPSS Modeler". Insightful Miner (by Insightful - now TIBCO) is now TIBCO Spotfire Miner. SAS Enterprise Miner remains SAS EM. And STATISTICA Data Miner by StatSoft also remains in the same hands.

There's a good comparison of these four tools (and two more non-D&D, menu-driven tools: KXEN and XLMiner) on InformationManagement.com. The 2006 article by Nisbet compares performance, pricing, and more.

Let me look at the choice of a D&D package from the perspective of a professor teaching a data mining course in a business school. My own considerations are: (1) easy and fast to learn, (2) easy for my students to access, (3) cheap enough for our school to purchase, and (4) reasonably priced for students after they graduate. It's also nice to have good support (when things break down or when you just can't figure something out). And some instructors also like additional teaching materials.

I've had the longest experience with SAS EM, but it has been a struggle. At first we had individual student licenses, where each student had to install the software from a set of CDs that I had to circulate among them. The size of the software choked too many computers. So we moved to the server version (which allows students to use the software through our portal), but that has been excruciatingly slow. The server version is also quite expensive for the school. The potential solution was to move to the "SAS on demand" product, where the software is accessed online and sits on the SAS servers. SAS offers this through the SAS on demand for Academics (SODA) program, and it is faster. However, as I ranted in another post, SODA can currently load only SAS datasets. And finally, SAS EM is extremely expensive outside of academia. The likelihood that my students would have access to it in their post-graduation jobs was therefore low.

I recently discovered Spotfire Miner (by TIBCO) and played around with it. It is quick and easy to learn, runs fast, and happily accepts a wide range of data file types. Cost for industry is currently $349/month. For use in the classroom it is free to both instructors and students (as part of TIBCO's University Program)!

I can't say much about IBM SPSS Modeler (previously known as Clementine) or StatSoft's STATISTICA Data Miner, except that after looking thoroughly through their websites I couldn't find any mention of pricing for academia or for industry. And I usually don't like the "request a quote" route, which tends to leave my mailbox full of promotional materials forever (probably the result of a data mining algorithm used for direct marketing!). Is the academic version identical to the full-blown version? Is it a standalone installation or do you install it on a server?

For instructors who like extra materials: SAS offers a wealth of data mining teaching materials (you must contact them to receive the materials). StatSoft has a nice series of YouTube videos on different data mining topics and a brief PDF tutorial on data mining (they also have the awesome free Electronic Statistics Textbook which is a bit like an encyclopedia). I don't know of data mining teaching materials for the other packages (and couldn't find any on their websites).

It would be great to hear from other instructors and MBA students about their classroom (and post-graduation) experience with D&D software.

3 comments:

Prem Swaroop said...

We used Spotfire Miner for the class project, alongside XLMiner. Here are some comments about my experience, after a bit of background.

Our data set had high class imbalance: the target class made up only 6.6% of the records. We created stratified partitions for the training, validation, and test sets so that each team member could work individually on the data mining tasks and then report classification performance on the same test set. We believed that, given the class imbalance, simple random samples could skew our results and make them incomparable, hence this decision.
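For readers who want to see what a stratified partition amounts to, here is a minimal Python sketch. It is not what either tool does internally. The 50/30/20 split, the function name, the single-letter partition codes, and the toy data (66 positives in 1,000 records, matching the 6.6% rate) are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_partition(labels, fractions=(0.5, 0.3, 0.2), seed=1):
    """Assign each record a partition code while keeping class proportions
    roughly equal across partitions.

    Codes (arbitrary labels chosen for this sketch):
    't' = training, 'v' = validation, 's' = test.
    The third fraction is implied: the test set receives whatever is left.
    """
    rng = random.Random(seed)

    # Group record positions by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    assignment = [None] * len(labels)
    for indices in by_class.values():
        # Shuffle within each class, then cut the shuffled list into three.
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_valid = round(len(indices) * fractions[1])
        for pos, idx in enumerate(indices):
            if pos < n_train:
                assignment[idx] = "t"
            elif pos < n_train + n_valid:
                assignment[idx] = "v"
            else:
                assignment[idx] = "s"
    return assignment

# Hypothetical toy data: 66 positives among 1,000 records (6.6%).
labels = [1] * 66 + [0] * 934
parts = stratified_partition(labels)

# Report the size and positive rate of each partition.
for code in "tvs":
    members = [lab for lab, p in zip(labels, parts) if p == code]
    print(code, len(members), round(sum(members) / len(members), 3))
```

Because the split is done within each class, every partition ends up with a positive rate close to the overall 6.6%, which is what makes results on a shared test set comparable across team members.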

XLMiner was used as instructional software for the class, and Spotfire Miner was encouraged in the class especially for tasks beyond what XLMiner could do.

* XLMiner's limitation on 10,000 rows for training was the key reason for us to explore Spotfire Miner, which does not have this limitation.

* Spotfire Miner's drag and drop indeed is a great tool to keep one focused on the big picture, at the same time simplifying many of the common tasks. It also makes for great presentation.

* Spotfire Miner does not allow the easy manipulations that are possible with XLMiner, especially when working on the coefficients directly. This could be required, e.g., when computing scores with asymmetric misclassification costs and oversampling. Of course it's not impossible, it just needs a bit more work.

* Discriminant Analysis in Spotfire Miner is sorely missed.

* Neither tool provides a way to specify a field that defines the partitions. In Spotfire Miner, we used the Filter Rows tool to work around this hurdle. Since Spotfire Miner makes it easy to bring in a different data set for prediction once a model is built, this was quite easy to execute.
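The scoring adjustment mentioned in the third bullet (asymmetric misclassification costs combined with oversampling) can be sketched in a few lines of Python. This is a generic textbook-style calculation, not a feature of either tool. It assumes the model outputs a probability for the rare class, that the rare class was oversampled to a known rate in the training data, and that the two error costs are known. All function names and numbers are hypothetical.

```python
def adjust_for_oversampling(p_sample, sample_rate, true_rate):
    """Rescale a probability estimated on oversampled training data back to
    the class mix of the original data.

    sample_rate: share of the rare class in the oversampled training set.
    true_rate:   share of the rare class in the original data.
    """
    pos = p_sample * true_rate / sample_rate
    neg = (1.0 - p_sample) * (1.0 - true_rate) / (1.0 - sample_rate)
    return pos / (pos + neg)

def cost_threshold(cost_fp, cost_fn):
    """Probability cutoff above which predicting the rare class has the
    lower expected cost (cost_fp: false positive, cost_fn: false negative)."""
    return cost_fp / (cost_fp + cost_fn)

def classify(p_sample, sample_rate, true_rate, cost_fp, cost_fn):
    """Return 1 (rare class) or 0 after both adjustments."""
    p = adjust_for_oversampling(p_sample, sample_rate, true_rate)
    return 1 if p > cost_threshold(cost_fp, cost_fn) else 0

# Hypothetical setting: rare class oversampled to 50% from 6.6%,
# and a missed positive costs ten times as much as a false alarm.
print(classify(0.5, sample_rate=0.5, true_rate=0.066, cost_fp=1, cost_fn=10))
print(classify(0.7, sample_rate=0.5, true_rate=0.066, cost_fp=1, cost_fn=10))
```

For a logistic regression, the oversampling correction multiplies the odds by a constant, so it is equivalent to shifting the intercept. That is why direct access to the coefficients makes this kind of adjustment convenient.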

Galit Shmueli said...

Prem - thanks for sharing your experience! Regarding the data partitioning, do you mean that you want to specify which records go into training, validation, and test? If so, then in XLMiner you can do that by adding a new column with 't', 'v' and 's' marking which set each record should belong to. Then choose this column under the "use partition variable" option in the data partitioning menu.
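If the partition assignments are prepared outside the tool, the column of 't', 'v' and 's' codes can be attached to the data before loading it. Below is a minimal Python sketch using only the standard library. The column name "partition", the helper names, and the three example records are made up for illustration, and nothing here calls XLMiner itself.

```python
import csv
import io

VALID_CODES = {"t", "v", "s"}  # training, validation, test

def add_partition_column(rows, codes, column="partition"):
    """Return copies of the records with one partition code attached to each."""
    if len(rows) != len(codes):
        raise ValueError("need exactly one partition code per record")
    unknown = set(codes) - VALID_CODES
    if unknown:
        raise ValueError(f"unknown partition codes: {sorted(unknown)}")
    return [dict(row, **{column: code}) for row, code in zip(rows, codes)]

def to_csv(rows):
    """Serialize the records as CSV text, ready to save and open in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical records and assignments.
records = [{"id": 1, "target": 0}, {"id": 2, "target": 1}, {"id": 3, "target": 0}]
print(to_csv(add_partition_column(records, ["t", "v", "s"])))
```

Saving the codes with the data means every team member loads exactly the same training, validation, and test sets.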