Tuesday, October 24, 2006

Webcasts on data mining

Moshe Cohen, a current MBA student in my class, pointed out an interesting set of webcasts on data mining called "Best Practices in Data Mining", by the Insightful Corporation. Part I (now archived) describes a few scenarios where data mining is useful in a business context. They show some examples of questions of interest, datasets that are used in such applications, and the analysis process. Of course, their software InsightfulMiner is also showcased. I especially liked the emphasis on data visualization, and the SAS-EM-like "working" chart. They also discuss data preprocessing, with some detail on missing values and outliers. Further topics include a bit on clustering, classification trees, and some evaluation methods.

Part II, upcoming on Nov 16, is supposed to cover additional topics. Here is the official description:
Part 2 of the series will explore further data modeling methods such as
survival methods. In addition we will discuss how the results of data
mining can be leveraged to understand underlying customer behavior, for
example. Such knowledge is then valuable in choosing appropriate business
actions, such as designing an optimal marketing campaign.

Saturday, October 14, 2006

It's competition season: and now Netflix

An exciting new dataset is out there for us data aficionados! Netflix, the huge movie renter, announced a $1 million prize for the winner of a competition to improve upon its Cinematch algorithm for predicting movie ratings. The competition started at the beginning of the month and has already created a lot of buzz. The company released a huge training set that includes millions of movie ratings. Competing teams can use this dataset to come up with prediction algorithms, and then submit predictions for a test set.

The training dataset contains more than 100 million ratings from a random sample of 480,000 (unidentifiable) users on 18,000 movies.

The $1 million grand prize goes to the team that can reduce the RMSE (root mean squared error) of Cinematch by 10% on the test set. There are also modest $50,000 "progress prizes".
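To make the 10% target concrete, here is a minimal sketch of how RMSE and the prize threshold could be computed. The ratings and predictions are made-up numbers for illustration only, and the function names (rmse, prize_threshold) are my own; this is not Netflix's scoring code.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def prize_threshold(baseline_rmse, improvement=0.10):
    """RMSE a competitor must reach to improve on the baseline by `improvement`."""
    return baseline_rmse * (1 - improvement)


# Hypothetical ratings on a 1-5 scale (illustration only, not Netflix data).
actual = [4, 3, 5, 2, 1]
baseline_pred = [3.5, 3.5, 4.0, 3.0, 2.0]   # stand-in for the baseline algorithm
candidate_pred = [3.9, 3.1, 4.8, 2.3, 1.4]  # stand-in for a competing algorithm

baseline = rmse(actual, baseline_pred)
target = prize_threshold(baseline)
candidate = rmse(actual, candidate_pred)
qualifies = candidate <= target

print(f"baseline RMSE:  {baseline:.4f}")
print(f"target RMSE:    {target:.4f}")
print(f"candidate RMSE: {candidate:.4f}")
print(f"meets 10% improvement: {qualifies}")
```

Note that because RMSE squares the errors before averaging, a few badly mispredicted ratings hurt much more than many small misses.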

Putting aside the monetary incentive and the goal of beating Cinematch on the test set, this is a great dataset for research. And Netflix has been generous enough to allow use of the data for research purposes.

Another fun aspect is reading the postings on the forum! The various opinions, questions, and answers are a feast for anyone interested in online communities.

Saturday, October 07, 2006

Nation's favorite professors - in statistics???

When introductions are made, and the question comes "so what do you do?" I sheepishly reply "I teach statistics at the University of Maryland's business school". The two most popular reactions are
(1) a terrified look -- "statistics? oh, I had to take that in undergrad!", or
(2) a dazed look -- "Wow!" [which really means, "I didn't understand any of it, so how did you figure it out?"]

But sometimes I do come across people who get all excited and say they took a statistics course and LOVED it. And very often it is attributable to the professor. Indeed, from my own school experience I found that statistics can be taught in extremely different ways: boring, scary, and vague, or exciting, useful, and challenging!

Our very own Professor Erich Studer-Ellis, who teaches the core statistics undergraduate classes at the business school, has just been named by BusinessWeek as one of the nation's favorite undergraduate business school professors. Yes -- this is possible! And this is in spite of the huge class sizes that he teaches (typically in the hundreds). In fact, among our statistics professors, a majority have received teaching awards. But more importantly, when I meet their students, their eyes glow when they hear "statistics".

The bottom line is, therefore, don't create an impression of statistics based on a sample of n=1 course, unless that impression happens to be positive.

Friday, October 06, 2006

Time Series Forecasting Competition

Forecasting transportation demand is important for multiple goals such as staffing, planning, and inventory control. The public transportation system in Santiago de Chile is currently going through a major effort of reconstruction (if you read Spanish, you can find more at www.transantiago.cl).

The 2006 Business Intelligence Competition (BI CUP 2006) focuses on forecasting demand for public transportation. They provide a training set of a time series of passengers arriving at a terminal, and the competitors must come up with a method for forecasting the test set, which comprises a few future days.

Although this problem is a great example of data mining for business intelligence, the winner is the model that generates the smallest mean absolute error (MAE), also known as mean absolute deviation (MAD). In other words, the closeness of the forecasts to the actual values in the test set is the only criterion for winning. This setup makes this more of a pure data mining problem, and much less one of business intelligence. Clearly, the most accurate model might be completely impractical. For example, it might be very computationally intensive, when in practice the model is supposed to produce real-time forecasts for many different series simultaneously. Or, there might be different costs associated with forecast errors at different times of the day (or different days of the week). These types of considerations, when included in the modeling phase, turn the data mining task into a business-related one.
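Here is a small sketch of the difference between the competition's criterion and a cost-aware one. It computes the plain MAE and then a weighted MAE in which errors at some time periods count more. The passenger counts, forecasts, and weights are all invented for illustration (they are not the BI CUP data), and the weighting scheme is just one assumed way of encoding business costs.

```python
def mae(actual, forecast):
    """Mean absolute error: the average of |actual - forecast|."""
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def weighted_mae(actual, forecast, weights):
    """MAE where each period's absolute error is scaled by a cost weight."""
    if not (len(actual) == len(forecast) == len(weights)) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_errors = (w * abs(a - f) for a, f, w in zip(actual, forecast, weights))
    return sum(weighted_errors) / total_weight


# Hypothetical passenger arrivals for four time periods (illustration only).
actual = [120, 300, 180, 90]
forecast = [110, 270, 185, 95]

# Assumed cost weights: an error in the second (rush-hour) period is
# treated as three times as costly as an error in any other period.
weights = [1, 3, 1, 1]

plain = mae(actual, forecast)
cost_aware = weighted_mae(actual, forecast, weights)

print(f"MAE:          {plain:.2f}")
print(f"weighted MAE: {cost_aware:.2f}")
```

Two models with the same plain MAE can rank very differently under the weighted version, which is exactly why the choice of criterion is a business decision and not only a statistical one.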

The competition organizers promised to provide more of the business details after the competition ends, at the end of October.

Five teams of MBAs in my data mining class are now working hard on this forecasting problem. A few might decide to formally participate in the competition.

Good luck to the competitors!