Friday, March 31, 2006

Data mining courses in MBA programs

Data mining and advanced data analysis courses are becoming more and more popular in B-schools. I asked colleagues about such electives in their universities, and also did some web searching. It looks like the courses vary in topics and flavor, often depending on the instructor's field of expertise (statistics, operations research, information systems, marketing, machine learning, etc.). But all courses typically revolve around real business applications.

Here's an initial list that I've compiled (alphabetical by school name). I hope that others who teach or are taking such courses will add to the list. Links to syllabuses are also welcome. This can be a great resource for instructors (sharing information, material, experiences, etc.) as well as for MBA students!

Business Schools

Fuqua School of Business, Duke: Data Mining (MGRECON 491)

Indian School of Business (ISB): Business Intelligence Using Data Mining (update by R Bapna)

Kelley School of Business, Indiana University: Data Mining (K513) Syllabus

Lundquist College of Business, U of Oregon: Information Analysis for Managerial Decisions (DSC 433/533) syllabus

McCombs School of Business, UTexas at Austin: Data Mining (MIS 382N.9)

Robert H Smith School of Business, U of MD College Park: Data Analysis for Decision Makers (BUDT733) syllabus

Sloan School of Management, MIT: Data Mining (15.062)

Stern School of Business, NYU: Data Mining and Knowledge Systems (B20.3336)

Tepper School of Business, CMU: Mining Data for Decision Making (45-863)

Wharton School, UPenn: Decision Support Systems (OPIM-410-/672)

Online courses:

Cardean Online University: Data Mining (EIMA 714) online courses: Intro to Data Mining and Data Mining 2

Thursday, March 30, 2006

A cool book

One of my favorite statistics books, one that just makes you want to learn data analysis, is Freakonomics: A Rogue Economist Explores the Hidden Side of Everything by Steven Levitt - an economist from the University of Chicago who was recently awarded the John Bates Clark Medal - and Stephen Dubner, an author and writer for the NYT and The New Yorker. Together they created a rare product: a data analysis book to take to bed!

The book describes several somewhat-wild studies in an attempt to answer questions that your kids might have asked you: "Why do drug dealers still live with their moms?" or "What do schoolteachers and Sumo wrestlers have in common?". The beauty of these studies is the creativity in the question of interest, sometimes the data collection or experimental design, structuring the analysis, and the interpretation of the results. It shows how careful modeling can lead to very interesting (if somewhat untraditional) insights.

It is an easy, fun read. And for those who know something about regression models, that is the main tool used. Recommending this to students at the start of the data analysis class usually brings a few back with starry eyes.

Turns out that the book is actually used in a variety of university courses and there is even a free study guide!

And of course, if you get totally hooked, there is a freakonomics blog.

Tuesday, March 28, 2006

Stairway to Heaven

"There's a lady who's sure all that glitters is gold
And she's buying a stairway to heaven
And when she gets there she knows if the stores are closed
With a word she can get what she came for"

Led Zeppelin were on to something -- the Mar-27 BusinessWeek cover story includes a report on Best Buy, which is trying to escape commodity hell via customer segmentation: "The company has divided its customers into five distinct demographic groups and is doing extensive market research to figure out how to serve them better."

On their website, a press release gives further details:
"As part of its customer centricity initiative, Best Buy identified five initial customer segments that it believes represent significant new growth opportunities or include some of the company's most profitable customers today. The segments include:
  1. The affluent professional who wants the best technology and entertainment experience and who demands excellent service.
  2. The focused, active, younger male customer who wants the latest technology and entertainment.
  3. The family man who wants technology that improves his life - the practical adopter of technology and entertainment.
  4. The busy suburban mom who wants to enrich her children's lives with technology and entertainment.
  5. The small business customer who can use Best Buy's product solutions and services to enhance the profitability of his or her business. "

The million $ question is: how did they arrive at these segments? What does "initial customer segments" mean? Why "believe"? Why exactly 5? Is there a cluster analysis going on behind the scenes, or is it time for one?
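
To make the "why exactly 5?" question concrete, here is a minimal sketch of what a cluster analysis could look like. Everything in it is an assumption for illustration: the customer records are synthetic, the two features are made up, and the hand-rolled k-means is just one of many possible clustering methods (nothing here reflects what Best Buy actually did). The idea is to run the clustering for several values of k and see where the within-cluster sum of squares stops dropping sharply.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means (Lloyd's algorithm); returns labels, centroids and WSS."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen records
    for _ in range(n_iter):
        # Assignment step: attach each record to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
    wss = sum(sq_dist(p, centroids[j]) for p, j in zip(points, labels))
    return labels, centroids, wss

# Hypothetical, synthetic "customers": two made-up numeric features per
# record, generated around three arbitrary centers.
rng = random.Random(42)
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
customers = [
    (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
    for cx, cy in centers
    for _ in range(30)
]

# Within-cluster sum of squares (WSS) for several candidate values of k.
wss_by_k = {k: kmeans(customers, k)[2] for k in range(1, 7)}
for k, wss in wss_by_k.items():
    print(f"k={k}: within-cluster sum of squares = {wss:.1f}")
```

With real customer data you would first put the variables on comparable scales, and the "right" number of segments would still be a judgment call that mixes this kind of output with business sense.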

Monday, March 20, 2006

Data mining for prediction vs. explanation

A colleague of mine and I have an ongoing discussion about what data mining is about. In particular, he claims that data mining focuses on predictive tasks, whereas I see the use of data mining methods for either prediction or explanation.

Predictive vs. Explanatory Tasks
The distinction between predictive and explanatory tasks is not always easy: Of course, in both cases the goal is "future actionable results". In general, an explanatory goal (also called characterizing or profiling) is when we're interested in learning about factors that affect a certain outcome. In contrast, a predictive goal (also called classification, when we have a categorical response) is when we care less about "insights" and more about accurate predictions of future observations.

In some cases it is very easy to identify an explanatory task, because it simply doesn't make any sense to predict the outcome. For example, we might be looking at performance measures of male and female CEOs, in an attempt to understand the differences (in terms of performance) between the genders. In this case, we would most likely not try to predict the gender of a new observation based on their performance measures...

In most cases, however, it is harder to decide what the task is, especially because problem descriptions tend to be ambiguous. People will say "I'd like you to use these data to see if we can understand the factors that lead to defaulting on a loan, so that we can predict for new applicants their chance to default". Since it is crucial for the analysis to know which type of task (explanatory vs. predictive) is required, I believe the only way out is to try and squeeze it out of the domain person with questions such as "what do you plan to do with the analysis results?" or "what if I found that..."

Here's an example: A marketing manager at a cable company hires your services to analyze their customer data in order "to know the household-characteristics that distinguish subscribers to premium channels (such as HBO) from non-subscribers." Is this a predictive goal? explanatory? It depends! Here are possible scenarios of what they will use the analysis for:
1. To market the premium channel plan in a new market only to people who are more likely to subscribe (predictive)
2. To re-negotiate the cable company's contract with HBO based on what they find (explanatory) -- this scenario is courtesy of a current MBA student

How will the analysis differ?
The analysis process will take different avenues depending on whether the goal is predictive or explanatory. This includes which performance measures are devised and used (e.g., to partition or not to partition? use an R-squared or a MAPE?), which types of methods are employed (a k-nearest neighbor won't have much explanatory power), etc.
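
Here is a rough sketch of how the two avenues might differ in practice, using a simple linear regression on synthetic data. The variables (household income and monthly cable bill), the coefficients used to generate the data, and the 60/40 split are all made-up assumptions. In the explanatory version we fit on all the records and look at the slope and the R-squared; in the predictive version we partition the data, fit on the training part only, and measure MAPE on the holdout part.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def r_squared(xs, ys, a, b):
    """Share of the variation in y captured by the fitted line."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

def mape(xs, ys, a, b):
    """Mean absolute percentage error of the line's predictions."""
    return 100 * sum(abs((y - (a + b * x)) / y) for x, y in zip(xs, ys)) / len(ys)

# Synthetic data (assumption): income in $1000s, monthly cable bill in $.
rng = random.Random(1)
data = []
for _ in range(200):
    income = rng.uniform(30, 120)
    bill = 20 + 0.5 * income + rng.gauss(0, 5)
    data.append((income, bill))

# Explanatory avenue: no partitioning, interest is in the slope and the fit.
xs, ys = zip(*data)
a_full, b_full = fit_line(xs, ys)
r2_full = r_squared(xs, ys, a_full, b_full)
print(f"explanatory: slope = {b_full:.3f}, R-squared = {r2_full:.3f}")

# Predictive avenue: partition, fit on training data, evaluate on holdout.
shuffled = data[:]
rng.shuffle(shuffled)
cut = int(0.6 * len(shuffled))
train, holdout = shuffled[:cut], shuffled[cut:]
a_tr, b_tr = fit_line(*zip(*train))
holdout_mape = mape(*zip(*holdout), a_tr, b_tr)
print(f"predictive: holdout MAPE = {holdout_mape:.1f}%")
```

The model is the same in both cases; what changes is which data it is fit to and which number we care about at the end.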

Thursday, March 09, 2006

Impact of weather on the economy

The March 6 issue of BusinessWeek reports an increase in retail sales and housing starts this January (What Got The Economy's Bounce Going). The hypothesis is that the cause is the relatively warm weather (an average of 39.6F). So how can this be tested? BW mentions a study by James O'Sullivan, an economist with UBS, that tries to quantify the impact of warm January weather on retail sales and housing starts. They describe it as follows:

He looked at the historical relationship between data from several economic reports and deviations from the average December and January temperatures. December temperatures were used to capture any weather-related distortions that could carry over into January.

This doesn't tell us much about the model. The clue is in the reported results:

Based on these past relationships, the above-average January temperatures provided a 1.4-percentage-point boost to retail sales. Housing starts got a weather-related increase of approximately 200,000 units at an annualized rate, while the balmy temperatures may have accounted for all of the 0.7% rise in manufacturing output.

Here's my guess: this is a set of regression models! For retail sales, it might look like this:

log(retail sales) = a + b*(deviation from average Dec-or-Jan temp) + noise

The model for housing starts:

housing starts = a + b*(deviation from average Dec-or-Jan temp) + noise

The model for manufacturing output:

log(manufacturing output) = a + b*(deviation from average Dec-or-Jan temp) + noise

How do we know whether there is a log on the left hand side? Well, if the results are reported in percentages (a unit change in X is associated with a percentage change in Y), then there is most likely a log. In contrast, if the results are reported in the units of Y (for instance, units of housing starts), then there is no log.
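
A small numeric sketch of that rule, ignoring the noise term. The coefficients a and b below are hypothetical values picked only for illustration; they are not estimates from the UBS study. With a log on the left hand side, a one-unit increase in X multiplies Y by exp(b), which is roughly a 100*b percent change when b is small; without a log, it adds b units to Y.

```python
import math

a, b = 5.0, 0.01  # hypothetical coefficients, not taken from any real study

def log_model(x):
    """Y implied by log(Y) = a + b*x (noise ignored)."""
    return math.exp(a + b * x)

def level_model(x):
    """Y implied by Y = a + b*x (noise ignored)."""
    return a + b * x

x = 3.0  # any starting value of X gives the same changes below

# Log model: a unit change in X is associated with a percentage change in Y.
pct_change = 100 * (log_model(x + 1) / log_model(x) - 1)

# Level model: a unit change in X is associated with a change in units of Y.
unit_change = level_model(x + 1) - level_model(x)

print(f"log model:   +1 in X -> {pct_change:.3f}% change in Y (100*b = {100 * b})")
print(f"level model: +1 in X -> {unit_change:.3f} units of Y (b = {b})")
```
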
A final interesting note made in the article is that "government's seasonal adjustment process, which tries to account for typical seasonal variation, can go awry when patterns are atypical". I'll describe how these seasonal adjustments are done in a future post.

Thursday, March 02, 2006

What jobs does a good data mining course land you?

I always love to hear from former students about the great jobs that they landed after graduating from the MBA program. It is especially great (and I must admit - surprising) when the data mining course they took at the Smith School was instrumental in getting that job.

Here's one of the more unusual stories (quoting from a former student's email):

"Just a couple weeks before completing the requirements of the program, thanks in no small part to what I learned [in the data mining course] ... I have accepted an offer for a dream job in the entertainment industry. Based on the analytical skills I cultivated through lessons on data modeling and analysis, and the analytical tools and techniques ... I have been selected by my company for a newly created position designed to help the nation’s largest concert promoter figure out how to optimize its booking strategy for major national tours. It’s about supplementing the intuition of industry experts and veterans with analytical insight, and introducing more method to the madness that characterizes the business side of Rock and Roll. Who said math and statistics isn’t cool?"

Any others out there with a cool story? Please post!