Last week Lori Rothenberg from SAS Higher Education visited our MBA class. She gave a 1-hour tutorial on SAS Enterprise Miner, which is the data mining software package by SAS. This is a pretty powerful tool, especially when dealing with large datasets. One of the nicest features of SAS EM is the "workspace", which displays a diagram of the entire modeling process, from data specification, through data manipulation, modeling, evaluation, and scoring.
Aside from software training, Lori described several real data mining projects and how they added value to the businesses involved. This reinforces the course's emphasis on the power of data mining in the business intelligence context.
The SAS Higher Education group offers tutorials, workshops and other resources (e.g., summer programs) to instructors - most of these are free! We've had a great experience with them.
Thursday, November 30, 2006
Saturday, November 11, 2006
p-values do bite
I've discussed the uselessness of p-values in very large samples, where even minuscule effects show up as statistically significant. This is known as the divergence between practical significance and statistical significance.
An interesting article in the most recent issue of The American Statistician describes another dangerous pitfall in using p-values. In their article The Difference Between "Significant" and "Not Significant" is not Itself Statistically Significant, Andrew Gelman (a serious blogger himself!) and Hal Stern warn that the comparison of p-values to one another for the purpose of discerning a difference between the corresponding effects (or parameters) is erroneous.
Consider, for example, fitting the following regression model to data:
Sales = beta0 + beta1 TVAdvertising + beta2 WebAdvertising
(say, Sales are in thousands of $, and advertising is in $. )
Let's assume that we get the following coefficient table:
Predictor    Coef (std err)    p-value
TVAds        3 (1)             0.003
WebAds       1 (1)             0.317
We would reach the conclusion (at, say, a 5% significance level) that TVAds contribute significantly to sales revenue (after accounting for WebAds), and that WebAds do not contribute significantly to sales (after accounting for TVAds). Could we therefore conclude from these two opposite significance conclusions that the difference between the effects of TVAds and WebAds is significant? The answer is NO!
To compare the effect of TVAds directly to that of WebAds, we would test the difference between the two coefficients. Assuming the two estimates are uncorrelated, the test statistic is
T = (3-1) / sqrt(1^2 + 1^2) = 1.41
The p-value for this statistic is approximately 0.157, which indicates that the difference between the coefficients of TVAds and WebAds is not statistically significant (at the same 5% level).
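For readers who want to verify the arithmetic, here is a quick sketch in Python (assuming normal approximations and, importantly, uncorrelated coefficient estimates; note that the standard error of a difference is the square root of the sum of the squared standard errors):

```python
from math import sqrt
from statistics import NormalDist

# Coefficients and standard errors from the regression table above
b_tv, se_tv = 3.0, 1.0
b_web, se_web = 1.0, 1.0

def two_sided_p(z):
    """Two-sided p-value for a z-statistic under the normal approximation."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Individual tests: each coefficient against zero
print(round(two_sided_p(b_tv / se_tv), 3))    # TVAds: 0.003 -> "significant"
print(round(two_sided_p(b_web / se_web), 3))  # WebAds: 0.317 -> "not significant"

# Direct test of the difference between the two coefficients
z_diff = (b_tv - b_web) / sqrt(se_tv**2 + se_web**2)
print(round(two_sided_p(z_diff), 3))          # 0.157 -> also not significant
```

The point of the exercise: one "significant" and one "not significant" coefficient do not imply a significant difference between them.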
The authors give two more empirical examples that illustrate this phenomenon. There is no real solution other than to keep this pitfall in mind!
Thursday, November 02, 2006
"Direct Marketing" to capture voters
OK, I admit it - I did peek over the shoulder of my fellow Metro rider last night (while returning from teaching Classification Trees), to better see her Wall Street Journal's front page. The article that caught my eye was "Democrats, Playing Catch-Up, Tap Database to Woo Potential Voters". I only managed to catch the first few paragraphs before the newspaper owner flipped to the next page.
Luckily, my student Michael Melcer just emailed me the complete article. He put it very nicely:
Hi Professor,
Thought you might find this article interesting. Sounds like politicians are using regression with a binary y to predict who is likely to vote for a party member.
Regards,
Mike
So what exactly are we reading about? Apparently a direct marketing application used to identify people who are most likely to vote Democratic. "The technique aims to identify potential supporters by collecting and analyzing the unprecedented amount of information now readily available -- from census data to credit-card bills -- to profile individual voters."
In classic direct marketing, companies use information such as demographics and the historical relationship of the customer with the company (e.g., number of purchases, dates and amounts of purchases) as predictor variables. They use data from a pilot study or previous campaigns to collect information on the outcome variable of interest - did the customer respond to the marketing effort (e.g., a credit card solicitation)? Combining the predictor and outcome information, a model is created that predicts the probability of responding to the marketing, based on the predictor information. In the voting solicitation context, the company "developed mathematic formulas based on such factors as length of residence, amount of money spent on golf, voting patterns in recent elections and a handful of other variables to calculate the likelihood that a particular American will vote Democratic."
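To make this concrete, here is a toy sketch of such a response model using scikit-learn's logistic regression. The predictor names echo the article's examples, but the data (and hence the fitted model) are entirely made up - this is an illustration of the workflow, not of their actual method:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Hypothetical predictors standing in for the article's examples
X = np.column_stack([
    rng.integers(0, 30, n),    # years at current residence
    rng.exponential(200, n),   # annual golf spending ($)
    rng.integers(0, 2, n),     # voted in the last election? (0/1)
])
# Hypothetical outcome from a past campaign: did the person respond?
y = rng.integers(0, 2, n)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Predicted probability of responding, used to rank prospects
scores = model.predict_proba(X)[:, 1]
# "Microtargeting": solicit only the top decile by predicted probability
top_decile = np.argsort(scores)[::-1][: n // 10]
```

In practice the model would be trained on past-campaign data and scored on fresh prospects, but the ranking step at the end is the part that matters for targeting.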
Since the final goal is to choose ("microtarget", as in the WSJ) the people who are most likely to vote Democratic and solicit them to vote, the model need not classify as many people as possible correctly (= model accuracy); rather, it should correctly rank the people most likely to vote Democratic. This is an important distinction: a model can have very low accuracy yet be excellent at capturing the top 10% of Democratic voters. The main tool for assessing such performance is the lift chart (AKA gains chart).
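Here is a minimal sketch of the cumulative-gains computation behind such a chart. It is model-agnostic: all it needs are predicted scores and actual outcomes:

```python
import numpy as np

def cumulative_gains(scores, actual):
    """Fraction of all positives captured in each top-k slice when
    cases are ranked by predicted score (descending)."""
    order = np.argsort(scores)[::-1]
    captured = np.cumsum(np.asarray(actual)[order])
    return captured / captured[-1]

# Toy example: the model's scores place one positive at the very top
scores = np.array([0.9, 0.8, 0.3, 0.2, 0.1])
actual = np.array([1, 0, 1, 0, 0])
gains = cumulative_gains(scores, actual)
print(gains)  # the top case alone captures half of all positives
```

A model with good lift pushes the gains curve up steeply at the left, even if its overall accuracy is unimpressive.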
So Michael - whether it is a logistic regression model, a classification tree, a neural network, or some other classification method, we will not know. But the term "mathematical formulas" does hint at statistically oriented models such as logistic regression rather than machine-learning methods. I suppose we should get our hands on such "publicly available datasets" and see what gives good lift!
Wednesday, November 01, 2006
Numb3rs episode on logit function
The last episode of Numb3rs (the CBS show), broadcast on Friday, Oct 27, was called "Longshot". Here is the description:
In the episode, Don brings Charlie a notebook that was found on the body. It contains horse racing data and equations. Charlie determines that the equations were designed to pick the SECOND place winner, not first place. Parts of these equations use the "logit" function, a specific probability function that uses logarithms and odds ratios. Because the logit function can get pretty complicated, this activity lays its foundations, namely the relationship between probability, odds, and odds ratios.
This is a nice way to introduce the building block for the logistic regression model, which relates a set of predictor variables to an outcome variable that is binary (e.g., buyer/non-buyer). Unlike a linear regression model, where the (numerical) outcome variable is a linear function of the predictors, here the relationship is between the predictors and the logit of the probability of the (binary) outcome.
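For those who want to play with the building block itself, here is a minimal sketch of the logit and its inverse (the logistic function), plus an odds ratio:

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Back from log-odds to a probability (the logistic function)."""
    return 1 / (1 + math.exp(-x))

def odds(p):
    """Odds corresponding to probability p."""
    return p / (1 - p)

print(logit(0.5))               # 0.0 -- even odds
print(round(inv_logit(2.0), 3)) # 0.881
print(odds(0.8) / odds(0.5))    # odds ratio: 4 vs 1, i.e., 4
```

In logistic regression the logit of the outcome probability is modeled as a linear function of the predictors, which is exactly why probability, odds, and odds ratios are the right foundations to lay first.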
Tuesday, October 24, 2006
Webcasts on data mining
Moshe Cohen, a current MBA student in my class, pointed out an interesting set of webcasts on data mining called "Best Practices in Data Mining", by Insightful Corporation. Part I (now archived) describes a few scenarios where data mining is useful in the business context. They show some examples of questions of interest, datasets that are used in such applications, and the analysis process. Of course, their software InsightfulMiner is also showcased. I especially liked the emphasis on data visualization and the SAS-EM-like "working" chart. They also discuss data preprocessing, with some detail on missing values and outliers. Further topics include a bit on clustering, classification trees, and some evaluation methods.
Part II, upcoming on Nov 16, is supposed to cover additional topics. Here is the official description:
Part 2 of the series will explore further data modeling methods such as survival methods. In addition we will discuss how the results of data mining can be leveraged to understand underlying customer behavior, for example. Such knowledge is then valuable in choosing appropriate business actions, such as designing an optimal marketing campaign.
Saturday, October 14, 2006
It's competition season: and now Netflix
An exciting new dataset is out there for us data aficionados! Netflix, the huge movie renter, has announced a $1 million prize for whoever can improve on their Cinematch algorithm for predicting movie ratings. The competition started at the beginning of the month and has already created a lot of buzz. The company has released a huge training set that includes millions of movie ratings. Competing teams can use this dataset to come up with prediction algorithms and then submit predictions for a test set.
The training dataset contains more than 100 million ratings from a random sample of 480,000 (unidentifiable) users on 18,000 movies.
The $1 million grand prize goes to the team that can reduce the RMSE of Cinematch by 10% on the test set. There are also modest $50,000 "progress prizes".
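The competition metric itself is simple to compute. Here is a sketch of RMSE and the 10%-improvement target; the baseline value below is a hypothetical stand-in, not Netflix's published Cinematch score:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

# Toy ratings on the 1-5 star scale
predicted = [3.5, 4.0, 2.5, 5.0]
actual = [4, 4, 3, 4]
print(round(rmse(predicted, actual), 3))  # 0.612

# The grand prize requires beating the baseline RMSE by 10%
baseline = 0.95  # hypothetical stand-in for Cinematch's test-set RMSE
target = 0.9 * baseline
```

Note that RMSE penalizes large misses heavily (errors are squared), so a few badly mispredicted ratings hurt more than many small ones.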
Putting aside the monetary incentive, and the goal of beating Cinematch on the test set, this is a great dataset for research purposes. And Netflix has been generous enough to allow usage of the data for research.
Another fun aspect is reading the postings on the forum! The various opinions, questions, and answers are a feast for anyone interested in online communities.
Saturday, October 07, 2006
Nation's favorite professors - in statistics???
When introductions are made, and the question comes "so what do you do?" I sheepishly reply "I teach statistics at University of Maryland's business school". The two most popular reactions are
(1) a terrified look -- "statistics? oh, I had to take that in undergrad!", or
(2) a dazed look -- "Wow!" [which really means, "I didn't understand any of it, so how did you figure it out?"]
But sometimes I do come across people who get all excited and say they took a statistics course and LOVED it. And very often it is attributable to the professor. Indeed, from my own school experience I found that statistics can be taught in extremely different ways: boring, scary, and vague, or exciting, useful, and challenging!
Our very own Professor Erich Studer-Ellis, who teaches the core statistics undergraduate classes at the business school, has just been named by BusinessWeek as one of the nation's favorite undergraduate business school professors. Yes -- this is possible! And this despite the huge class sizes that he teaches (typically in the hundreds). In fact, among our statistics professors, a majority have received teaching awards. But more importantly, when I meet their students, their eyes glow when they hear "statistics".
The bottom line is, therefore, don't create an impression of statistics based on a sample of n=1 course, unless that impression happens to be positive.