Last week Lori Rothenberg from SAS Higher Education visited our MBA class. She gave a 1-hour tutorial on SAS Enterprise Miner, which is the data mining software package by SAS. This is a pretty powerful tool, especially when dealing with large datasets. One of the nicest features of SAS EM is the "workspace", which displays a diagram of the entire modeling process, from data specification, through data manipulation, modeling, evaluation, and scoring.

Aside from software training, Lori described several real data mining projects, and how they were able to add value to the businesses. This further supports the course effort to emphasize the power of data mining in the business intelligence context.

The SAS Higher Education group offers tutorials, workshops and other resources (e.g., summer programs) to instructors - most of these are free! We've had a great experience with them.

## Thursday, November 30, 2006

## Saturday, November 11, 2006

### p-values do bite

I've discussed the uselessness of p-values in very large samples, where even minuscule effects become statistically significant. This is known as the divergence between practical significance and statistical significance.

An interesting article in the most recent issue of *The American Statistician* describes another dangerous pitfall in using p-values. In their article *The Difference Between "Significant" and "Not Significant" is not Itself Statistically Significant*, Andrew Gelman (a serious blogger himself!) and Hal Stern warn that comparing p-values to one another in order to discern a difference between the corresponding effects (or parameters) is erroneous.

Consider, for example, fitting the following regression model to data:

Sales = beta0 + beta1 TVAdvertising + beta2 WebAdvertising

(Say, Sales are in thousands of $, and advertising is in $.)

Let's assume that we get the following coefficient table:

| | Coef (std err) | p-value |
|---|---|---|
| TVAds | 3 (1) | 0.003 |
| WebAds | 1 (1) | 0.317 |

We would reach the conclusion (at, say, a 5% significance level) that TVAds contribute significantly to sales revenue (after accounting for WebAds), and that WebAds do not contribute significantly to sales (after accounting for TVAds). Could we therefore conclude from these two opposite significance conclusions that the difference between the effects of TVAds and WebAds is significant? The answer is NO!

To compare the effect of TVAds directly to that of WebAds, we would use the following statistic (assuming the two coefficient estimates are uncorrelated):

T = (3-1) / sqrt(1^2 + 1^2) = 1.41

The p-value for this statistic is 0.157, which indicates that the difference between the coefficients of TVAds and WebAds is not statistically significant (at the same 5% level).
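The comparison can be checked numerically. The sketch below is only an illustration: the function names are made up for this post, it uses a standard normal reference distribution, and it assumes the two coefficient estimates are uncorrelated, so that the standard error of their difference is sqrt(1^2 + 1^2). In a real regression the covariance between the two estimates would also enter that standard error.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal reference."""
    return math.erfc(abs(z) / math.sqrt(2))

def compare_coefficients(b1, se1, b2, se2):
    """z statistic and p-value for b1 - b2.

    ASSUMPTION: the two estimates are uncorrelated, so the variance of the
    difference is simply the sum of the two squared standard errors.
    """
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
    z = (b1 - b2) / se_diff
    return z, two_sided_p(z)

# Individual tests, using the coefficient table from the post
p_tv = two_sided_p(3 / 1)    # TVAds: coef 3, std err 1
p_web = two_sided_p(1 / 1)   # WebAds: coef 1, std err 1

# Direct test of the difference between the two coefficients
z_diff, p_diff = compare_coefficients(3, 1, 1, 1)

print(f"TVAds p-value:  {p_tv:.3f}")
print(f"WebAds p-value: {p_web:.3f}")
print(f"Difference: z = {z_diff:.2f}, p-value = {p_diff:.3f}")
```

One coefficient is individually significant at the 5% level and the other is not, yet the test of their difference gives a p-value well above 5%.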

The authors give two more empirical examples that illustrate this phenomenon. There is no real solution other than to keep this anomaly in mind!

## Thursday, November 02, 2006

### "Direct Marketing" to capture voters

OK, I admit it - I did peek over the shoulder of my fellow Metro rider last night (while returning from teaching Classification Trees), to get a better look at the front page of her Wall Street Journal. The article that caught my eye was "Democrats, Playing Catch-Up, Tap Database to Woo Potential Voters". I only managed to catch the first few paragraphs before the newspaper's owner flipped to the next page.

Luckily, my student Michael Melcer just emailed me the complete article. He put it very nicely:

Hi Professor,

Thought you might find this article interesting. Sounds like politicians are using regression with a binary y to predict who is likely to vote for a party member.

Regards, Mike

So what exactly are we reading about? Apparently a direct marketing application used to capture people who are most likely to vote (Democratic). "The technique aims to identify potential supporters by collecting and analyzing the unprecedented amount of information now readily available -- from census data to credit-card bills -- to profile individual voters."

In classic direct marketing, companies use information such as demographics and the historical relationship of the customer with the company (e.g., number of purchases, dates and amounts of purchases) as predictor variables. They use data from a pilot study or previous campaigns to collect information on the outcome variable of interest: did the customer respond to the marketing effort (e.g., respond to a credit card solicitation)? Combining the predictor and outcome information, a model is created that predicts the probability of responding to the marketing, based on the predictor information. In the voting solicitation context, the company "developed mathematic formulas based on such factors as length of residence, amount of money spent on golf, voting patterns in recent elections and a handful of other variables to calculate the likelihood that a particular American will vote Democratic."

Since the final goal is to choose ("microtarget", as the WSJ puts it) the people who are most likely to vote Democratic and solicit them to vote, the model does not necessarily need to correctly classify as many people as possible (= model accuracy). Rather, it needs to correctly rank people by how likely they are to vote Democratic. This is an important distinction: while some models can have very low accuracy, they might be excellent at capturing the top 10% of Democratic voters. The main tool for assessing such performance is the lift chart (AKA gains chart).
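To make the distinction between accuracy and ranking concrete, here is a minimal sketch in Python. The ten scores and labels are invented purely for illustration (they are not from the article), and the function names are hypothetical. Gains is the share of all actual positives captured in the top-ranked fraction of records; lift is that share divided by the fraction of records contacted.

```python
def accuracy(scores, labels, cutoff=0.5):
    """Fraction of records classified correctly when score >= cutoff means 'positive'."""
    correct = sum((s >= cutoff) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)

def gains_and_lift(scores, labels, fraction):
    """Gains and lift for the top `fraction` of records, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(round(fraction * len(ranked))))
    captured = sum(y for _, y in ranked[:n_top])
    gains = captured / sum(labels)           # share of all positives captured
    lift = gains / (n_top / len(ranked))     # relative to contacting people at random
    return gains, lift

# Made-up data: 1 = votes Democratic. All scores are above 0.5, so a 0.5 cutoff
# classifies everyone as positive (poor accuracy), yet the ordering is informative.
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.52]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

acc = accuracy(scores, labels)
gains_20, lift_20 = gains_and_lift(scores, labels, 0.20)
gains_40, lift_40 = gains_and_lift(scores, labels, 0.40)

print(f"Accuracy at 0.5 cutoff: {acc:.2f}")
print(f"Top 20%: gains = {gains_20:.2f}, lift = {lift_20:.2f}")
print(f"Top 40%: gains = {gains_40:.2f}, lift = {lift_40:.2f}")
```

In this toy example the model classifies only 30% of the records correctly, but contacting the top-ranked 20% reaches two of the three Democratic voters, more than three times what random selection would be expected to reach.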

So Michael - whether it is a logistic regression model, a classification tree, a neural network, or any other classification method, we will not know. But the term "mathematical formulas" does hint at statistically oriented models such as logistic regression rather than machine-learning methods. I suppose we should get our hands on such "publicly available datasets" and see what gives good lift!

## Wednesday, November 01, 2006

### Numb3rs episode on logit function

The last episode of Numb3rs (the CBS show), which was broadcast on Friday, Oct 27, was called *Longshot*. Here is the description:

In the episode, Don brings Charlie a notebook that was found on the body. It contains horse racing data and equations. Charlie determines that the equations were designed to pick the SECOND place winner, not first place. Parts of these equations use the "logit" function, a specific probability function that uses logarithms and odds ratios. Because the logit function can get pretty complicated, this activity lays its foundations, namely the relationship between probability, odds, and odds ratios.

This is a nice way to introduce the building block for the logistic regression model, which relates a set of predictor variables to an outcome variable that is binary (e.g. buyer/non-buyer). Unlike a linear regression model where the (numerical) outcome variable is a linear function of the predictors, here the relationship is between the logit of the (binary) outcome variable and the predictors.
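The relationship between probability, odds, odds ratios, and the logit can be sketched in a few lines of Python. The function names and the coefficients `b0` and `b1` below are hypothetical, chosen only to illustrate the idea; they do not come from the episode or from any fitted model.

```python
import math

def odds(p):
    """Odds corresponding to a probability p (0 < p < 1)."""
    return p / (1 - p)

def logit(p):
    """Log-odds of a probability p."""
    return math.log(odds(p))

def inverse_logit(x):
    """Map a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-x))

# Probability -> odds -> logit
p_buyer = 0.8
print(f"odds  = {odds(p_buyer):.2f}")    # 0.8 / 0.2
print(f"logit = {logit(p_buyer):.3f}")   # natural log of the odds

# Odds ratio comparing p = 0.8 with p = 0.5
odds_ratio = odds(0.8) / odds(0.5)
print(f"odds ratio = {odds_ratio:.2f}")

# Logistic regression: the logit of the outcome is linear in the predictor.
# b0 and b1 are hypothetical coefficients, not estimates from real data.
b0, b1 = -2.0, 0.5
x = 4.0
p_hat = inverse_logit(b0 + b1 * x)
print(f"predicted probability = {p_hat:.2f}")
```

Because the logit maps probabilities in (0, 1) onto the whole real line, a linear function of the predictors can be used on the logit scale and then converted back to a valid probability.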
