## Friday, December 08, 2006

### Reverend Bayes and God

An interesting article on ScientificAmerican.com asks: "What is Bayes's theorem, and how can it be used to assign probabilities to questions such as the existence of God?"

The article describes Bayes' theorem, which relates the conditional probability of event A given B to the conditional probability of event B given A. This is a very powerful theorem with many practical applications. The reason is that we usually want to predict an event in the future (A) given an event in the past (B), but that is hard to do directly. Instead, we consider the opposite scenario: if event A happens in the future, what is the probability that it was preceded by event B?
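For reference, here is the theorem in its usual textbook form, with A and B as in the paragraph above (this is the standard statement, not a formula quoted from the article):

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0
```

So the "hard" direction, P(A | B), is obtained from the reversed conditional P(B | A) together with the overall probabilities of A and of B.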

The powerful Naive Bayes classifier is based on Bayes' theorem, with a slight modification.

The final part of the article discusses three books that have tried to use Bayes' theorem to compute the probability that God exists. The author very reasonably argues that this is not the right tool for answering this question.

An interesting question is whether Reverend Thomas Bayes himself reached his theorem while contemplating theological issues. Did he even try to marry his theology and science?

## Thursday, November 30, 2006

### SAS Enterprise Miner

Last week Lori Rothenberg from SAS Higher Education visited our MBA class. She gave a 1-hour tutorial on SAS Enterprise Miner, which is the data mining software package by SAS. This is a pretty powerful tool, especially when dealing with large datasets. One of the nicest features of SAS EM is the "workspace", which displays a diagram of the entire modeling process, from data specification, through data manipulation, modeling, evaluation, and scoring.

Aside from software training, Lori described several real data mining projects, and how they were able to add value to the businesses. This further supports the course effort to emphasize the power of data mining in the business intelligence context.

The SAS Higher Education group offers tutorials, workshops and other resources (e.g., summer programs) to instructors - most of these are free! We've had a great experience with them.

## Saturday, November 11, 2006

### p-values do bite

I've discussed the uselessness of p-values in very large samples, where even minuscule effects become magnified. This is known as the divergence between practical significance and statistical significance.

An interesting article in the most recent issue of The American Statistician describes another dangerous pitfall in using p-values. In their article "The Difference Between 'Significant' and 'Not Significant' is not Itself Statistically Significant", Andrew Gelman (a serious blogger himself!) and Hal Stern warn that comparing p-values to one another in order to discern a difference between the corresponding effects (or parameters) is erroneous.

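Consider, for example, fitting the following regression model to data:

Sales = b0 + b1 TVAds + b2 WebAds + error

(say, Sales are in thousands of \$, and advertising is in \$.)

Let's assume that we get the following coefficient table:

| | Coef (std err) | p-value |
|---|---|---|
| TVAds | 3 (1) | 0.003 |
| WebAds | 1 (1) | 0.317 |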

We would reach the conclusion (at, say, a 5% significance level) that TVAds contribute significantly to sales revenue (after accounting for WebAds), and that WebAds do not contribute significantly to sales (after accounting for TVAds). Could we therefore conclude from these two opposite significance conclusions that the difference between the effects of TVAds and WebAds is significant? The answer is NO!

To compare the effects of TVAds directly to WebAds, we would use the following statistic (treating the two coefficient estimates as uncorrelated):

T = (3-1) / sqrt(1^2 + 1^2) = 1.41

The p-value for this statistic is 0.157, which indicates that the difference between the coefficients of TVAds and WebAds is not statistically significant (at the same 5% level).
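The comparison can be checked with a few lines of Python. This is a minimal sketch that assumes a normal approximation for the p-values and uncorrelated coefficient estimates (in a real regression you would also need the covariance between the two estimates); the function names are mine, and the inputs are the coefficients 3 and 1 with standard errors of 1 each.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

def compare_coefficients(b1, se1, b2, se2):
    """z statistic and p-value for b1 - b2, assuming uncorrelated estimates."""
    z = (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
    return z, two_sided_p(z)

p_tv = two_sided_p(3 / 1)    # TVAds tested on its own
p_web = two_sided_p(1 / 1)   # WebAds tested on its own
z_diff, p_diff = compare_coefficients(3, 1, 1, 1)

print(f"TVAds p = {p_tv:.3f}, WebAds p = {p_web:.3f}")
print(f"difference: z = {z_diff:.2f}, p = {p_diff:.3f}")
```

Under these assumptions, one coefficient falls below the 5% threshold and the other above it, yet the test on their difference does not come close to significance.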

The authors give two more empirical examples that illustrate this phenomenon. There is no real solution other than to keep this anomaly in mind!

## Thursday, November 02, 2006

### "Direct Marketing" to capture voters

OK, I admit it - I did peek over the shoulder of my fellow Metro rider last night (while returning from teaching Classification Trees), to better see her Wall Street Journal's front page. The article that caught my eye was "Democrats, Playing Catch-Up, Tap Database to Woo Potential Voters". I only managed to catch the first few paragraphs before the newspaper owner flipped to the next page.

Luckily, my student Michael Melcer just emailed me the complete article. He put it very nicely:
Hi Professor,
Thought you might find this article interesting. Sounds like politicians are using regression with a binary y to predict who is likely to vote for a party member.
Regards, Mike

So what exactly are we reading about? Apparently a direct marketing application used to capture people who are most likely to vote Democratic. "The technique aims to identify potential supporters by collecting and analyzing the unprecedented amount of information now readily available -- from census data to credit-card bills -- to profile individual voters."

In classic direct marketing, companies use information such as demographics and the historical relationship of the customer with the company (e.g., number of purchases, dates and amounts of purchases) as predictor variables. They use data from a pilot study or previous campaigns to collect information on the outcome variable of interest - did the customer respond to the marketing effort (e.g., respond to a credit card solicitation)? Combining the predictor and outcome information, a model is created that predicts the probability of responding to the marketing, based on the predictor information. In the voting solicitation context, the company "developed mathematic formulas based on such factors as length of residence, amount of money spent on golf, voting patterns in recent elections and a handful of other variables to calculate the likelihood that a particular American will vote Democratic."

Since the final goal is to choose ("microtarget", as in the WSJ) the people who are most likely to vote Democratic and solicit them to vote, the model need not correctly classify as many people as possible (that is, maximize accuracy). Rather, it should correctly rank people by how likely they are to vote Democratic. This is an important distinction: while some models can have very low accuracy, they might be excellent at capturing the top 10% of Democratic voters. The main tool for assessing such performance is the lift chart (AKA gains chart).
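To make the ranking idea concrete, here is a small sketch of the number behind the first point of a lift chart: the response rate among the top-scored 10% divided by the overall response rate. The scores and outcomes below are made up purely for illustration, and `top_fraction_lift` is a hypothetical helper name, not something from the article.

```python
def top_fraction_lift(scores, outcomes, fraction=0.10):
    """Response rate in the top-scored fraction divided by the overall rate."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(round(len(ranked) * fraction)))
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    base_rate = sum(outcomes) / len(outcomes)
    return top_rate / base_rate

# Hypothetical data: 20 people, model score and actual vote (1 = Democratic)
scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35, 0.30,
          0.28, 0.25, 0.22, 0.20, 0.18, 0.15, 0.12, 0.10, 0.08, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
            0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

lift = top_fraction_lift(scores, outcomes)
print(f"Top-decile lift: {lift:.1f}")
```

A lift well above 1 means that contacting only the top-ranked people reaches far more likely supporters than contacting the same number of people at random, regardless of how the model does on overall accuracy.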

So Michael - whether it is a logistic regression model, a classification tree, a neural network, or any other classification method, we will not know. But the term "mathematical formulas" does hint at statistically oriented models such as logistic regression rather than machine-learning methods. I suppose we should get our hands on such "publicly available datasets" and see what gives good lift!

## Wednesday, November 01, 2006

### Numb3rs episode on logit function

The last episode of Numb3rs (the CBS show), which was broadcast on Friday Oct 27, was called "Longshot". Here is the description:

In the episode, Don brings Charlie a notebook that was found on the body. It
contains horse racing data and equations. Charlie determines that the equations
were designed to pick the SECOND place winner, not first place. Parts of these
equations use the "logit" function, a specific probability function that uses
logarithms and odds ratios. Because the logit function can get pretty
complicated, this activity lays its foundations, namely the relationship between
probability, odds, and odds ratios.

This is a nice way to introduce the building block for the logistic regression model, which relates a set of predictor variables to an outcome variable that is binary (e.g. buyer/non-buyer). Unlike a linear regression model where the (numerical) outcome variable is a linear function of the predictors, here the relationship is between the logit of the (binary) outcome variable and the predictors.
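The chain from probability to odds to logit is short enough to write out. The sketch below uses the standard definitions (odds = p / (1 - p), logit = log of the odds); the function names are my own.

```python
import math

def odds(p):
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1 - p)

def logit(p):
    """Log-odds of an event with probability p."""
    return math.log(odds(p))

def inverse_logit(x):
    """Map a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-x))

for p in (0.2, 0.5, 0.8):
    print(f"p = {p}: odds = {odds(p):.2f}, logit = {logit(p):.3f}")
```

In logistic regression it is the logit, which can take any value on the real line, that is modeled as a linear function of the predictors; `inverse_logit` then converts the fitted value back into a probability between 0 and 1.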

## Tuesday, October 24, 2006

### Webcasts on data mining

Moshe Cohen, a current MBA student in my class, pointed out an interesting set of webcasts on data mining called "Best Practices in Data Mining", by the Insightful corporation. Part I (now archived) describes a few scenarios where data mining is useful in the business context. They show some examples of questions of interest, datasets that are used in such applications, and the analysis process. Of course, their software InsightfulMiner is also showcased. I especially liked the emphasis on data visualization, and the SAS-EM-like "working" chart. They also discuss data preprocessing with some detail on missing values and outliers. Further topics include a bit on clustering, classification trees, and some evaluation methods.

Part II, upcoming on Nov 16, is supposed to cover additional topics. Here is the official description:
Part 2 of the series will explore further data modeling methods such as
survival methods. In addition we will discuss how the results of data
mining can be leveraged to understand underlying customer behavior, for
example. Such knowledge is then valuable in choosing appropriate business
actions, such as designing an optimal marketing campaign.

## Saturday, October 14, 2006

### It's competition season: and now Netflix

An exciting new dataset is out there for us data aficionados! Netflix, the huge movie renter, announced a \$1 million prize for the winner of a competition to improve upon their Cinematch algorithm for predicting movie ratings. The competition started at the beginning of the month and has already created a lot of buzz. The company released a huge training set that includes millions of movie ratings. Competing teams can use this dataset to come up with prediction algorithms, and then submit predictions for a test set.

The training dataset contains more than 100 million ratings from a random sample of 480,000 (unidentifiable) users on 18,000 movies.

The \$1 million grand prize goes to the team that can reduce the RMSE of Cinematch by 10% on the test set. There are also modest \$50,000 "progress prizes".
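For readers who have not met RMSE (root mean squared error) before, here is a minimal sketch of the winning criterion: compute the RMSE of a baseline and of a challenger on the same test ratings, then check whether the relative reduction reaches 10%. The ratings and predictions are invented for illustration; they are not Netflix data or Cinematch output.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predictions and actual ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def relative_improvement(baseline_rmse, new_rmse):
    """Fraction by which new_rmse is lower than baseline_rmse."""
    return (baseline_rmse - new_rmse) / baseline_rmse

# Hypothetical test-set ratings and two sets of predictions
actual = [4, 3, 5, 2, 1]
baseline = [3, 3, 4, 3, 2]
challenger = [3.5, 3, 4.5, 2.5, 1.5]

gain = relative_improvement(rmse(baseline, actual), rmse(challenger, actual))
print(f"RMSE reduced by {gain:.0%}; qualifies: {gain >= 0.10}")
```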

Putting aside the monetary incentive, and the goal of beating Cinematch on the test set, this is a great dataset for research purposes. And Netflix has been generous enough to allow usage of the data for research purposes.

Another fun aspect is reading the postings on the forum! The various opinions, questions, and answers are a feast for anyone interested in online communities.

## Saturday, October 07, 2006

### Nation's favorite professors - in statistics???

When introductions are made, and the question comes "so what do you do?" I sheepishly reply "I teach statistics at University of Maryland's business school". The two most popular reactions are
(1) a terrified look -- "statistics? oh, I had to take that in undergrad!", or
(2) a dazed look -- "Wow!" [which really means, "I didn't understand any of it, so how did you figure it out?"]

But sometimes I do come across people who get all excited and say they took a statistics course and LOVED it. And very often it is attributable to the professor. Indeed, from my own school experience I found that statistics can be taught in extremely different ways: boring, scary, and vague, or exciting, useful, and challenging!

Our very own Professor Erich Studer-Ellis, who teaches the core statistics undergraduate classes at the business school, has just been named by BusinessWeek as one of the nation's favorite undergraduate business school professors. Yes -- this is possible! And this is in spite of the huge class sizes that he teaches (typically in the hundreds). In fact, among our statistics professors, a majority have received teaching awards. But more importantly, when I meet their students, their eyes glow when they hear "statistics".

The bottom line is, therefore, don't create an impression of statistics based on a sample of n=1 course, unless that impression happens to be positive.

## Friday, October 06, 2006

### Time Series Forecasting Competition

Forecasting transportation demand is important for multiple goals such as staffing, planning, and inventory control. The public transportation system in Santiago de Chile is currently going through a major effort of reconstruction (if you read Spanish, you can find more at www.transantiago.cl).

The 2006 Business Intelligence Competition (BI CUP 2006) focuses on forecasting demand for public transportation. They provide a training set of a time series of passengers arriving at a terminal, and the competitors must come up with a method for forecasting the test set, which consists of a few future days.

Although this problem is a great example of data mining for business intelligence, the winning criterion is the smallest mean absolute error (MAE), also known as mean absolute deviation (MAD). In other words, the closeness of the forecasts to the actual values in the test set is the only criterion for winning. This setup makes this more of a pure data mining problem, and much less one of business intelligence. Clearly, the most accurate model might be completely impractical. For example, it might be very computationally intensive, when in practice the model is supposed to produce real-time forecasts for many different series simultaneously. Or, there might be different costs associated with forecast errors at different times of the day (or different days of the week). These types of considerations, when included in the modeling phase, turn the data mining task into a business-related one.
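Here is a small sketch of the difference between the competition's criterion and a business-oriented one. The first function is the plain MAE; the second weights each error by a cost, which is my own illustration of the "different costs at different times" idea and not part of the competition rules. All numbers and names are hypothetical.

```python
def mean_absolute_error(forecasts, actuals):
    """Plain MAE: average absolute forecast error."""
    errors = [abs(f - a) for f, a in zip(forecasts, actuals)]
    return sum(errors) / len(errors)

def cost_weighted_error(forecasts, actuals, weights):
    """Average absolute error where each period's error carries its own cost weight."""
    weighted = [w * abs(f - a) for f, a, w in zip(forecasts, actuals, weights)]
    return sum(weighted) / sum(weights)

# Hypothetical passenger counts for four time periods
actuals = [120, 80, 200, 150]
forecasts = [110, 90, 180, 150]
weights = [1, 1, 3, 1]   # assume errors in the third (rush-hour) period cost 3x

mae = mean_absolute_error(forecasts, actuals)
weighted = cost_weighted_error(forecasts, actuals, weights)
print(f"MAE = {mae:.1f}, cost-weighted error = {weighted:.2f}")
```

Two models with the same MAE can look quite different once the errors are weighted this way, which is exactly the kind of business consideration that a pure accuracy criterion ignores.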

The competition organizers promised to provide more of the business details after the competition ends, at the end of October.

Five teams of MBAs in my data mining class are now working hard on this forecasting problem. A few might decide to formally participate in the competition.

Good luck to the competitors!

## Tuesday, September 26, 2006

### Cheating in MBA programs

First, Noah Kauffman, an ex-MBA student of mine, emailed me the story. Then I found it in BusinessWeek, and a quick search brought up the story in many news sources, university websites, and magazines. Each had a different title. Here are some examples:
MBA Students Are No. 1 - At Cheating (BusinessWeek, Oct 2 issue, page 14)
A Crooked Path Through B-School (BusinessWeek Online)
MBA Students Likelier to Cheat (Toronto Star)
National survey: MBA cheating prevalent (The Cavalier Daily)

All sources report the following about the study:

Linda Treviño, Franklin H. Cook Fellow in Business Ethics at Penn State's Smeal College, and her colleagues Donald McCabe of Rutgers and Kenneth Butterfield of Washington State examined survey results from 5,331 students at 32 graduate schools in Canada and the United States. They found that 56 percent of graduate business school students admitted to cheating one or more times in the past academic year compared to 47 percent of non-business students.

All of the articles that I found discuss the reasons and possible solutions to the cheating. But none describe more details about the study itself. So let's look at the numbers and what the research question is. First, we learn that the study compared the rate of business to non-business students who admit to cheating. The sample estimates were 56% for MBAs vs. 47% for non-MBAs. Does this sample difference generalize to the entire population of graduate students? Can we say that in general MBAs cheat more than other grad students? To find out, we need to know the breakdown of the sample (n=5331) into business and non-business students. Since I couldn't find it, let's go in the reverse direction -- what type of breakdown would lead us to believe in a real difference between the proportion of MBA vs. other grad cheaters?

You might recall from Stat101 a procedure for comparing proportions from two independent samples. To use this, we must assume that the MBAs and other grads constitute two independent samples (e.g., there were no MBAs who were also studying towards a different graduate degree). In that case, we take the difference between the sample proportions, 0.56-0.47=0.09, and see how far it is from zero, in standard errors. To compute the standard error we use the formula:
standard error = square-root{ p (1-p) (1/n1 + 1/n2) }

where n1 is the sample size of MBAs, n2 is the sample size of non-MBAs, and p is the weighted average of 0.47 and 0.56, weighted by the corresponding sample sizes. Since we don't know n1 or n2, I tried different values (remember that n1+n2=5331, so I only have to set n1). Here is what I get:

If the samples are relatively balanced (e.g., n1=2600 and n2=2731), then the distance between the MBA and non-MBA proportion of cheaters is more than 6 standard errors! This is a pretty compelling distance that supports the study's claim. If, on the other hand, the samples are very imbalanced, then we can get opposite results. For example, if the MBA sample had n1=100 students and the non-MBA sample had n2=5231 students, then the difference between 47% and 56% is less than 2 standard errors, which might be considered too weak as evidence.
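The two scenarios are easy to reproduce. The sketch below applies the pooled standard error formula given above to the reported proportions (0.56 and 0.47) and to the two sample-size splits I tried; the splits are guesses, since the real breakdown is unknown, and the function name is my own.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Distance between two sample proportions, in pooled standard errors."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_error

total = 5331
z_values = {}
for n_mba in (2600, 100):          # hypothetical numbers of MBAs in the sample
    n_other = total - n_mba
    z_values[n_mba] = two_proportion_z(0.56, n_mba, 0.47, n_other)
    print(f"n1 = {n_mba}, n2 = {n_other}: z = {z_values[n_mba]:.2f}")
```

Looping over a finer grid of n1 values would show roughly how small the MBA sample can be before the difference drops below 2 standard errors.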

The bottom line is that we really want to know more about the numbers from the study. Besides the breakdown of MBA and non-MBA samples, what was the response rate to the survey? Did all 5331 students reply? How were the samples drawn from the population of b-schools and other graduate programs? etc.

I guess we'll have to wait for the article, entitled "Academic dishonesty in graduate business programs: Prevalence, Causes and Proposed Action", which will be published in a forthcoming issue of the Academy of Management Learning and Education.

## Thursday, September 21, 2006

### Dylan on data exploration

The ease of use of many data analysis and data mining software packages has led to the dangerous tendency to jump to the model fitting stage without proper data exploration. Getting an initial understanding of the data via summarization and visualization is crucial for building good models.

Mike Melcer, a current MBA student in my data mining class, mentioned that Bob Dylan knew this well. He sings You don't need a weatherman to know which way the wind blows (from Subterranean Homesick Blues). The weatherman can, however, quantify the speed of the wind and the temperature. In other words, the modeling phase is there to formalize and quantify what you learn in the data exploration phase. But you do have to stick your head out of the window first.

## Tuesday, September 12, 2006

### What are decision trees?

The term "decision tree" has been used in two very different contexts, which causes some confusion. In the context of decision sciences (or decision making), it means a tree structure that assists in decision making, by mapping the different courses of action and assigning costs and probabilities to the different scenarios. There is a good description on the MindTools website.
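As a tiny illustration of the decision-sciences version, the sketch below evaluates a one-level tree: each course of action has scenarios with expert-assigned probabilities and payoffs, and the actions are compared by expected payoff. The actions and numbers are entirely hypothetical, and expected payoff is only one common way of evaluating such a tree.

```python
def expected_value(branches):
    """Expected payoff of one action: sum of probability * payoff over its scenarios."""
    return sum(prob * payoff for prob, payoff in branches)

# Hypothetical actions, each with (probability, payoff in dollars) scenarios
actions = {
    "launch product": [(0.6, 100_000), (0.4, -50_000)],
    "do nothing": [(1.0, 10_000)],
}

values = {name: expected_value(branches) for name, branches in actions.items()}
best_action = max(values, key=values.get)
print(values)
print("Best action by expected payoff:", best_action)
```

Note that nothing here is learned from data: the probabilities and payoffs come from the decision maker.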

In contrast, "decision trees" are also a popular name for classification trees (or regression trees), a data mining method for predicting an outcome from a set of predictor variables (see, for example, the description on Resample.com). Two well-known types of classification tree algorithms are CART (implemented in software such as CART, SAS Enterprise Miner, and the Excel add-on XLMiner) and C4.5 (implemented in SPSS). An alternative algorithm, which is more statistically oriented and widely used in marketing, is CHAID (implemented in multiple software packages).

Both types of decision trees are tools that are very useful in business applications and decision making. They both use a tree-structure and can generate rules. But otherwise, they are quite different in what they are used for, and how they operate. The decision-sciences decision tree relies on the expert to build the scenarios, assess costs and probabilities of events. In contrast the data-mining decision tree uses a large database of historic data to come up with rules that relate an outcome of interest with a set of predictor variables.

To see how much confusion the use of the same term for the two tools causes, check out the definition of Decision Tree in Wikipedia. The first paragraph refers to decision theory, while all the rest is the data mining version... So next time, when decision trees are mentioned, make sure to find out which tool they are talking about!

## Wednesday, August 16, 2006

### Webcast on Aug 17: Teaching Analytics in the B-School

Everything you wanted to know about teaching data mining at the B-school!

On Aug 17 at 13:00 my colleague Ravi Bapna and I will be hosted on a SAS webcast on teaching analytics in the business school. We will discuss the skills that students in such courses obtain and the growing demand in the market; teaching approaches and how to go about teaching such a course; how it ties to research and corporate involvement, and more.

To view and participate in the webcast, you can register at http://www.sas.com/govedu/events/112592/index.html . The webcast will also be archived and freely available later on the SAS website.

## Sunday, July 30, 2006

### Summer break

Please hold your breath for a little longer until I regain full speed and continue posting to Bzst. Even statisticians need a break! In the meantime, I'll just report that:

1. My evening data-mining for MBAs class that I will be teaching in Fall is almost full.

2. The textbook "Data Mining for Business Intelligence" that I co-authored is in press. But you can get a sneak preview at www.dataminingbook.com.

## Sunday, May 21, 2006

### Data Mining for Business Applications Workshop

The upcoming International Conference on Knowledge Discovery and Data Mining (KDD), to be held in August in Philadelphia, will feature a workshop on "Data Mining for Business Applications". The goals of the workshop are stated as:

1. Bring together researchers (from both academia and industry) as well as practitioners from different fields to talk about their different perspectives and to share their latest problems and ideas.
2. Attract business professionals who have access to interesting sources of data and business problems but not the expertise in data mining to solve them effectively.

I love attending KDD - it is a fun conference with lots of interesting talks and posters, which attracts industry people as well as academics from artificial intelligence/machine learning and a few statisticians (the cool ones, of course). Aside from the main conference there is a variety of workshops and tutorials. This conference has a competitive acceptance rate for papers, which guarantees high quality.

See you in Philly!

The latest AMSTAT NEWS, which is the monthly magazine of the American Statistical Association, has an interesting article by Bonnie Ray, a statistician at IBM Watson Research Center. She describes the wealth of activities (sections, conferences, etc.) by the sister organization INFORMS that are aimed at bringing together academics with industry professionals. In particular, she mentions the huge gap in the field of business and the burning need for quantitative and "statistically literate" experts in businesses.

I believe that one GREAT resource is the MBA program. Some of the students who take (in addition to a core statistics course) a hands-on, business-oriented data mining/analysis course have a big advantage: they not only understand and have tried out some analysis, but they are also well versed in the business world and in their field of concentration (marketing, finance, etc.). Some of my top students would be an incredible asset to any company.

It is prime time for the statistics community to embrace MBA programs and not only teach statistics, but also learn more about its use, challenges, and real applications in the business context.

## Thursday, May 04, 2006

### Special issue of Statistical Science on "Statistical Challenges in eCommerce Research"

Yes, the special issue is coming out this month! It will contain a collection of papers by statisticians and non-statisticians (researchers from information systems, marketing, and more). Lots of great data, methods, analyses, and open questions.

For those not familiar with the journal Statistical Science, this is a really neat and readable statistical journal that features special issues on interdisciplinary areas, interviews with famous statisticians, and more. My colleague Anindya Ghose from the Stern School of Business just sent me a few websites showing the HUGE impact factor of this journal (this is the ratio of the number of citations to the number of published articles) :
• According to Thomson's Sci-Bytes, in 2000-2004 it had an impact factor of 4.9, and was in the 4th place, after Bioinformatics, J. Computational Biology and Econometrica.
• Ranked 6th on the list of Top Journal in Statistics (among 72 journals) in an analysis by Professor Wayne Oldford from the University of Waterloo (and director of the center for computational mathematics in industry and commerce).
• Ranked as one of the two "Top Tier Review Journals" by Dept of Statistics at Florida State U (in a document that ranks journals for purposes of tenure and promotion...)

So hold your breath for the May issue. A preview of the cover and Table of Contents will be available at the 2nd Symposium on "Statistical Challenges in eCommerce" at Minneapolis (May 22-23). A leak from the editor says that the issue will also include an interview with a famous statistician...

## Friday, April 28, 2006

### p-values in LARGE datasets

We had an interesting discussion in our department today, the result of confining statisticians and non-statisticians in a maze-like building. Our colleague who called himself "non-stat-guru" sent a query to us "stat-gurus" (his labels) regarding p-values in a model that is estimated from a very large dataset.

The problem: a certain statistical model was fit to 120,000 observations (that's right, n=120K). And obviously, all p-values for all predictors turned out to be highly statistically significant.

Why does this happen and what does it mean?
When the number of observations is very large, standard errors of estimates become very small: a simple example is the standard error of the mean, which is equal to std/sqrt(n). Plug 1 million in that denominator! This means that the model has power to detect even minuscule changes.

For instance, say we want to test whether the average population IQ is 100 (remember that IQ scores are actually calibrated so that the average is 100...). We take a sample of 1 million people, measure their IQ and compute the mean and standard deviation. The null hypothesis is

H0: population mean (mu) = 100
H1: mu NOT 100

The test statistic is: T = {sample mean - 100 } / {sample std / sqrt(n)}

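The n=1,000,000 shrinks the denominator of the T statistic, and therefore inflates T, so that even a tiny deviation from 100 becomes statistically significant. For example, with a sample standard deviation of 15 (the usual IQ scale), the standard error is only 15/sqrt(1,000,000) = 0.015, so a sample mean of 100.05 already gives T = 3.3. But is such a difference practically significant??? Of course not.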
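To see the effect of n directly, the sketch below computes the T statistic from the formula above for the same tiny deviation at three sample sizes. The sample mean of 100.05 and standard deviation of 15 are illustrative assumptions, not real data, and the function name is my own.

```python
import math

def one_sample_t(sample_mean, sample_std, n, mu0=100.0):
    """T statistic for testing H0: population mean = mu0."""
    std_error = sample_std / math.sqrt(n)
    return (sample_mean - mu0) / std_error

t_by_n = {}
for n in (100, 10_000, 1_000_000):
    t_by_n[n] = one_sample_t(sample_mean=100.05, sample_std=15.0, n=n)
    print(f"n = {n:>9,}: T = {t_by_n[n]:.2f}")
```

The deviation from 100 is identical in all three cases; only the sample size changes, and with it the verdict of the test.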

The problem, in short, is that in large datasets statistical significance is likely to diverge from practical significance.

What can be done?

1. Assess the magnitude of the coefficients themselves and what their interpretation is. Their practical significance might be low. For example, in a model for cigarette box demand in a neighborhood grocery store, such as demand = a + b price, we might find a coefficient of b=0.000001 to be statistically significant (if we have enough observations). But what does it mean? An increase of \$1 in price is associated with an average increase of 0.000001 in the number of cigarette boxes sold. Is this relevant?

2. Take a random sample and perform the analysis on that. You can use the remaining data to test the robustness of the model.

Next time before driving your car, make sure that your windshield was not replaced with a magnifying glass (unless you want to detect every ant on the road).

## Wednesday, April 26, 2006

### Symposium on Statistical Challenges in eCommerce Research

The second symposium on statistical challenges in eCommerce will take place at the Carlson School of Management, University of Minnesota, May 22-23. For further details see http://misrc.csom.umn.edu/symposia/2006.05.22/

This symposium follows up the inaugural event held at the R. H. Smith School of Business of the University of Maryland last year, which brought together almost 100 researchers from the fields of information systems, statistics, data mining, marketing, and more. It was a stimulating event with lots of energy.

Last year's event led to collaborations, discussions, and a special issue of the high-impact journal Statistical Science, which should be out in May or August.

There is still a short time to submit an abstract (work in progress is welcome!)

## Tuesday, April 18, 2006

### Interactive visualization of data

Two interesting articles describe how interactive visualization tools can be used for deriving insight from large business datasets. They both describe tools developed by the Human-Computer Interaction Lab at the University of Maryland. For James Bond fans, this place reminds me of Q branch -- they come up with amazingly cool visualization tools that save your day.

The first article "Describing Business Intelligence Using Treemap Visualizations" by Ben Shneiderman describes Treemap, a tool for visualizing hierarchical data. Know smartmoney.com's "Map of the Market"? Guess where that came from!

The second article "The Surest Path to Visual Discovery" by Stephen Few describes Timesearcher, an interactive visualization tool for time series data, such as stocks. For full disclosure, I've been involved in the development of the current version in adapting it to eCommerce-type data and in particular auction data such as those from eBay. Timesearcher2 is capable of displaying tightly-coupled time series and cross-sectional data (ever seen anything like it?). That means that each auction consists of a time series describing the bid history but also cross-sectional data such as the seller id and rating, item category, etc. For more details on the auction implementation check out our paper Exploring Auction Databases Through Interactive Visualization.

### You can't escape Bayes...

My students are currently studying for a quiz on classification. One of the classifiers that we talked about is the Naive Bayes classifier. On Saturday evening I received a terrific email from Jason Madhosingh, one of my students. He writes:

So, I'm taking a break from studying this evening by watching "numb3rs", a
CBS crime drama where a mathematician uses math to help solve crimes (go
figure). Of course... he brings us Bayes theorem. There truly is no escape!

And as a visualization junkie, I was also thrilled about his last comment "They also did a 3D scatterplot"!

I guess I'll have to check out this series - does anyone have it taped? From the CBS website it looks like I can even get early previews as a teacher...

Teachers can opt in to order a Teaching Kit including a specially designed classroom poster, and can view the new classroom activities coordinated with each show episode on this website, a week prior to the show.

## Monday, April 10, 2006

### Patenting predictive models?

A curious sentence in a short BusinessWeek report sent me hunting for clues. In Rep of a (Drug) Salesman, a consulting firm by the name of TargetRx "claims it can identify what really makes a sales rep effective".

From a press release on TargetRx's website I found the following:

Data collected from physicians via survey are then merged with actual prescribing and other behavioral data and analyzed using proprietary analytic methods to develop predictive models of physician prescribing behavior. The proprietary analytics are based in part on TargetRx's patent-pending Method and
System for Analyzing the Effectiveness of Marketing Strategies. TargetRx
received notice of allowance on this patent application from the U.S. Patent and
Trademark Office in January 2006. The unique method of collecting and analyzing
data enables TargetRx to predict prescribing changes as well as decompose
prescribing to understand what specific aspects of the promotion, product, or
physicians' interactions with patients and payors are causing changes.

I was not able to find any further details. The question is: what exactly is proprietary here? What does "unique method of collecting and analyzing" mean? Conducting surveys is not new, so that's not it. Linking the two data sources (survey data and prescribing data) doesn't sound too hard, if there is a unique identifier for each doctor. Has TargetRx developed a new predictive method???

Apropos sales reps and prescribing doctors, an interesting paper [1] by Rubin and Waterman, professors of statistics from Harvard and UPenn, discusses the use of "propensity scores" for evaluating causal effects such as marketing interventions. Unlike regression models, which cannot prove causality, propensity scores compare matched observations, where matching is based on the multivariate profile of each observation. The application in the paper is a model that ranks the doctors most likely to increase their prescribing due to a sales rep visit. An ordinary regression model that compares the number of prescriptions of doctors who were visited by sales reps with those who were not cannot account for phenomena such as: sales reps prefer visiting high-prescribing doctors because their compensation is based on the number of prescriptions!

1 "Estimating the Causal Effects of Marketing Interventions Using Propensity Score Methodology", D B Rubin and R P Waterman (2006), Statistical Science, special issue on "Statistical Challenges and Opportunities in eCommerce", forthcoming.

### Data mining and privacy

BusinessWeek touched upon a sensitive issue in the article If You're Cheating on Your Taxes. It's about federal and state agencies using data mining to find "the bad guys".

Although "data mining" is the term used in many of these stories, a more careful look reveals that there are hardly any advanced statistical/DM methods involved. The issue is the linkage/matching of different data sources. In the Statistical Challenges & Opportunities in eCommerce symposium last year, Stephen Fienberg, a professor of statistics at CMU and an expert on disclosure limitation, showed a semi-futuristic movie on a pizza parlor using an array of linked datasets to "customize" a delivery call (from the American Civil Liberties Union website). He also wrote a paper on privacy and data mining [1] that will come out soon in a special issue of the journal Statistical Science on the same topic (OK, I'll disclose that I co-edited this with Wolfgang Jank).

Another interesting document can be found on the American Statistical Association's website: FAQ Regarding the Privacy Implications of Data Mining.

The bottom line is that statistical/data mining methods or tools are not the evil. In fact, in some cases statistical methods allow the exact opposite: disclosing data in a way that allows inference but conceals any information that might breach privacy. This area is called Disclosure Limitation and is studied mainly by statisticians, operations researchers, and computer scientists.

1 "Privacy and Confidentiality in an E-Commerce World: Data Mining, Data Warehousing, Matching, and Disclosure Limitation," S E Fienberg (2006), Statistical Science, special issue on "Statistical Challenges and Opportunities in eCommerce", forthcoming.

### More on predictive vs. explanatory models

This week the predictive vs. explanatory modeling distinction came up on multiple occasions: first, in a study with an information systems colleague where the goal is to build a predictive application for ranking the most-likely auctions to transact; then, in an example that I gave in class of modeling eBay data in order to distinguish competitive from non-competitive auctions; and then, in a bunch of conversations with students that followed.

The point that I want to make here, which I did not mention directly in my previous post on this subject, is that the set of PREDICTORS your model will include can be very different if the goal is explanatory vs. predictive. Here's the eBay example: we have data on a set of auctions from eBay (from publicly available data on eBay.com). For each auction there is information on the product features (e.g., category, new/used), seller's features (e.g., rating), and auction features (e.g., duration, opening price, closing price).

Explanatory goal: To determine factors that lead auctions to be competitive (i.e., receive more than 1 bid).

Predictive goal: To build a seller-side application that will predict the chances that his/her auction will be competitive.

In the explanatory task, we are likely to include the closing price, hypothesizing that (perhaps) lower priced items are more likely to be competitive. However, for the predictive model we cannot include closing price, because it is not known at the start of the auction! In other words, we are constrained to information that is available at the time of prediction.
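To make the constraint concrete, here is a minimal sketch of how the candidate predictor set changes with the goal. The variable names and the "start"/"end" labels are my own illustrative assumptions, not fields from an actual eBay dataset.

```python
# When each piece of auction information becomes known.
# Names and labels are illustrative assumptions, not real eBay field names.
KNOWN_AT = {
    "category": "start",
    "new_or_used": "start",
    "seller_rating": "start",
    "duration": "start",
    "opening_price": "start",
    "closing_price": "end",   # only known once the auction is over
}

def candidate_predictors(goal):
    """Return the predictors we may consider for an explanatory or predictive goal."""
    if goal == "explanatory":
        # Any measured variable is fair game when the aim is understanding.
        return sorted(KNOWN_AT)
    if goal == "predictive":
        # Only information available at the time of prediction (auction start).
        return sorted(name for name, when in KNOWN_AT.items() if when == "start")
    raise ValueError("goal must be 'explanatory' or 'predictive'")

print(candidate_predictors("explanatory"))
print(candidate_predictors("predictive"))
```

The only difference between the two lists is the closing price, which is exactly the variable that is unavailable when the seller needs the prediction.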

Until now I have not found a published focused discussion on predictive modeling vs. building explanatory models. Statistics books tend to focus on explanatory models, whereas machine-learning sources focus on predictive modeling. Has anyone seen such a discussion?

## Friday, March 31, 2006

### Data mining courses in MBA programs

Data mining and advanced data analysis courses are becoming more and more popular in B-schools. I asked colleagues about such electives in their universities, and also did some web searching. It looks like the courses vary in topics and flavor, depending in many cases on the instructor's field of expertise (statistics, operations research, information systems, marketing, machine learning, etc). But all courses typically revolve around real business applications.

Here's an initial list that I've composed (alphabetical by school name). I hope that others who teach or are taking such courses will add to the list. Links to syllabuses are also welcome. This can be a great resource for instructors (sharing information, material, experiences, etc.) as well as for MBA students!

Fuqua School of Business, Duke: Data Mining (MGRECON 491)

Indian School of Business (ISB): Business Intelligence Using Data Mining (update by R Bapna)

Kelley School of Business, Indiana University: Data Mining (K513) Syllabus

Lundquist College of Business, U of Oregon: Information Analysis for Managerial Decisions (DSC 433/533) syllabus

McCombs School of Business, UTexas at Austin: Data Mining (MIS 382N.9)

Robert H Smith School of Business, U of MD College Park: Data Analysis for Decision Makers (BUDT733) syllabus

Sloan School of Management, MIT: Data Mining (15.062)

Stern School of Business, NYU: Data Mining and Knowledge Systems (B20.3336)

Tepper School of Business, CMU: Mining Data for Decision Making (45-863)

Wharton School, UPenn: Decision Support Systems (OPIM-410-/672)

Online courses:

Cardean Online University: Data Mining (EIMA 714)

Statistics.com online courses: Intro to Data Mining and Data Mining 2

## Thursday, March 30, 2006

### A cool book

One of my favorite statistics books that just makes you want to learn data analysis is Freakonomics: A Rogue Economist Explores the Hidden Side of Everything by Steven Levitt - an economist from the University of Chicago who was recently awarded the John Bates Clark Medal, and Stephen Dubner - an author and writer for the NYT and The New Yorker. Together they created a rare product: a data analysis book to take to bed!

The book describes several somewhat-wild studies in an attempt to answer questions that your kids might have asked you: "Why do drug dealers still live with their moms?" or "What do schoolteachers and Sumo wrestlers have in common?". The beauty of these studies is the creativity in the question of interest, sometimes the data collection or experimental design, structuring the analysis, and the interpretation of the results. It shows how careful modeling can lead to very interesting (if somewhat untraditional) insights.

It is an easy, fun read. And for those who know something about regression models, that is the main tool used. Recommending this to students at the start of the data analysis class usually brings a few back with starry eyes.

Turns out that the book is actually used in a variety of university courses and there is even a free study guide!

And of course, if you get totally hooked, there is a freakonomics blog.

## Tuesday, March 28, 2006

### Stairway to Heaven

"There's a lady who's sure all that glitters is gold
And she's buying a stairway to heaven
And when she gets there she knows if the stores are closed
With a word she can get what she came for"

Led Zeppelin were on to something -- The Mar-27 BusinessWeek cover story includes a report on Best Buy, which is trying to escape commodity hell via customer segmentation: "The company has divided its customers into five distinct demographic groups and is doing extensive market research to figure out how to serve them better".

On their website, a press release gives further details:
"As part of its customer centricity initiative, Best Buy identified five initial customer segments that it believes represent significant new growth opportunities or include some of the company's most profitable customers today. The segments include:
1. The affluent professional who wants the best technology and entertainment experience and who demands excellent service.
2. The focused, active, younger male customer who wants the latest technology and entertainment.
3. The family man who wants technology that improves his life - the practical
4. The busy suburban mom who wants to enrich her children's lives with technology and entertainment.
5. The small business customer who can use Best Buy's product solutions and services to enhance the profitability of his or her business. "

The million \$ question is how did they arrive at these segments? What does "initial customer segments" mean? Why "believe"? Why exactly 5? Is there a cluster analysis going on behind the scenes, or is it time for one?

## Monday, March 20, 2006

### Data mining for prediction vs. explanation

A colleague of mine and I have an ongoing discussion about what data mining is about. In particular, he claims that data mining focuses on predictive tasks, whereas I see the use of data mining methods for either prediction or explanation.

The distinction between predictive and explanatory tasks is not always easy: Of course, in both cases the goal is "future actionable results". In general, an explanatory goal (also called characterizing or profiling) is when we're interested in learning about factors that affect a certain outcome. In contrast, a predictive goal (also called classification, when we have a categorical response) is when we care less about "insights" and more about accurate predictions of future observations.

In some cases it is very easy to identify an explanatory task, because it simply doesn't make any sense to predict the outcome. For example, we might be looking at performance measures of male and female CEOs, in an attempt to understand the differences (in terms of performance) between the genders. In this case, we would most likely not try to predict the gender of a new observation based on their performance measures...

In most cases, however, it is harder to decide what the task is. This is especially because of the ambiguity of problem descriptions. People will say "I'd like you to use these data to see if we can understand the factors that lead to defaulting on a loan, so that we can predict for new applicants their chance to default". Since it is crucial for the analysis to know which type of task (explanatory vs. predictive) is required, I believe the only way out is to try and squeeze it out of the domain person with questions such as "what do you plan to do with the analysis results?" or "what if I found that..."

Here's an example: A marketing manager at a cable company hires your services to analyze their customer data in order "to know the household-characteristics that distinguish subscribers to premium channels (such as HBO) from non-subscribers." Is this a predictive goal? explanatory? It depends! Here are possible scenarios of what they will use the analysis for:
1. To market the premium channel plan in a new market only to people who are more likely to subscribe (predictive)
2. To re-negotiate the cable company's contract with HBO based on what they find (explanatory) -- this scenario is courtesy of a current MBA student

How will the analysis differ?
Depending on whether the goal is predictive or explanatory, the analysis process will take different avenues. This includes what performance measures are devised and used (e.g., to partition or not to partition? use an R-squared or a MAPE?), what types of methods are employed (a k-nearest neighbor won't have much explanatory power), etc.

## Thursday, March 09, 2006

### Impact of weather on the economy

The March 6 issue of BusinessWeek reports an increase in retail sales and housing starts this January (What Got The Economy's Bounce Going). The hypothesis is that the cause is the relatively warm weather (an average of 39.6F). So how can this be tested? BW mentions a study by James O'Sullivan, an economist with UBS, that tries to quantify the impact of warm January weather on retail sales and housing starts. They describe it as follows:

He looked at the historical relationship between data from several economic reports and deviations from the average December and January temperatures. December temperatures were used to capture any weather-related distortions that could carry over into January.

This doesn't tell us much about the model. The clue is in the reported results:

Based on these past relationships, the above-average January temperatures provided a 1.4-percentage-point boost to retail sales. Housing starts got a weather-related increase of approximately 200,000 units at an annualized rate, while the balmy temperatures may have accounted for all of the 0.7% rise in manufacturing output.

Here's my guess: this is a set of regression models! For retail sales, it might look like this:

log(retail sales) = a + b*(deviation from average Dec-or-Jan temp) + noise

The model for housing starts:

housing starts = a + b*(deviation from average Dec-or-Jan temp) + noise

The model for manufacturing output:

log(manufacturing output) = a + b*(deviation from average Dec-or-Jan temp) + noise

How do we know whether there is a log on the left hand side? Well, if the results are reported in percentages (a unit change in X is associated with a percentage change in Y), then there is most likely a log. In contrast, if the results are reported in the units of Y (for instance, units of housing starts), then there is no log.

A final interesting note made in the article is that "government's seasonal adjustment process, which tries to account for typical seasonal variation, can go awry when patterns are atypical". I'll describe how these seasonal adjustments are done in a future post.
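The log-versus-units rule can be checked numerically. The sketch below uses made-up numbers (not the UBS data, which I don't have) and a small hand-written least-squares helper, `fit_line`, which is my own illustrative function. With a logged response the slope reads as a percentage change per degree; with an untransformed response it reads in the units of Y.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Made-up data: temperature deviation (degrees F) and two hypothetical outcomes.
temp_dev = [-4.0, -2.0, 0.0, 2.0, 4.0, 6.0]
retail_sales = [100.0 * math.exp(0.002 * d) for d in temp_dev]  # multiplicative effect
housing_starts = [2000.0 + 30.0 * d for d in temp_dev]          # additive effect

# Logged response: the slope translates into a percentage change in Y per degree.
_, b_log = fit_line(temp_dev, [math.log(s) for s in retail_sales])
pct_per_degree = (math.exp(b_log) - 1) * 100

# Untransformed response: the slope is in units of Y per degree.
_, b_level = fit_line(temp_dev, housing_starts)

print(f"retail sales: {pct_per_degree:.2f}% per degree")
print(f"housing starts: {b_level:.1f} units per degree")
```

The noise term is left out here so that the slopes come back exactly; with real data the same reading of the coefficients applies, just with estimation error.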

## Thursday, March 02, 2006

### What jobs does a good data mining course land you?

I always love to hear from former students about the great jobs that they landed after graduating from the MBA program. It is especially great (and I must admit - surprising) when the data mining course they took at the Smith School was instrumental in getting that job.

Here's one of the more unusual stories (quoting from a former student's email):

"Just a couple weeks before completing the requirements of the program, thanks in no small part to what I learned [in the data mining course] ... I have accepted an offer for a dream job in the entertainment industry. Based on the analytical skills I cultivated through lessons on data modeling and analysis, and the analytical tools and techniques ... I have been selected by my company for a newly created position designed to help the nation’s largest concert promoter figure out how to optimize its booking strategy for major national tours. It’s about supplementing the intuition of industry experts and veterans with analytical insight, and introducing more method to the madness that characterizes the business side of Rock and Roll. Who said math and statistics isn’t cool?"

Any others out there with a cool story? Please post!

## Monday, February 27, 2006

In this week's issue of BusinessWeek (March 6, 2006), an article called The secret to Google's success describes a study by three economists showing that Google's mechanism for auctioning ad space (called AdWords), which is supposed to be a second-price auction, actually "differs in a key respect from the one economists had studied".

I tracked down a report on this study ("The high price of internet keyword auctions" by Edelman, Ostrovsky, and Schwarz) to find out more. And I found out something that is directly related to our work on eBay auctions...

Starting from the basics, a second-price auction is one where the highest bidder is the winner, and s/he pays the second highest bid (+ a small increment). This is also the format used in most of eBay's auctions. According to auction theory (derived from game theory), in a second-price auction the optimal bidding strategy should be to bid your true valuation. If you think the item is worth \$100, just bid \$100. Going back to the study, the economists found that the mechanism used by Google's AdWords does NOT lead to this "truth telling". Instead, sophisticated users actually tend to under-bid.
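Here is a minimal sketch of the second-price rule just described. The function name, the default increment of 1.0, and the cap of the price at the winner's own bid are my assumptions for illustration; real sites have their own increment tables and tie-breaking rules.

```python
def second_price_outcome(bids, increment=1.0):
    """Return (winner, price) for a second-price auction.

    bids: dict mapping bidder name -> bid amount (at least two bidders).
    The highest bidder wins and pays the second-highest bid plus a small
    increment. Assumption: the price never exceeds the winner's own bid.
    """
    if len(bids) < 2:
        raise ValueError("need at least two bids")
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    (winner, top_bid), (_, second_bid) = ranked[0], ranked[1]
    price = min(top_bid, second_bid + increment)
    return winner, price

bids = {"ann": 100.0, "bob": 80.0, "cy": 65.0}
print(second_price_outcome(bids))   # ('ann', 81.0)

# Raising the winning bid does not change what the winner pays.
bids["ann"] = 150.0
print(second_price_outcome(bids))   # ('ann', 81.0)
```

The second call hints at why bidding your true valuation is considered optimal in theory: the winner's payment depends on the other bids, not on her own.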

The authors' recommendation is "that search engines consider adopting a true Vickrey setup... [where] the system and bids would remain relatively static, changing only when economic fundamentals changed". And this is where I disagree: I have been conducting empirical research of online auctions from a different, non-economist perspective. Instead of starting from economic theory and trying to see how it manifests in the online setting, I examine the online setting and try to characterize it using statistical tools. An important hypothesis that my colleague Wolfgang Jank and I have is that the auction price is influenced not only by factors that economic theory sets (like the opening price and number of bidders), but also by the dynamics that take place during the auction. The online environment has very different dynamics than the older offline version. Think of the psychology that goes on when you are bidding for an item on eBay. This means that perhaps classic auction theory does not account for new factors that might determine the final price. Of course, this claim has always won us some frowns from hard-core economists...

For example, many empirical researchers have found in eBay (second-price) auctions that bidders do not follow the "optimal bidding strategy" of bidding their truthful valuation. In fact, on eBay many bidders tend to revise their bids as the auction proceeds (the phenomenon of last-moment-bidding, or "sniping", is also related). There have been different attempts to explain this through economic theory, but there hasn't been one compelling answer.

In light of my eBay research, it appears to me that the recommendation to Google to use the ordinary second-price (Vickrey) setting does not take into account the dynamic nature of the AdWords auctioning. The streaming updates of bids by advertisers probably create dynamics of their own. So even if they do change it to the eBay-like format, I am doubtful that the results will obey classic auction theory.

## Thursday, February 23, 2006

### Acronyms - in Hebrew???

There are a multitude of performance measures in statistics and data mining. These tend to have acronyms such as MAPE and RMSE. It turns out that even after spelling them out, it is not always obvious to users how they are computed.

Inspired by Dan Brown's The Da Vinci Code, I devised a deciphering method that allows simple computation of these measures. The trick is to read from right-to-left (like Hebrew or Arabic). Here are two examples:

RMSE = Root Mean Squared Error
1. Error: compute the errors (actual value - predicted value)
2. Squared: take a square of each error
3. Mean: take an average of all the squared errors
4. Root: take a square root of the above mean

MAPE = Mean Absolute Percentage Error
1. Error = compute the errors (actual value - predicted value)
2. Percentage = turn each error into a percentage by dividing by the actual value and multiplying by 100%
3. Absolute = take an absolute value of the percentage errors
4. Mean = take an average of the absolute values
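The right-to-left reading maps directly onto code, one line per word. Below is a minimal sketch in plain Python; the function names and the example numbers are mine. Note that MAPE divides by the actual values, so this sketch assumes none of them is zero.

```python
import math

def rmse(actual, predicted):
    errors = [a - p for a, p in zip(actual, predicted)]      # Error
    squared = [e ** 2 for e in errors]                       # Squared
    mean_squared = sum(squared) / len(squared)               # Mean
    return math.sqrt(mean_squared)                           # Root

def mape(actual, predicted):
    errors = [a - p for a, p in zip(actual, predicted)]      # Error
    percentages = [100.0 * e / a for e, a in zip(errors, actual)]  # Percentage
    absolutes = [abs(v) for v in percentages]                # Absolute
    return sum(absolutes) / len(absolutes)                   # Mean

actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(rmse(actual, predicted))
print(mape(actual, predicted))   # 5.0 (percent)
```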

## Wednesday, February 15, 2006

### Comparing models with transformations

In the process of searching for a good model, a popular step is to try different transformations of the variables. This can become a bit tricky when we are transforming the response variable, Y.

Consider, for instance, two very simple models for predicting home sales. Let's assume that in both cases we use predictors such as the home's attributes, geographical location, market conditions, time of year, etc. The only difference is that the first model is linear:

(1) SalesPrice = b0 + b1 X1 + ...

whereas the second model is exponential:

(2) SalesPrice = exp{c0 + c1 X1 + ...}

The exponential model can also be written as a linear model by taking a natural-log on both sides of the equation:

(2*) log(SalesPrice) = c0 + c1 X1 + ...

Now, let's compare models (1) and (2*). Let's assume that the goal is to achieve a good explanatory model of house prices for this population. Then, after fitting a regression model, we might look at measures such as the R-squared, the standard-error-of-estimate, or even the model residuals. HOWEVER, you will most likely find that model (2*) has a much lower error!

Why? This happens because we are comparing objects that are on two different scales. Model (1) yields errors in \$ units (assuming that the original data are in \$), whereas Model (2*) yields residuals in log(\$) units. A similar distortion will occur if we compare predictive accuracy using measures such as RMSE or MAPE. Standard software output will usually not warn you about this, especially if you created the transformed variable yourself.

So what to do? Compute the predictions of model (2*), then transform them back to the original units. In the above example, we'd take an exponent of the prediction to obtain a \$-valued prediction. Then, compute residuals by comparing the re-scaled predictions to the actual y-values. These will be comparable to those of a model with no transformation. You can even compare the re-scaled predictions with those from model (1) or any other model that has re-scaled predictions.

The unfortunate part is that you'll probably have to compute all the goodness-of-fit or predictive accuracy measures yourself, using the re-scaled residuals. But that's usually not too hard.
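Here is a minimal sketch of that do-it-yourself computation. All numbers are made up: `pred_model1` stands for \$ predictions from model (1) and `pred_log_model2` for log-scale predictions from model (2*). Neither comes from an actual fitted model.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of predictions against actual values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up sale prices and predictions, for illustration only.
actual_price = [200000.0, 350000.0, 500000.0]
pred_model1 = [210000.0, 340000.0, 480000.0]   # model (1): already in $
pred_log_model2 = [12.25, 12.70, 13.15]        # model (2*): in log($)

# Misleading: error of model (2*) measured on the log scale.
naive_rmse = rmse([math.log(y) for y in actual_price], pred_log_model2)

# Comparable: transform predictions back to $ first, then compute errors.
pred_model2 = [math.exp(p) for p in pred_log_model2]
rmse_model1 = rmse(actual_price, pred_model1)
rmse_model2 = rmse(actual_price, pred_model2)

print(f"model (2*) on log scale: {naive_rmse:.3f}")
print(f"model (1) in $:          {rmse_model1:.0f}")
print(f"model (2*) in $:         {rmse_model2:.0f}")
```

The log-scale number is tiny next to the \$-scale ones simply because of the units, which is the distortion described above. Only the last two numbers can be compared with each other.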

## Tuesday, February 14, 2006

### Data partitioning

A central initial step in data mining is to partition the data into two or three partitions. The first partition is called the training set, the second is the validation set, and if there is a third, it is usually called the test set.

The purpose of data partitioning is to enable evaluating model predictive performance. In contrast to an explanatory goal, where we want to fit the data as closely as possible, good predictive models are those that have high predictive accuracy. Now, if we fit a model to data, then obviously the "tighter" the model, the better it will predict those data. But what about new data? How well will the model predict those?

Predictive models are different from explanatory models in various aspects. But let's only focus on performance evaluation here. Indications of good model fit are usually high R-squared values, low standard-error-of-estimate, etc. These do not measure predictive accuracy.

So how does partitioning help measure predictive performance? The training set is first used to fit a model (this is also called "training the model"). The validation set is then used to evaluate model performance on new data that it did not "see". At this stage we compare the model predictions for the new validation data to the actual values and use different metrics to quantify predictive accuracy.

Sometimes, we actually use the validation set to tweak the original model. In other words, after seeing how the model performed on the validation data, we might go back and change the model. In that case we are "using" our validation data, and the model is no longer blind to them. This is when a third, test set, comes in handy. The final evaluation of predictive performance is then achieved by applying the model (which is based on the training data and tweaked using the validation data) to the test data that it never "saw".
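A random three-way partition can be sketched in a few lines. The 50/30/20 split, the function name, and the fixed seed are my own choices for illustration; other proportions are common too.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Randomly split records into training, validation and test sets.

    Whatever is left after the training and validation fractions goes to
    the test set. The fractions and the seed are illustrative defaults.
    """
    if not 0 < train_frac + valid_frac <= 1:
        raise ValueError("fractions must sum to at most 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # shuffle a copy, reproducibly
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))   # 50 30 20
```

Every record lands in exactly one partition, so the test set stays untouched until the final evaluation.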

## Thursday, February 09, 2006

### Translate "odds"

"Odds" is a technical term that is often used in horse or car racing. It refers to the ratio p/(1-p) where p is the probability of success. So for instance, 1:3 odds of winning are equivalent to a probability of 0.25 of winning.
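Converting between the two is a one-liner in each direction. A minimal sketch, with function names of my own choosing:

```python
def prob_to_odds(p):
    """Odds p/(1-p) for a probability p in [0, 1)."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return p / (1 - p)

def odds_to_prob(odds):
    """Probability odds/(1+odds) for non-negative odds."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)

print(prob_to_odds(0.25))     # about 0.333, i.e. 1:3
print(odds_to_prob(1 / 3))    # about 0.25
```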

What I found odd is that the term "odds" in this meaning does not exist in most languages! Usually, the closest you can get is "probability" or "chance". I first realized it when I tried to translate to Hebrew. Then, students who speak other languages (Spanish, Russian, Chinese) said that is the case in other languages as well.

Odds are important in data mining because they are the basis of logistic regression, a very popular classification method. Say we want to predict the probability that a customer will default on a loan, using information on historic transactions, demographics, etc. A logistic regression models the odds of defaulting as an exponential function of the predictors (or, equivalently, the log-odds are written as a linear function of the predictors). The interpretation of coefficients in a logistic model is usually in terms of odds (e.g., "the odds of defaulting for single customers are on average 1.5 times those of married customers, all else equal").

A frequent terminological error when it comes to odds: Sometimes odds are referred to as "odds ratios". This is a mistake that probably comes from the fact that odds are a ratio (of probabilities). But in fact, an odds ratio is a ratio of odds. These are used to compare the odds of two groups. For example, if we compare the loan defaulting odds of males and females (e.g., via a "Gender" predictor in the logistic regression), then we have an odds ratio.
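The difference between odds and an odds ratio is easy to see numerically. In the sketch below the logistic coefficients (`b0`, `b1`) and the single/married dummy are made-up values, not estimates from real loan data. For a 0/1 predictor, the exponent of its coefficient equals the odds ratio between the two groups.

```python
import math

def odds(p):
    """Odds p/(1-p) for a probability p."""
    return p / (1 - p)

def odds_ratio(p_a, p_b):
    """Ratio of the odds of group A to the odds of group B."""
    return odds(p_a) / odds(p_b)

def logistic_prob(b0, b1, x):
    """P(default) when log-odds = b0 + b1*x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

# Made-up coefficients; x = 1 for single, x = 0 for married.
b0, b1 = -2.0, 0.4
p_married = logistic_prob(b0, b1, 0)
p_single = logistic_prob(b0, b1, 1)

print(odds(p_single), odds(p_married))   # two odds
print(odds_ratio(p_single, p_married))   # one odds ratio
print(math.exp(b1))                      # same value as the odds ratio
```

Note that the ratio of the two probabilities, `p_single / p_married`, is a different number from the odds ratio, which is one more reason to keep the terms apart.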

Does anyone know of a language that does have the term "odds"?

## Tuesday, February 07, 2006

### The "G" word

I use "G Shmueli" in my slides and in my email signature. This is not about that "G".

It usually surprises students when I say that most of the data analysis time should be spent on data exploration rather than modeling. Whether it is for the sake of statistical testing, prediction of new records, or finding a model that helps understand the data structure, the most useful tools are GRAPHS and summaries. Data visualization is so important that in a sense, the models that follow will usually only confirm what we see.

A few points:
1. Good visualization tools are those that have high-quality graphics, are interactive, user-friendly, and can integrate many pieces of information. Excel is an example of a very low-level tool. Its graphs are usually very bad and require a lot of formatting (who needs a graph with gray background and horizontal lines???). A terrific tool which I discovered a few years ago is Spotfire. It is an interactive visualization tool that allows the user to browse the data from multiple points of view, using color, shape, size and more to visualize multidimensional data. When I show this tool, the class usually hisses "wowwwwwww".

2. Even when we're talking about huge datasets, visualization is still very useful. Of course if you try to create a scatterplot of income vs. age for a 1,000,000 customer database your screen will be black and perhaps your computer will freeze. The way to go is to sample from the database. A good random sample will give an adequate picture. You can also take a few other samples to verify that what you are seeing is consistent.

3. When deciding which plots to create, think about the goal of the analysis. For example, if we are trying to classify customers as buyers/non-buyers, we'd be interested in plots that compare the buyers to the non-buyers.
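The sampling idea in point 2 can be sketched as follows. The "database" here is simulated (200,000 fake customers stand in for the million), and the sample size of 2,000 is an arbitrary choice of mine. The sampled rows are what you would hand to your plotting tool.

```python
import random

def random_sample(rows, k, seed):
    """Draw k rows without replacement, reproducibly."""
    return random.Random(seed).sample(rows, k)

def mean_income(rows):
    return sum(income for _, income in rows) / len(rows)

# Simulated stand-in for a large customer table: (age, income) pairs.
data_rng = random.Random(0)
customers = [
    (18 + 62 * data_rng.random(), 20000 + 100000 * data_rng.random())
    for _ in range(200_000)
]

# Two independent samples; plot either one instead of the full table.
sample_a = random_sample(customers, 2000, seed=1)
sample_b = random_sample(customers, 2000, seed=2)

# Quick consistency check: the samples should tell a similar story.
print(round(mean_income(sample_a)), round(mean_income(sample_b)))
```

If the two samples give noticeably different pictures, that is a sign the sample is too small for what you are trying to see.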

## Thursday, February 02, 2006

### What is Bzst?

Statistics in Business. That's what it's all about. And BusinessWeek just revealed our real secret - "Statistics is becoming core skills for businesspeople and consumers... Winners will know how to use statistics - and how to spot when others are dissembling" (Why Math Will Rock Your World, 1/23/2006)

So I no longer need to arch my shoulders and shrink when asked "what do you teach?"

I've been teaching statistics for more than a decade now. Until 2002 I taught mainly engineering students. And then it was called "statistics". Then, I moved to the Robert H Smith School of Business, and started teaching "data analysis". And now it's "data mining", "business analytics", "business intelligence", and anything that will keep the fear level down.

But in truth, the use of statistical thinking in business is exciting, fruitful, and extremely powerful. Our MBA elective class "Data Analysis for Decision Makers" has grown to parallel sessions, wait-lists, and some very happy MBAs. The reason is simple: the statistical thinking and toolkit is a necessity for excelling in business analytics.

I plan to post on a variety of issues that relate to statistics in business, teaching statistics and data mining, and more. You are all welcome to post replies!