Tuesday, September 26, 2006

Cheating in MBA programs

First, Noah Kauffman, a former MBA student of mine, emailed me the story. Then I found it in BusinessWeek, and a quick search brought it up in many news sources, university websites, and magazines. Each had a different title. Here are some examples:
MBA Students Are No. 1 - At Cheating (BusinessWeek, Oct 2 issue, page 14)
A Crooked Path Through B-School (BusinessWeek Online)
Study: Majority of Graduate Business Students Admit to Cheating (Penn State's Smeal College of Business News and Media Resources)
MBA Students Likelier to Cheat (Toronto Star)
National survey: MBA cheating prevalent (The Cavalier Daily)

All sources report the following about the study:

Linda Treviño, Franklin H. Cook Fellow in Business Ethics at Penn State's Smeal College, and her colleagues Donald McCabe of Rutgers and Kenneth Butterfield of Washington State examined survey results from 5,331 students at 32 graduate schools in Canada and the United States. They found that 56 percent of graduate business school students admitted to cheating one or more times in the past academic year compared to 47 percent of non-business students.

All of the articles that I found discuss the reasons for the cheating and possible solutions to it. But none gives more details about the study itself. So let's look at the numbers and at what the research question is. First, we learn that the study compared the proportions of business and non-business students who admitted to cheating. The sample estimates were 56% for MBAs vs. 47% for non-MBAs. Does this sample difference generalize to the entire population of graduate students? Can we say that in general MBAs cheat more than other grad students? To find out, we need to know the breakdown of the sample (n=5331) into business and non-business students. Since I couldn't find it, let's go in the reverse direction -- what type of breakdown would lead us to believe in a real difference between the proportions of MBA and other grad cheaters?

You might recall from Stat101 a procedure for comparing proportions from two independent samples. To use this, we must assume that the MBAs and other grads are two independent samples (e.g., there were no MBAs who were also studying towards a different graduate degree). In that case, we take the difference between the sample proportions, 0.56-0.47=0.09, and see how far it is from zero, in standard errors. To compute the standard error we use the formula:
standard error = square-root{ p (1-p) (1/n1 + 1/n2) }

where n1 is the sample size of MBAs, n2 is the sample size of non-MBAs, and p is the weighted average of 0.47 and 0.56, weighted by the corresponding sample sizes. Since we don't know n1 or n2, I tried different values (remember that n1+n2=5331, so I only have to set n1). Here is what I get:

If the samples are relatively balanced (e.g., n1=2600 and n2=2731), then the distance between the MBA and non-MBA proportions of cheaters is more than 6 standard errors! This is a pretty compelling distance that supports the study's claim. If, on the other hand, the samples are very imbalanced, then we can reach the opposite conclusion. For example, if the MBA sample had n1=100 students and the non-MBA sample had n2=5231 students, then the difference between 47% and 56% is less than 2 standard errors, which might be considered too weak as evidence.
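
For readers who want to try other breakdowns, here is a minimal Python sketch of the calculation described above. It uses the pooled standard error formula from this post; the function name and the particular values of n1 in the loop are only illustrative choices, not numbers from the study.

```python
from math import sqrt

def z_for_split(n1, n_total=5331, p1=0.56, p2=0.47):
    """Distance between the two sample proportions, in standard errors.

    n1 is the (unknown, assumed) number of MBAs; the rest are non-MBAs.
    """
    n2 = n_total - n1
    # Pooled proportion: average of p1 and p2, weighted by sample size.
    p = (n1 * p1 + n2 * p2) / n_total
    # Standard error = sqrt{ p (1-p) (1/n1 + 1/n2) }
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Try a few hypothetical breakdowns of the 5331 respondents.
for n1 in (100, 500, 1000, 2600):
    print(f"n1={n1:5d}  n2={5331 - n1:5d}  z={z_for_split(n1):.2f}")
```

With n1=2600 the distance comes out at roughly 6.6 standard errors, and with n1=100 at roughly 1.8, which matches the two cases discussed above.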

The bottom line is that we really want to know more about the numbers from the study. Besides the breakdown of MBA and non-MBA samples, what was the response rate to the survey? Did all 5331 students reply? How were the samples drawn from the population of b-schools and other graduate programs? etc.

I guess we'll have to wait for the article, entitled "Academic dishonesty in graduate business programs: Prevalence, Causes and Proposed Action", which will be published in a forthcoming issue of the Academy of Management Learning and Education.

Thursday, September 21, 2006

Dylan on data exploration

The ease of use of many data analysis and data mining software packages has led to the dangerous tendency to jump to the model fitting stage without proper data exploration. Getting an initial understanding of the data via summarization and visualization is crucial for building good models.

Mike Melcer, a current MBA student in my data mining class, mentioned that Bob Dylan knew this well. He sings "You don't need a weatherman to know which way the wind blows" (from Subterranean Homesick Blues). The weatherman can, however, quantify the speed of the wind and the temperature. In other words, the modeling phase is there to formalize and quantify what you learn in the data exploration phase. But you do have to stick your head out of the window first.

Tuesday, September 12, 2006

What are decision trees?

The term "decision tree" has been used in two very different contexts, which causes some confusion. In the context of decision sciences (or decision making), it means a tree structure that assists in decision making by mapping the different courses of action and assigning costs and probabilities to the different scenarios. There is a good description on the MindTools website.
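
To make this concrete, here is a small Python sketch of a one-level decision-sciences tree. Everything in it is hypothetical: the two actions, the probabilities, and the payoffs are made-up numbers of the kind an expert would supply, and picking the action with the highest expected payoff is just one common way to evaluate such a tree.

```python
# Hypothetical decision: each action maps to its possible scenarios,
# given as (probability, payoff) pairs supplied by an expert.
decision = {
    "launch": [
        (0.6, 500_000),    # strong demand
        (0.4, -300_000),   # weak demand
    ],
    "do not launch": [
        (1.0, 0),
    ],
}

def expected_value(scenarios):
    """Probability-weighted average payoff of one course of action."""
    return sum(prob * payoff for prob, payoff in scenarios)

def best_action(tree):
    """Return the action with the highest expected payoff, plus all values."""
    values = {action: expected_value(sc) for action, sc in tree.items()}
    return max(values, key=values.get), values

action, values = best_action(decision)
print(action, values)
```

Note that no data set is involved here: the tree and its numbers come entirely from the person building it.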

In contrast, "decision trees" are also a popular name for classification trees (or regression trees), a data mining method for predicting an outcome from a set of predictor variables (see, for example, the description on Resample.com). Two well-known types of classification tree algorithms are CART (implemented in software such as CART, SAS Enterprise Miner, and the Excel add-on XLMiner) and C4.5 (implemented in SPSS). An alternative algorithm, which is more statistically oriented and widely used in marketing, is CHAID (implemented in multiple software packages).
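
As a rough illustration of the data-mining flavor, the Python sketch below learns a single split (a one-level tree, sometimes called a stump) from a tiny made-up data set. This is not CART, C4.5, or CHAID: those algorithms use other splitting criteria and keep splitting recursively. The records, the variable meanings, and the simple count of misclassified records used to choose the split are all assumptions for the sake of the example.

```python
# Hypothetical historic records: (income in $000s, bought the product? 1/0)
records = [(25, 0), (32, 0), (41, 1), (48, 0),
           (55, 0), (63, 1), (70, 1), (88, 1)]

def misclassified(data, threshold):
    """Count errors of the rule: predict 1 if income > threshold, else 0."""
    return sum((x > threshold) != bool(y) for x, y in data)

def best_split(data):
    """Pick the threshold that misclassifies the fewest records."""
    values = sorted({x for x, _ in data})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: misclassified(data, t))

threshold = best_split(records)
print(f"Rule: predict a purchase if income > {threshold}")
print("Misclassified records:", misclassified(records, threshold))
```

Here the rule is derived from the records themselves rather than specified by an expert.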

Both types of decision trees are tools that are very useful in business applications and decision making. They both use a tree structure and can generate rules. But otherwise, they are quite different in what they are used for and how they operate. The decision-sciences decision tree relies on the expert to build the scenarios and to assess the costs and probabilities of events. In contrast, the data-mining decision tree uses a large database of historic data to come up with rules that relate an outcome of interest to a set of predictor variables.

To see how much confusion the use of the same term for the two tools causes, check out the definition of Decision Tree in Wikipedia. The first paragraph refers to decision theory, while all the rest is the data mining version... So next time decision trees are mentioned, make sure to find out which tool is being talked about!