Tuesday, September 26, 2006

Cheating in MBA programs

First, Noah Kauffman, an ex-MBA student of mine, emailed me the story; then I found it in BusinessWeek, and a quick search brought up the story in many news sources, university websites, and magazines. Each had a different title. Here are some examples:
MBA Students Are No. 1 - At Cheating (BusinessWeek, Oct 2 issue, page 14)
A Crooked Path Through B-School (BusinessWeek Online)
Study: Majority of Graduate Business Students Admit to Cheating (Penn State's Smeal School of Business News and Media Resources)
MBA Students Likelier to Cheat (Toronto Star)
National survey: MBA cheating prevalent (The Cavalier Daily)

All sources report the following about the study:

Linda Treviño, Franklin H. Cook Fellow in Business Ethics at Penn State's Smeal College, and her colleagues Donald McCabe of Rutgers and Kenneth Butterfield of Washington State examined survey results from 5,331 students at 32 graduate schools in Canada and the United States. They found that 56 percent of graduate business school students admitted to cheating one or more times in the past academic year compared to 47 percent of non-business students.

All of the articles that I found discuss the reasons for the cheating and possible solutions to it. But none describe more details about the study itself. So let's look at the numbers and at what the research question is. First, we learn that the study compared the rate of business to non-business students who admit to cheating. The sample estimates were 56% for MBAs vs. 47% for non-MBAs. Does this sample difference generalize to the entire population of graduate students? Can we say that in general MBAs cheat more than other grad students? To find out, we need to know the breakdown of the sample (n=5331) into business and non-business students. Since I couldn't find it, let's go in the reverse direction -- what type of breakdown would lead us to believe in a real difference between the proportion of MBA vs. other grad cheaters?

You might recall from Stat101 a procedure for comparing proportions from two independent samples. To use this, we must assume that the MBAs and other grads consist of two independent samples (e.g., there were no MBAs who were also studying towards a different graduate degree). In that case, we take the difference between the sample proportions, 0.56-0.47=0.09, and see how far it is from zero, in standard errors. To compute the standard error we use the formula:
standard error = square-root{ p (1-p) (1/n1 + 1/n2) }

where n1 is the sample size of MBAs, n2 is the sample size of non-MBAs, and p is the weighted average of 0.47 and 0.56, weighted by the corresponding sample sizes. Since we don't know n1 or n2, I tried different values (remember that n1+n2=5331, so I only have to set n1). Here is what I get:

If the samples are relatively balanced (e.g., n1=2600 and n2=2731), then the distance between the MBA and non-MBA proportion of cheaters is more than 6 standard errors! This is a pretty compelling distance that supports the study's claim. If, on the other hand, the samples are very imbalanced, then we can get the opposite result. For example, if the MBA sample had n1=100 students and the non-MBA sample had n2=5231 students, then the difference between 47% and 56% is less than 2 standard errors, which might be considered too weak evidence.
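The calculation above is easy to reproduce. Here is a short sketch in Python (not part of the original post); the two sample-size splits are the hypothetical ones tried above, since the study does not report the actual breakdown:

```python
from math import sqrt

def z_statistic(p1, n1, p2, n2):
    """Two-sample z statistic for a difference in proportions,
    using the pooled (weighted-average) proportion in the standard error."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)          # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))   # standard error of the difference
    return (p1 - p2) / se

# Balanced split: the 9-point gap is more than 6 standard errors from zero.
print(round(z_statistic(0.56, 2600, 0.47, 2731), 2))  # ≈ 6.57

# Very imbalanced split: the same gap is under 2 standard errors.
print(round(z_statistic(0.56, 100, 0.47, 5231), 2))   # ≈ 1.79
```

Trying other values of n1 (with n2 = 5331 - n1) shows how strongly the conclusion depends on the unreported breakdown.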

The bottom line is that we really want to know more about the numbers from the study. Besides the breakdown of MBA and non-MBA samples, what was the response rate to the survey? Did all 5331 students reply? How were the samples drawn from the population of b-schools and other graduate programs? etc.

I guess we'll have to wait for the article, entitled "Academic dishonesty in graduate business programs: Prevalence, Causes and Proposed Action", which will be published in a forthcoming issue of the Academy of Management Learning and Education.

Thursday, September 21, 2006

Dylan on data exploration

The ease of use of many data analysis and data mining software packages has led to the dangerous tendency to jump to the model fitting stage without proper data exploration. Getting an initial understanding of the data via summarization and visualization is crucial for building good models.

Mike Melcer, a current MBA student in my data mining class, mentioned that Bob Dylan knew this well. He sings You don't need a weatherman to know which way the wind blows (from Subterranean Homesick Blues). The weatherman can, however, quantify the speed of the wind and the temperature. In other words, the modeling phase is there to formalize and quantify what you learn in the data exploration phase. But you do have to stick your head out of the window first.

Tuesday, September 12, 2006

What are decision trees?

The term "decision tree" has been used in two very different contexts, which causes some confusion. In the context of decision sciences (or decision making), it means a tree structure that assists in decision making by mapping the different courses of action and assigning costs and probabilities to the different scenarios. There is a good description on the MindTools website.

In contrast, "decision trees" are also a popular name for classification trees (or regression trees), a data mining method for predicting an outcome from a set of predictor variables (see, for example, the description on Resample.com). Two well-known types of classification tree algorithms are CART (implemented in software such as CART, SAS Enterprise Miner, and the Excel add-on XLMiner) and C4.5 (implemented in SPSS). An alternative algorithm, which is more statistically oriented and widely used in marketing, is CHAID (implemented in multiple software packages).

Both types of decision trees are very useful tools in business applications and decision making. Both use a tree structure and can generate rules. Otherwise, though, they are quite different in what they are used for and how they operate. The decision-sciences decision tree relies on an expert to map out the scenarios and assess the costs and probabilities of events. In contrast, the data-mining decision tree uses a large database of historical data to come up with rules that relate an outcome of interest to a set of predictor variables.
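To illustrate the data-mining flavor, here is a minimal sketch using the scikit-learn library (an illustration only; this library and the toy data are not from the post, and the software packages named above each have their own interfaces). The fitted tree turns historical records into if-then rules:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical historical data: predictors are [age, income (in $K)],
# the outcome is whether the customer bought the product.
X = [[25, 30], [35, 80], [45, 90], [22, 20], [50, 60], [30, 40]]
y = ["no", "yes", "yes", "no", "yes", "no"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The fitted tree can be printed as if-then rules that relate
# the predictors to the outcome.
print(export_text(tree, feature_names=["age", "income"]))
```

No expert assigns the probabilities here; the splits are learned from the data, which is exactly the contrast drawn above.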

To see how much confusion the use of the same term for the two tools causes, check out the definition of Decision Tree on Wikipedia. The first paragraph refers to decision theory, while all the rest describes the data mining version... So next time decision trees are mentioned, make sure to find out which tool is being talked about!

Wednesday, August 16, 2006

Webcast on Aug 17: Teaching Analytics in the B-School

Everything you wanted to know about teaching data mining at the B-school!

On Aug 17 at 13:00 my colleague Ravi Bapna and I will be hosted on a SAS webcast on teaching analytics in the business school. We will discuss the skills that students in such courses acquire and the growing market demand for them; teaching approaches and how to go about teaching such a course; how it ties to research and corporate involvement; and more.

To view and participate in the webcast, you can register at http://www.sas.com/govedu/events/112592/index.html . The webcast will also be archived and freely available later on the SAS website.

Sunday, July 30, 2006

Summer break

Please hold your breath for a little longer until I regain full speed and continue posting to Bzst. Even statisticians need a break! In the meantime, I'll just report that:

1. The evening data-mining class for MBAs that I will be teaching in the fall is almost full.

2. The textbook "Data Mining for Business Intelligence" that I co-authored is in press. But you can get a sneak preview at www.dataminingbook.com.

Sunday, May 21, 2006

Data Mining for Business Applications Workshop

The upcoming International Conference on Knowledge Discovery and Data Mining (KDD) conference (August in Philadelphia) will feature a workshop on "Data Mining for Business Applications". The goals of the workshop are stated as:

1. Bring together researchers (from both academia and industry) as well as practitioners from different fields to talk about their different perspectives and to share their latest problems and ideas.
2. Attract business professionals who have access to interesting sources of data and business problems but not the expertise in data mining to solve them effectively.

I love attending KDD - it is a fun conference with lots of interesting talks and posters, which attracts both industry people and academics from artificial intelligence/machine learning, as well as a few statisticians (the cool ones, of course). Aside from the main conference there is a variety of workshops and tutorials. This conference has a competitive acceptance rate for papers, which guarantees high quality.

See you in Philly!

Bridging academia and industry

The latest AMSTAT NEWS, the monthly magazine of the American Statistical Association, has an interesting article by Bonnie Ray, a statistician at IBM Watson Research Center. She describes the wealth of activities (sections, conferences, etc.) by the sister organization INFORMS that are aimed at bringing together academics with industry professionals. In particular, she mentions the huge gap in the field of business and the burning need for quantitative and "statistically literate" experts in businesses.

I believe that one GREAT resource is the MBA program. Students who take (in addition to a core statistics course) a hands-on, business-oriented data mining/analysis course have a big advantage: they have not only learned and tried out some analysis, but they are also well versed in the business world and in their field of concentration (marketing, finance, etc.). Some of my top students would be an incredible asset to any company.

It is prime time for the statistics community to embrace MBA programs and not only teach statistics, but also learn more about its use, challenges, and real applications in the business context.