In this week's issue of BusinessWeek (March 6, 2006), an article called "The secret to Google's success" describes a study by three economists showing that Google's mechanism for auctioning ad space (called AdWords), which is supposed to be a second-price auction, actually "differs in a key respect from the one economists had studied".
I tracked down a report on this study ("The high price of internet keyword auctions" by Edelman, Ostrovsky, and Schwarz) to find out more. And I found out something that is directly related to our work on eBay auctions...
Starting from the basics, a second-price auction is one where the highest bidder is the winner, and s/he pays the second highest bid (+ a small increment). This is also the format used in most of eBay's auctions. According to auction theory (derived from game theory), in a second-price auction the optimal bidding strategy should be to bid your true valuation. If you think the item is worth $100, just bid $100. Going back to the study, the economists found that the mechanism used by Google's AdWords does NOT lead to this "truth telling". Instead, sophisticated users actually tend to under-bid.
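To make the mechanism concrete, here is a minimal sketch of how the winner and the price are determined in a second-price, eBay-style auction. The bids and the $0.50 increment are made up for illustration:

```python
# A minimal sketch of a second-price auction outcome (hypothetical bid data).
bids = {"alice": 100.0, "bob": 87.5, "carol": 92.0}  # bidder -> maximum bid ($)
increment = 0.50  # small bid increment, as on eBay (assumed value)

# Rank bidders by their bids, highest first
ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
winner, highest_bid = ranked[0]
second_highest = ranked[1][1]

# The winner pays the second-highest bid plus the increment (never more than her own bid)
price = min(second_highest + increment, highest_bid)
print(f"{winner} wins and pays ${price:.2f}")  # alice wins and pays $92.50
```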
The authors' recommendation is "that search engines consider adopting a true Vickrey setup... [where] the system and bids would remain relatively static, changing only when economic fundamentals changed". And this is where I disagree: I have been conducting empirical research of online auctions from a different, non-economist perspective. Instead of starting from economic theory and trying to see how it manifests in the online setting, I examine the online setting and try to characterize it using statistical tools. An important hypothesis that my colleague Wolfgang Jank and I have is that the auction price is influenced not only by factors that economic theory sets (like the opening price and number of bidders), but also by the dynamics that take place during the auction. The online environment has very different dynamics than the older offline version. Think of the psychology that goes on when you are bidding for an item on eBay. This means that perhaps classic auction theory does not account for new factors that might determine the final price. Of course, this claim has always won us some frowns from hard-core economists...
For example, many empirical researchers have found in eBay (second-price) auctions that bidders do not follow the "optimal bidding strategy" of bidding their truthful valuation. In fact, on eBay many bidders tend to revise their bids as the auction proceeds (the phenomenon of last-moment-bidding, or "sniping", is also related). There have been different attempts to explain this through economic theory, but there hasn't been one compelling answer.
In light of my eBay research, it appears to me that the recommendation that Google adopt the ordinary second-price (Vickrey) setup does not take into account the dynamic nature of the AdWords auctions. The continuous updating of bids by advertisers probably creates dynamics of its own. So even if Google does change to the eBay-like format, I doubt that the results will obey classic auction theory.
Monday, February 27, 2006
Thursday, February 23, 2006
Acronyms - in Hebrew???
There are a multitude of performance measures in statistics and data mining. These tend to have acronyms such as MAPE and RMSE. It turns out that even after spelling them out, it is not always obvious to users how they are computed.
Inspired by Dan Brown's The Da Vinci Code, I devised a deciphering method that allows simple computation of these measures. The trick is to read from right-to-left (like Hebrew or Arabic). Here are two examples, with a short code sketch after them:
RMSE = Root Mean Squared Error
1. Error: compute the errors (actual value - predicted value)
2. Squared: take a square of each error
3. Mean: take an average of all the squared errors
4. Root: take a square root of the above mean
MAPE = Mean Absolute Percentage Error
1. Error: compute the errors (actual value - predicted value)
2. Percentage: turn each error into a percentage by dividing by the actual value and multiplying by 100%
3. Absolute: take an absolute value of the percentage errors
4. Mean: take an average of the absolute values
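Here is a minimal sketch, with made-up actual and predicted values, that computes both measures by following the right-to-left steps above:

```python
import numpy as np

# Hypothetical actual and predicted values
actual = np.array([200.0, 150.0, 320.0, 275.0])
predicted = np.array([210.0, 140.0, 300.0, 280.0])

# RMSE, reading right to left: Error -> Squared -> Mean -> Root
errors = actual - predicted           # Error
rmse = np.sqrt(np.mean(errors ** 2))  # Squared, Mean, Root

# MAPE, reading right to left: Error -> Percentage -> Absolute -> Mean
pct_errors = errors / actual * 100    # Percentage
mape = np.mean(np.abs(pct_errors))    # Absolute, Mean

print(f"RMSE = {rmse:.2f}, MAPE = {mape:.2f}%")
```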
Wednesday, February 15, 2006
Comparing models with transformations
In the process of searching for a good model, a popular step is to try different transformations of the variables. This can become a bit tricky when we are transforming the response variable, Y.
Consider, for instance, two very simple models for predicting home sales. Let's assume that in both cases we use predictors such as the home's attributes, geographical location, market conditions, time of year, etc. The only difference is that the first model is linear:
(1) SalesPrice = b0 + b1 X1 + ...
whereas the second model is exponential:
(2) SalesPrice = exp{c0 + c1 X1 + ...}
The exponential model can also be written as a linear model by taking the natural log of both sides of the equation:
(2*) log(SalesPrice) = c0 + c1 X1 + ...
Now, let's compare models (1) and (2*). Let's assume that the goal is to achieve a good explanatory model of house prices for this population. Then, after fitting a regression model, we might look at measures such as the R-squared, the standard-error-of-estimate, or even the model residuals. HOWEVER, you will most likely find that model (2*) has a much lower error!
Why? This happens because we are comparing objects that are on two different scales. Model (1) yields errors in $ units (assuming that the original data are in $), whereas model (2*) yields residuals in log($) units. A similar distortion will occur if we compare predictive accuracy using measures such as RMSE or MAPE. Standard software output will usually not warn you about this, especially if you created the transformed variable yourself.
So what to do? Compute the predictions of model (2*), then transform them back to the original units. In the above example, we'd exponentiate the prediction to obtain a $-valued prediction. Then, compute residuals by comparing the re-scaled predictions to the actual y-values. These will be comparable to a model with no transformation. You can then compare the re-scaled predictions with those from model (1), or from any other model, all on the original scale.
The unfortunate part is that you'll probably have to compute all the goodness-of-fit or predictive accuracy measures yourself, using the re-scaled residuals. But that's usually not too hard.
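Here is a minimal sketch of the back-transformation idea, using synthetic data and scikit-learn; the data-generating process and variable names are purely illustrative. (Strictly speaking, exponentiating a log-scale prediction gives a slightly biased estimate of the mean price, but it is enough to put both models on the same $ scale.)

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Hypothetical data: house size (X) and sale price (y), generated for illustration
X = rng.uniform(50, 300, size=(200, 1))                       # size in square meters
y = np.exp(11 + 0.006 * X[:, 0] + rng.normal(0, 0.2, 200))    # prices in $

# Model (1): linear model on the original $ scale
model1 = LinearRegression().fit(X, y)
pred1 = model1.predict(X)

# Model (2*): linear model on log(price)
model2 = LinearRegression().fit(X, np.log(y))
pred2_log = model2.predict(X)   # predictions in log($) units -- not comparable yet
pred2 = np.exp(pred2_log)       # back-transform to $ units

# Now both RMSEs are in $ units and can be compared fairly
rmse1 = np.sqrt(mean_squared_error(y, pred1))
rmse2 = np.sqrt(mean_squared_error(y, pred2))
print(f"RMSE model (1): ${rmse1:,.0f}   RMSE model (2*), back-transformed: ${rmse2:,.0f}")
```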
Tuesday, February 14, 2006
Data partitioning
A central initial step in data mining is to partition the data into two or three parts. The first partition is called the training set, the second is the validation set, and if there is a third, it is usually called the test set.
The purpose of data partitioning is to enable evaluating model predictive performance. In contrast to an explanatory goal, where we want to fit the data as closely as possible, good predictive models are those that have high predictive accuracy. Now, if we fit a model to data, then obviously the "tighter" the model, the better it will predict those data. But what about new data? How well will the model predict those?
Predictive models are different from explanatory models in various aspects. But let's only focus on performance evaluation here. Indications of good model fit are usually high R-squared values, low standard-error-of-estimate, etc. These do not measure predictive accuracy.
So how does partitioning help measure predictive performance? The training set is first used to fit a model (also called "training the model"). The validation set is then used to evaluate model performance on new data that it did not "see". At this stage we compare the model's predictions for the new validation data to the actual values and use different metrics to quantify predictive accuracy.
Sometimes, we actually use the validation set to tweak the original model. In other words, after seeing how the model performed on the validation data, we might go back and change the model. In that case we are "using" our validation data, and the model is no longer blind to them. This is when a third, test set, comes in handy. The final evaluation of predictive performance is then achieved by applying the model (which is based on the training data and tweaked using the validation data) to the test data that it never "saw".
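One common way to create such partitions is sketched below with scikit-learn; the 60%/20%/20% split and the toy data are assumptions for illustration, not a rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 1,000 records, 5 predictors, one outcome
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# First split off the training set (60%), then split the remainder
# equally into validation (20%) and test (20%) sets
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.6, random_state=1)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

print(len(X_train), len(X_valid), len(X_test))  # 600 200 200
```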
Thursday, February 09, 2006
Translate "odds"
Odds are a technical term often used in horse or car racing. The odds refer to the ratio p/(1-p), where p is the probability of success. So, for instance, 1:3 odds of winning are equivalent to a probability of 0.25 of winning.
What I found odd is that the term "odds" in this sense does not exist in most languages! Usually, the closest you can get is "probability" or "chance". I first realized this when I tried to translate the term into Hebrew. Then, students who speak other languages (Spanish, Russian, Chinese) said the same is true in their languages as well.
Odds are important in data mining because they are the basis of logistic regression, a very popular classification method. Say we want to predict the probability that a customer will default on a loan, using information on historic transactions, demographics, etc. A logistic regression models the odds of defaulting as an exponential function of the predictors (or, equivalently, the log-odds are written as a linear function of the predictors). The interpretation of coefficients in a logistic model is usually in terms of odds (e.g., "single customers have, on average, 1.5 times the odds of defaulting compared to married customers, all else equal").
A frequent terminological error when it comes to odds: Sometimes odds are referred to as "odds ratios". This is a mistake that probably comes from the fact that odds are a ratio (of probabilities). But in fact, an odds ratio is a ratio of odds. These are used to compare the odds of two groups. For example, if we compare the loan defaulting odds of males and females (e.g., via a "Gender" predictor in the logistic regression), then we have an odds ratio.
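A minimal sketch of these definitions, using a hypothetical logistic regression coefficient, might look like this:

```python
import numpy as np

def odds(p):
    """Odds corresponding to a probability of success p."""
    return p / (1 - p)

# Odds vs. probability: 1:3 odds of winning correspond to p = 0.25
print(odds(0.25))   # 0.333... i.e., 1:3

# In logistic regression, the log-odds are linear in the predictors:
#   log(odds) = b0 + b1 * x1 + ...
# so exp(b1) is an odds ratio: the multiplicative change in the odds
# for a one-unit increase in x1 (e.g., a Gender predictor coded 0/1).
b_gender = 0.8                # hypothetical coefficient, for illustration only
odds_ratio = np.exp(b_gender)
print(odds_ratio)             # ~2.23: odds of the group coded 1 relative to the group coded 0
```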
Does anyone know of a language that does have the term "odds"?
Tuesday, February 07, 2006
The "G" word
I use "G Shmueli" in my slides and in my email signature. This is not about that "G".
It usually surprises students when I say that most of the data analysis effort should be spent on data exploration rather than modeling. Whether the goal is statistical testing, prediction of new records, or finding a model that helps understand the data structure, the most useful tools are GRAPHS and summaries. Data visualization is so important that, in a sense, the models that follow will usually only confirm what we see.
A few points:
1. Good visualization tools are those that have high-quality graphics, are interactive, are user-friendly, and can integrate many pieces of information. Excel is an example of a very low-level tool. Its graphs are usually very bad and require a lot of formatting (who needs a graph with a gray background and horizontal lines???). A terrific tool that I discovered a few years ago is Spotfire. It is an interactive visualization tool that allows the user to browse the data from multiple points of view, using color, shape, size, and more to visualize multidimensional data. When I show this tool, the class usually hisses "wowwwwwww".
2. Even when we're talking about huge datasets, visualization is still very useful. Of course, if you try to create a scatterplot of income vs. age for a 1,000,000-customer database, your screen will be black and perhaps your computer will freeze. The way to go is to sample from the database; a good random sample will give an adequate picture (see the sketch after this list). You can also take a few other samples to verify that what you are seeing is consistent.
3. When deciding which plots to create, think about the goal of the analysis. For example, if we are trying to classify customers as buyers/non-buyers, we'd be interested in plots that compare the buyers to the non-buyers.
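Here is the sampling idea from point 2 as a minimal sketch, using pandas and matplotlib on a made-up one-million-row customer table:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical customer table with 1,000,000 rows
rng = np.random.default_rng(2)
customers = pd.DataFrame({
    "age": rng.integers(18, 90, size=1_000_000),
    "income": rng.lognormal(mean=10.5, sigma=0.5, size=1_000_000),
})

# Plot a random sample instead of all rows; repeat with other seeds to check consistency
sample = customers.sample(n=5000, random_state=0)
sample.plot.scatter(x="age", y="income", alpha=0.3, s=5)
plt.title("Income vs. age (random sample of 5,000 customers)")
plt.show()
```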
Thursday, February 02, 2006
What is Bzst?
Statistics in Business. That's what it's all about. And BusinessWeek just revealed our real secret - "Statistics is becoming core skills for businesspeople and consumers... Winners will know how to use statistics - and how to spot when others are dissembling" (Why Math Will Rock Your World, 1/23/2006)
So I no longer need to arch my shoulders and shrink when asked "what do you teach?"
I've been teaching statistics for more than a decade now. Until 2002 I taught mainly engineering students. And then it was called "statistics". Then, I moved to the Robert H Smith School of Business, and started teaching "data analysis". And now it's "data mining", "business analytics", "business intelligence", and anything that will keep the fear level down.
But in truth, the use of statistical thinking in business is exciting, fruitful, and extremely powerful. Our MBA elective class "Data Analysis for Decision Makers" has grown to parallel sessions, wait-lists, and some very happy MBAs. The reason is simple: statistical thinking and the statistical toolkit are a necessity for excelling in business analytics.
I plan to post on a variety of issues that relate to statistics in business, teaching statistics and data mining, and more. You are all welcome to post replies!