Tuesday, September 15, 2009

Interpreting log-transformed variables in linear regression

Statisticians love variable transformations. Log-em, square-em, square-root-em, or even use the all-encompassing Box-Cox transformation, and voilà: you get variables that are "better behaved". Good behavior to statistician parents means things like kids with normal behavior (=normally distributed) and stable variance. Transformations are often used in order to be able to use popular tools such as linear regression, where the underlying assumptions require "well-behaved" variables.

Moving into the world of business, one transformation is more than just a "statistical technicality": the log transform. It turns out that taking a log function of the inputs (X's) and/or output (Y) variables in linear regression yields meaningful, interpretable relationships (there seems to be a misconception that linear regression is only useful for modeling a linear input-output relationship, but the truth is that the name "linear" describes the linear relationship between Y and the coefficients... very confusing indeed, and the fault of statisticians, of course!). Using log transforms enables modeling a wide range of meaningful, useful, non-linear relationships between inputs and outputs. Using a log-transform moves from unit-based interpretations to percentage-based interpretations.

So let's see how the log-transform works for linear regression interpretations.
Note: I use "log" to denote "log base e" (also known as "ln", or in Excel the function "=LN"). You can do the same with log base 10, but the interpretations are not as slick.

Let's start with a linear relationship between X and Y of the form (ignoring the noise part for simplicity):
Y = a + b X
The interpretation of b is: a unit increase in X is associated with an average of b units increase in Y.

Now, let's assume an exponential relationship of the form: Y = a exp(b X)
If we take logs on both sides we get: log(Y) = c + b X
The interpretation of b is: a unit increase in X is associated with an average of 100b percent increase in Y. This approximate interpretation works well for |b|<0.1. Otherwise, the exact relationship is: a unit increase in X is associated with an average increase of 100(exp(b)-1) percent.
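To see how quickly the approximation drifts from the exact figure, here is a minimal Python sketch. The function names and the list of b values are my own choices for illustration; only the two formulas, 100b and 100(exp(b)-1), come from the text above.

```python
import math

def pct_change_approx(b):
    """Approximate percent change in Y per unit increase in X: 100*b."""
    return 100.0 * b

def pct_change_exact(b):
    """Exact percent change in Y per unit increase in X: 100*(exp(b)-1)."""
    return 100.0 * math.expm1(b)  # expm1(b) computes exp(b)-1 accurately

# Illustrative coefficient values (hypothetical, not from any fitted model)
for b in (0.01, 0.05, 0.1, 0.3, 0.6):
    print(f"b={b:.2f}  approx={pct_change_approx(b):6.2f}%  "
          f"exact={pct_change_exact(b):6.2f}%")
```

For b = 0.05 the two figures differ by about a tenth of a percentage point, while for b = 0.6 the approximation says 60% and the exact figure is about 82%.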

Technical explanation:
Take a derivative of the last equation with respect to X (to denote a small increase in X). You get
1/Y dY/dX = b,  or equivalently,  dY/Y = b dX.
dX means a small increase in X, and dY is the associated increase in Y. The quantity dY/Y is a small proportional increase in Y (so 100 times dY/Y is a small percentage increase in Y). Hence, a small unit increase in X is associated with an average 100b% increase in Y.

Another popular non-linear relationship is a log-relationship of the form: Y = a + b log(X)
Here the (approximate) interpretation of b is: a 1% increase in X is associated with an average b/100 units increase in Y. (Use the same steps in the previous technical explanation to get this result). The approximate interpretation is fairly accurate (the exact interpretation is: a 1% increase in X is associated with an average increase of (b)(log(1.01)) in Y, but log(1.01) is practically 0.01).
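A quick numeric check of this case, as a sketch: the helper names and the example coefficient b = 50 are hypothetical, and the formulas are the b/100 approximation and the exact b*log(1.01) from the paragraph above (generalized to a pct% increase in X).

```python
import math

def unit_change_approx(b, pct=1.0):
    """Approximate change in Y (in units) for a pct% increase in X."""
    return b * pct / 100.0

def unit_change_exact(b, pct=1.0):
    """Exact change in Y (in units) for a pct% increase in X: b*log(1+pct/100)."""
    return b * math.log(1.0 + pct / 100.0)

b = 50.0  # hypothetical coefficient of log(X)
print(unit_change_approx(b))  # approximate unit change for a 1% increase in X
print(unit_change_exact(b))   # exact unit change for a 1% increase in X
```

With b = 50, the approximation gives 0.5 units and the exact value is roughly 0.4975 units, so the two agree closely for a 1% increase. For larger percentage increases in X the gap widens, because log(1+p) moves away from p.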

Finally, another very common relationship in business is completely multiplicative: Y = a X^b. If we take logs here we get log(Y) = c + b log(X).
The approximate interpretation of b is: a 1% increase in X is associated with an average b% increase in Y. Like the exponential model, the approximate interpretation works well for |b|<0.1, and otherwise the exact interpretation is: a 1% increase in X is associated with an average 100*(exp(b log(1.01))-1) percent increase in Y.
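For the log-log case, the exact percentage change for a 1% increase in X can be written as 100*(1.01^b - 1), which is the same quantity as 100*(exp(b log(1.01)) - 1). The sketch below compares it with the "b percent" shortcut; the function names and the sample values of b are assumptions made for illustration.

```python
import math

def loglog_pct_approx(b, pct=1.0):
    """Approximate percent change in Y for a pct% increase in X: b*pct."""
    return b * pct

def loglog_pct_exact(b, pct=1.0):
    """Exact percent change in Y: 100*((1+pct/100)^b - 1)."""
    return 100.0 * ((1.0 + pct / 100.0) ** b - 1.0)

# Illustrative exponents (hypothetical, not from any fitted model)
for b in (0.05, 0.5, 2.0, -1.5):
    print(f"b={b:5.2f}  approx={loglog_pct_approx(b):7.4f}%  "
          f"exact={loglog_pct_exact(b):7.4f}%")
```

For b = 2, a 1% increase in X gives an exact 2.01% increase in Y, against the approximate 2%.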

Finally, note that although I've described a relationship between Y and a single X, all this can be extended to multiple X's. For example, to a multiplicative model such as: Y = a X1^b1 X2^b2 X3^b3.
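With several X's in a multiplicative model, each exponent keeps its percentage interpretation while the other inputs are held fixed. The sketch below illustrates this on a noise-free model with each X raised to its own power; the constant a, the exponents, and the input values are all made up for the example.

```python
import math

def multiplicative_model(x, a, exponents):
    """Noise-free multiplicative model: Y = a * X1^b1 * X2^b2 * ..."""
    return a * math.prod(xi ** bi for xi, bi in zip(x, exponents))

a = 2.0                       # hypothetical constant
exponents = [0.5, -1.2, 0.8]  # hypothetical b1, b2, b3

def pct_change_from_x1(x):
    """Percent change in Y when X1 rises by 1% and the other X's stay fixed."""
    base = multiplicative_model(x, a, exponents)
    bumped = multiplicative_model([x[0] * 1.01] + list(x[1:]), a, exponents)
    return 100.0 * (bumped / base - 1.0)

p1 = pct_change_from_x1([3.0, 4.0, 5.0])
p2 = pct_change_from_x1([10.0, 0.5, 7.0])
print(p1, p2)  # the same percentage at both starting points
```

Both starting points give the same percentage change, 100*(1.01^0.5 - 1), which is close to the approximate b1 = 0.5 percent. The levels of X2 and X3 do not matter.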

Although this stuff is extremely useful, it is not easily found in many textbooks. Hence this post. I did find a good description in the book Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models by Vittinghoff et al. (see the relevant pages in Google books).

17 comments:

Mahin said...

Kindly pay attention to the last part of the write up on LOG-Log multiplicative model.
I suppose there is a mix-up of information here, especially the "b" and "d" usage in the equation and the model specification of the multiplicative model. Should the model be written as X1 to the power "b" and X2 to the power "c" and so on?

Galit Shmueli said...

Thanks for catching this Mahin! I corrected the post.

eric hoo said...

Thank you for the excellent post!

I was wondering how you would interpret the coefficient when it is more than 0.1? I browsed through the book by Vittinghoff too but he did not mention how to deal with this scenario.

Your expert advice on the matter is greatly appreciated.

Galit Shmueli said...

Hi Eric,
As the post says:
"The interpretation of b is: a unit increase in X is associated with an average of 100b percent increase in Y. This approximate interpretation works well for |b|<0.1. Otherwise, the exact relationship is: a unit increase in X is associated with an average increase of 100(exp(b)-1) percent."

eric hoo said...

Sorry, had no idea how I missed that. =)

Anonymous said...

Thanks for the article. Only my right-skewed outcome is log-transformed, the predictor variables are as they are. So, I am interpreting the exponentiated co-efficient of logoutcome as ratio of geometric means of the two predictor groups (eg. female vs. male). I'm multiplying this by 100 to get the outcome for females as percent of outcome for males, on the original scale. Does this sound right? I like your formula too. So, 100(expB-1) would give me an "average percent increase" for each unit increase in predictor, or in switching from female to male (in case of categorical predictors)? Are there any assumptions to be satisfied for this back-transformation to be valid? My outcome does look log-normal and the residuals of the final model appear more or less normal. Thanks very much for your blog.

Galit Shmueli said...

Hi Grace,
It sounds like you're in the case of "a unit increase in X is associated with an average of 100b percent increase in Y."

If your X = Gender (Male/Female), then the coefficient of Gender, multiplied by 100, would be interpreted as the average percentage difference between Y for males and females.

The best way to make sure your interpretations are correct, is to plot the data. In your example, create histograms or bar charts of Y for males and females (with the means marked). Now compare the two and see if your interpretation makes sense. If you had a bunch of other predictors in the regression, try to "control" for them in the charts by looking at subsets.

Anonymous said...

Thanks. By "average difference", do you mean percent change in mean, and more specifically, geometric mean? Since the absolute value of my co-efficients is large (up to 0.6), I'd like to use the more precise formula: 100(expB-1) percent. How does one calculate the confidence interval for this PERCENT CHANGE? Again, by applying the same formula to the logged CI values? Thanks a bunch! (PS: this is biostats that I'm doing, though I'm not a statistician/biostatistician).

Galit Shmueli said...

Yes, it means the "average percentage difference". The percentage difference between women and men can take different values, so we're talking about the average across these values.

If you compute a confidence interval for a beta parameter, then the CI will also need to be interpreted in the same manner. For a 90% CI, you'll say that you are 90% confident that the average percentage difference between women and men is... (all else constant).

The CI calculation is the same (it has nothing to do with which interpretation you use).

Anonymous said...

Ok, a little confused. The 'average percent difference' in the geometric means of the two groups?

Or the 'average percent difference' in the average, i.e. means of the two groups?

Thanks!

Galit Shmueli said...

Think of it this way: Suppose GENDER=1 for men, and 0 for women.
If you plug in GENDER=0 with some set of X values into the equation, you get the fitted ln(Y) for women. Then plug in GENDER=1 with the same X values, and you get the fitted ln(Y) for men. Note that ln(Y) is not the geometric mean (you have to take an exponent to get the geometric mean). The difference between these fitted ln(Y) numbers is the coefficient of GENDER.

Suppose the coefficient for GENDER is b_Gender. And y_women is the geometric mean for women.

Then (roughly speaking):
b_GENDER = ln(y_men)- ln(y_women)
and so:

exp(b_GENDER) = y_men / y_women

Hence, the exponent of the coefficient gives you the ratio of the geometric means of men and women.
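Here is a small numeric check of this identity, as a sketch. It assumes GENDER is the only predictor, in which case the least-squares coefficient is simply the difference between the two groups' mean ln(Y). The data values and names are made up for illustration.

```python
import math

def geometric_mean(values):
    """Geometric mean: exponent of the average of the logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def mean_log(values):
    """Average of ln(y) over a group."""
    return sum(math.log(v) for v in values) / len(values)

# Hypothetical outcome values for each group
y_women = [10.0, 20.0, 40.0]
y_men = [15.0, 30.0, 60.0]

# With a single 0/1 predictor, the coefficient equals the difference in mean ln(Y)
b_gender = mean_log(y_men) - mean_log(y_women)

ratio_of_gms = geometric_mean(y_men) / geometric_mean(y_women)
print(math.exp(b_gender), ratio_of_gms)  # the two numbers agree
```

In this made-up data every man's value is 1.5 times a woman's, so both exp(b_GENDER) and the ratio of geometric means come out at 1.5, i.e. a 50% difference.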

Check out this newsletter, which uses the term "geometric mean". In particular, they explain:
"when we exponentiate the predicted value of ln(Y), we get the predicted geometric mean of Y rather than the predicted arithmetic mean".
http://www.cscu.cornell.edu/news/statnews/stnews83.pdf

Anonymous said...

Thanks. I understood that the exponentiated co-efficient gives the ratio of the geometric means. I was concerned about interpreting this when one subtracts 1 from this and multiplies by 100. Is this ((expB-1)*100) then, the average percent change in "geometric means"? Many thanks for your help.

Galit Shmueli said...

If you are using the geometric mean interpretation, then it is the exact interpretation (you are not doing any approximation to percentages).

Anonymous said...

Ok. I think I get it. Exponentiating and then converting to percent change is simply average percent change in the outcome variable and there's no need to say 'mean' of the outcome variable- just the outcome variable.

THANK you once again for creating this blog and for your help.

Anonymous said...

Sorry to belabor this but if possible, it would be great to know how to calculate the confidence interval for this percent change. Using the 'clparm' option in SAS gives me the 95% CI around the CO-EFFICIENT. I am guessing to obtain the 95% CI around "percent change", I'd have to treat these 2 values like I treated the co-efficient itself, i.e. (ExpLimit1-1)*100, (ExpLimit2-1)*100. Please let me know if this is correct. Google didn't have much to say on this.

Galit Shmueli said...

Yes, that is correct. See the Wikipedia entry on Data transformation. It reads:

"If desired, the confidence interval can then be transformed back to the original scale using the inverse of the transformation that was applied to the data."

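A minimal sketch of this back-transformation, applying 100(exp(.)-1) to the coefficient and to each confidence limit. The function name and the numbers (a coefficient of 0.6 with limits 0.4 and 0.8) are hypothetical and not taken from any real model output.

```python
import math

def pct_change_with_ci(b, lower, upper):
    """Back-transform a log-scale coefficient and its CI limits to percent change."""
    def to_pct(v):
        return 100.0 * math.expm1(v)  # 100*(exp(v)-1)
    return to_pct(b), to_pct(lower), to_pct(upper)

# Hypothetical coefficient and 95% CI limits on the log scale
estimate, ci_low, ci_high = pct_change_with_ci(0.6, 0.4, 0.8)
print(f"{estimate:.1f}% (95% CI: {ci_low:.1f}% to {ci_high:.1f}%)")
```

Because exp is increasing, the transformed limits stay in order, but the interval is no longer symmetric around the transformed estimate.
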
Anonymous said...

Thanks*100 !! :D