## Tuesday, September 15, 2009

### Interpreting log-transformed variables in linear regression

Statisticians love variable transformations. log-em, square-em, square-root-em, or even use the all-encompassing Box-Cox transformation, and voilla: you get variables that are "better behaved". Good behavior to statistician parents means things like kids with normal behavior (=normally distributed) and stable variance. Transformations are often used in order to be able to use popular tools such as linear regression, where the underlying assumptions require "well-behaved" variables.

Moving into the world of business, one transformation is more than just a "statistical technicality": the log transform. It turns out that taking a log function of the inputs (X's) and/or output (Y) variables in linear regression yields meaningful, interpretable relationships (there seems to be a misconception that linear regression is only useful for modeling a linear input-output relationship, but the truth is that the name "linear" describes the linear relationship between Y and the coefficients... very confusing indeed, and the fault of statisticians, of course!). Using log transforms enables modeling a wide range of meaningful, useful, non-linear relationships between inputs and outputs. Using a log-transform moves from unit-based interpretations to percentage-based interpretations.

So let's see how the log-transform works for linear regression interpretations.
Note: I use "log" to denote "log base e" (also known as "ln", or in Excel the function "=LN"). You can do the same with log base 10, but the interpretations are not as slick.

Let's start with a linear relationship between X and Y of the form (ignoring the noise part for simplicity):
Y = a + b X
The interpretation of b is: a unit increase in X is associated with an average of b units increase in Y.

Now, let's assume an exponential relationship of the form: Y = a exp(b X)
If we take logs on both sides we get: log(Y) = c + b X
The interpretation of b is:  a unit increase in X in associated with an average of 100b percent increase in Y. This approximate interpretation works well for |b|<0.1. Otherwise, the exact relationship is: a unit increase in X is associated with an average increase of 100(exp(b)-1) percent.

Techical explanation:
Take a derivative of the last equation with respect to X (to denot a small increase in X). You get
1/Y dY/dx = b,  or equivalently,  dY/Y = b dX.
dX means a small increase in X, and dY is the associated increase in Y. The quantity dY/Y is a small proportional increase in Y (so 100 time dY/Y is a small percentage increase in Y). Hence, a small unit increase in X is associated with an average increase of 100b% increase in Y.

Another popular non-linear relationship is a log-relationship of the form: Y = a + b log(X)
Here the (approximate) interpretation of b is: a 1% increase in X is associated with an average b/100 units increase in Y. (Use the same steps in the previous technical explanation to get this result). The approximate interpretation is fairly accurate (the exact interpretation is: a 1% increase in X is associated with an average increase of (b)(log(1.01)) in Y, but log(1.01) is practically 0.01).

Finally, another very common relationship in business is completely multiplicative: Y = a Xb. If we take logs here we get log(Y) = c + b log(X).
The approximate interpretation of b is: a 1% increase in X is associated with a b% increase in Y. Like the exponential model, the approximate interpretation works for |b|>0.1, and otherwise the exact interpretation is: a 1% increase in X is associated with an average 100*exp(d log(1.01)-1) percent increase in Y.

Finally, note that although I've described a relationship between Y and a single X, all this can be extended to multiple X's. For example, to a multiplicative model such as: Y = a X1X2X3.

Although this stuff is extremely useful, it is not easily found in many textbooks. Hence this post. I did find a good description in the book Regression methods in biostatistics: linear, logistic, survival, and repeated models by Vittinghoff et al. (see the relevant pages in Google books).