Linear Regression IV: Transformation

Song of the Day

Song of Scheherazade - Renaissance

Main ideas

Review multiple linear regression.
Discuss transformations.
Review for exam.

Coming up

Exam #02 opens on Thursday at 1:30 PM
Statistical Experience
- No class on Tuesday

Packages

We will use the packages tidyverse and broom.

library(tidyverse)
library(broom)

Data

We will use the air quality dataset from Tuesday’s lecture. Recall airquality.csv contains daily air quality measurements in New York from May to September 1973 and collected by the New York State Department of Conservation and the National Weather Service (Chambers, J.M., Cleveland, W.S., Kleiner, B. and Tukey, P.A. (1983) Graphical Methods for Data Analysis. Belmont, CA: Wadsworth).

air_quality <- read.csv("data/airquality.csv")

\[y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k +\epsilon\]

\[\hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k\]

Conditions for Linear Regression
- Linearity: The relationship between response and predictor(s) is linear
- Independence: The residuals are independent
- Normality: The residuals are nearly normally distributed
- Equal Variance: The residuals have constant variance
- and not too much multicollinearity.

Returning to the Air Quality Data

model_ozone <- lm(Ozone ~ Solar.R + Wind + Temp, data = air_quality)
tidy(model_ozone)

## # A tibble: 4 x 5
##   term        estimate std.error statistic       p.value
##   <chr>          <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept) -64.3      23.1        -2.79 0.00623      
## 2 Solar.R       0.0598    0.0232      2.58 0.0112       
## 3 Wind         -3.33      0.654      -5.09 0.00000152   
## 4 Temp          1.65      0.254       6.52 0.00000000242

model_ozone_aug <- augment(model_ozone)

ggplot(model_ozone_aug, 
       mapping = aes(x = .fitted, y = .resid)) + 
  geom_point(alpha = 0.50) + 
  geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) + 
  labs(x = "Predicted", y = "Residuals")

We see that ozone has a right-skewed distribution and a plot of fitted versus residuals shows problems. There is a trend in the residuals and nonconstant variance.

ggplot(model_ozone_aug, mapping = aes(x = Ozone)) +
  geom_histogram()

In these situations a transformation applied to the response variable may be useful. In order to decide which transformation to use, we examine the distribution of the response variable.

The extremely right-skewed distribution suggests a log transformation may be useful. - log = natural log (ln) - default in R is log(x, base = exp(1))

air_quality <- air_quality %>%
  mutate(log_ozone = log(Ozone))

model_log_ozone <- lm(log_ozone ~ Solar.R + Wind + Temp, 
                data = air_quality)
tidy(model_log_ozone)

## # A tibble: 4 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) -0.262    0.554       -0.474 6.37e- 1
## 2 Solar.R      0.00252  0.000557     4.52  1.62e- 5
## 3 Wind        -0.0616   0.0157      -3.92  1.58e- 4
## 4 Temp         0.0492   0.00609      8.08  1.07e-12

model_log_ozone_aug <- augment(model_log_ozone)

ggplot(model_log_ozone_aug, 
       mapping = aes(x = .fitted, y = .resid)) + 
  geom_point(alpha = 0.50) + 
  geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) + 
  labs(x = "Predicted", y = "Residuals")

Linear Model with Log Tranformation

\[ {log(\widehat{Ozone})} = -0.262 + 0.003~SolarR - 0.062~Wind + 0.049~Temp\]

For every additional degree Fahrenheit of temperature, the log ozone is expected to be higher, on average, by 0.049, holding all else constant.
This is not helpful…

Working with logs

Subtraction and logs: \(log(a) - log(b) = log(a / b)\)
Natural logarithm: \(e^{log(x)} = x\)
We can use these identities to “undo” the log transformation

Interpreting models with log transformation

\[\begin{align} 0.049 &= \text{Slope}\\ 0.049 &= \dfrac{log(\text{Ozone at Temp + 1}) - log(\text{Ozone at Temp})}{1}\\ 0.049 &=log\left(\dfrac{\text{Ozone at Temp + 1}}{\text{Ozone at Temp}}\right)\\ e^{0.049} &=e^{log\left(\frac{\text{Ozone at Temp + 1}}{\text{Ozone at Temp}}\right)}\\ 1.05 &\approx \dfrac{\text{Ozone at Temp + 1}}{\text{Ozone at Temp}} \end{align}\]

For every one degree increase in temperature, ozone is expected to be higher, on average, by a factor of \(\text{exp}(0.049) = 1.05\), holding all else constant.

We multiply instead of add.

Interpreting models with log transformation

Alternatively…

\[\begin{align} \widehat{log(Ozone)} &= -0.262 + 0.003~SolarR - 0.062~Wind + 0.049~Temp\\ \widehat{Ozone} &= e^{-0.262 + 0.003~SolarR - 0.062~Wind + 0.049~Temp} \end{align}\]

\[\begin{align} \dfrac{\text{Ozone at Temp + 1}}{\text{Ozone at Temp}} = \dfrac{e^{-0.262 + 0.003~SolarR - 0.062~Wind + 0.049~(Temp+1)}}{e^{-0.262 + 0.003~SolarR - 0.062~Wind + 0.049~Temp}} = \dfrac{e^{0.049~(Temp+1)}}{e^{0.049~Temp}} = e^{0.049} \end{align}\]

For every one degree increase in temperature, ozone is expected to be higher, on average, by a factor of \(\text{exp}(0.049) = 1.05\). Again, we multiply instead of add.

Shortcuts in R

model_log_ozone %>%
  tidy() %>%
  select(term, estimate) %>%
  mutate(estimate = round(exp(estimate), 3))

## # A tibble: 4 x 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)    0.769
## 2 Solar.R        1.00 
## 3 Wind           0.94 
## 4 Temp           1.05

Recap

Non-constant variance is one of the most common model violations, however it is usually fixable by transforming the response (y) variable.
The most common transformation when the response variable is right skewed is the log transform: \(log(y)\), especially useful when the response variable is (extremely) right skewed.
When using a log transformation on the response variable the interpretation of the slope changes: “For each unit increase in x, y is expected to multiply by factor of \(e^{b_1}\).”
Another useful transformation is the square root: \(\sqrt{y}\), especially useful when the response variable is counts.

Transform, or learn more?

Data transformations may also be useful when the relationship is non-linear
However in those cases a polynomial regression may be more appropriate
- This is beyond the scope of this course, but you’re welcomed to try it for your final project, and I’d be happy to provide further guidance

In some cases the value of the response variable might be 0, and

log(0)

## [1] -Inf

The trick is to add a very small number to the value of the response variable for these cases so that the log function can still be applied:

log(0 + 0.00001)

## [1] -11.51293

Linear Regression IV: Transformation

4-01-2021

Song of the Day

Main ideas

Coming up

Packages

Data

Returning to the Air Quality Data

Linear Model with Log Tranformation

Working with logs

Interpreting models with log transformation

Interpreting models with log transformation

Shortcuts in R

Recap

Transform, or learn more?