Song of the Day

Main ideas

Coming up

Packages

We will use the packages tidyverse and broom.

library(tidyverse)
library(broom)

Data

We will use the air quality dataset from Tuesday’s lecture. Recall airquality.csv contains daily air quality measurements in New York from May to September 1973 and collected by the New York State Department of Conservation and the National Weather Service (Chambers, J.M., Cleveland, W.S., Kleiner, B. and Tukey, P.A. (1983) Graphical Methods for Data Analysis. Belmont, CA: Wadsworth).

air_quality <- read.csv("data/airquality.csv")

\[y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k +\epsilon\]

\[\hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k\]

Returning to the Air Quality Data

model_ozone <- lm(Ozone ~ Solar.R + Wind + Temp, data = air_quality)
tidy(model_ozone)
## # A tibble: 4 x 5
##   term        estimate std.error statistic       p.value
##   <chr>          <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept) -64.3      23.1        -2.79 0.00623      
## 2 Solar.R       0.0598    0.0232      2.58 0.0112       
## 3 Wind         -3.33      0.654      -5.09 0.00000152   
## 4 Temp          1.65      0.254       6.52 0.00000000242
model_ozone_aug <- augment(model_ozone)

ggplot(model_ozone_aug, 
       mapping = aes(x = .fitted, y = .resid)) + 
  geom_point(alpha = 0.50) + 
  geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) + 
  labs(x = "Predicted", y = "Residuals")

We see that ozone has a right-skewed distribution and a plot of fitted versus residuals shows problems. There is a trend in the residuals and nonconstant variance.

ggplot(model_ozone_aug, mapping = aes(x = Ozone)) +
  geom_histogram()

In these situations a transformation applied to the response variable may be useful. In order to decide which transformation to use, we examine the distribution of the response variable.

The extremely right-skewed distribution suggests a log transformation may be useful. - log = natural log (ln) - default in R is log(x, base = exp(1))

air_quality <- air_quality %>%
  mutate(log_ozone = log(Ozone))

model_log_ozone <- lm(log_ozone ~ Solar.R + Wind + Temp, 
                data = air_quality)
tidy(model_log_ozone)
## # A tibble: 4 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) -0.262    0.554       -0.474 6.37e- 1
## 2 Solar.R      0.00252  0.000557     4.52  1.62e- 5
## 3 Wind        -0.0616   0.0157      -3.92  1.58e- 4
## 4 Temp         0.0492   0.00609      8.08  1.07e-12
model_log_ozone_aug <- augment(model_log_ozone)

ggplot(model_log_ozone_aug, 
       mapping = aes(x = .fitted, y = .resid)) + 
  geom_point(alpha = 0.50) + 
  geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) + 
  labs(x = "Predicted", y = "Residuals")

Linear Model with Log Tranformation

\[ {log(\widehat{Ozone})} = -0.262 + 0.003~SolarR - 0.062~Wind + 0.049~Temp\]

Working with logs

Interpreting models with log transformation

\[\begin{align} 0.049 &= \text{Slope}\\ 0.049 &= \dfrac{log(\text{Ozone at Temp + 1}) - log(\text{Ozone at Temp})}{1}\\ 0.049 &=log\left(\dfrac{\text{Ozone at Temp + 1}}{\text{Ozone at Temp}}\right)\\ e^{0.049} &=e^{log\left(\frac{\text{Ozone at Temp + 1}}{\text{Ozone at Temp}}\right)}\\ 1.05 &\approx \dfrac{\text{Ozone at Temp + 1}}{\text{Ozone at Temp}} \end{align}\]

For every one degree increase in temperature, ozone is expected to be higher, on average, by a factor of \(\text{exp}(0.049) = 1.05\), holding all else constant.

We multiply instead of add.

Interpreting models with log transformation

Alternatively…

\[\begin{align} \widehat{log(Ozone)} &= -0.262 + 0.003~SolarR - 0.062~Wind + 0.049~Temp\\ \widehat{Ozone} &= e^{-0.262 + 0.003~SolarR - 0.062~Wind + 0.049~Temp} \end{align}\]

\[\begin{align} \dfrac{\text{Ozone at Temp + 1}}{\text{Ozone at Temp}} = \dfrac{e^{-0.262 + 0.003~SolarR - 0.062~Wind + 0.049~(Temp+1)}}{e^{-0.262 + 0.003~SolarR - 0.062~Wind + 0.049~Temp}} = \dfrac{e^{0.049~(Temp+1)}}{e^{0.049~Temp}} = e^{0.049} \end{align}\]

For every one degree increase in temperature, ozone is expected to be higher, on average, by a factor of \(\text{exp}(0.049) = 1.05\). Again, we multiply instead of add.

Shortcuts in R

model_log_ozone %>%
  tidy() %>%
  select(term, estimate) %>%
  mutate(estimate = round(exp(estimate), 3))
## # A tibble: 4 x 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)    0.769
## 2 Solar.R        1.00 
## 3 Wind           0.94 
## 4 Temp           1.05

Recap

Transform, or learn more?

In some cases the value of the response variable might be 0, and

log(0)
## [1] -Inf

The trick is to add a very small number to the value of the response variable for these cases so that the log function can still be applied:

log(0 + 0.00001)
## [1] -11.51293