Main ideas
- Review multiple linear regression.
- Discuss transformations.
- Review for exam.
We will use the packages tidyverse
and broom
.
library(tidyverse)
library(broom)
We will use the air quality dataset from Tuesday’s lecture. Recall airquality.csv
contains daily air quality measurements in New York from May to September 1973 and collected by the New York State Department of Conservation and the National Weather Service (Chambers, J.M., Cleveland, W.S., Kleiner, B. and Tukey, P.A. (1983) Graphical Methods for Data Analysis. Belmont, CA: Wadsworth).
air_quality <- read.csv("data/airquality.csv")
\[y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k +\epsilon\]
\[\hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k\]
Linearity: The relationship between response and predictor(s) is linear
Independence: The residuals are independent
Normality: The residuals are nearly normally distributed
Equal Variance: The residuals have constant variance
and not too much multicollinearity.
model_ozone <- lm(Ozone ~ Solar.R + Wind + Temp, data = air_quality)
tidy(model_ozone)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -64.3 23.1 -2.79 0.00623
## 2 Solar.R 0.0598 0.0232 2.58 0.0112
## 3 Wind -3.33 0.654 -5.09 0.00000152
## 4 Temp 1.65 0.254 6.52 0.00000000242
model_ozone_aug <- augment(model_ozone)
ggplot(model_ozone_aug,
mapping = aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.50) +
geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) +
labs(x = "Predicted", y = "Residuals")
We see that ozone has a right-skewed distribution and a plot of fitted versus residuals shows problems. There is a trend in the residuals and nonconstant variance.
ggplot(model_ozone_aug, mapping = aes(x = Ozone)) +
geom_histogram()
In these situations a transformation applied to the response variable may be useful. In order to decide which transformation to use, we examine the distribution of the response variable.
The extremely right-skewed distribution suggests a log transformation may be useful. - log = natural log (ln) - default in R
is log(x, base = exp(1))
air_quality <- air_quality %>%
mutate(log_ozone = log(Ozone))
model_log_ozone <- lm(log_ozone ~ Solar.R + Wind + Temp,
data = air_quality)
tidy(model_log_ozone)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.262 0.554 -0.474 6.37e- 1
## 2 Solar.R 0.00252 0.000557 4.52 1.62e- 5
## 3 Wind -0.0616 0.0157 -3.92 1.58e- 4
## 4 Temp 0.0492 0.00609 8.08 1.07e-12
model_log_ozone_aug <- augment(model_log_ozone)
ggplot(model_log_ozone_aug,
mapping = aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.50) +
geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) +
labs(x = "Predicted", y = "Residuals")
\[ {log(\widehat{Ozone})} = -0.262 + 0.003~SolarR - 0.062~Wind + 0.049~Temp\]
\[\begin{align} 0.049 &= \text{Slope}\\ 0.049 &= \dfrac{log(\text{Ozone at Temp + 1}) - log(\text{Ozone at Temp})}{1}\\ 0.049 &=log\left(\dfrac{\text{Ozone at Temp + 1}}{\text{Ozone at Temp}}\right)\\ e^{0.049} &=e^{log\left(\frac{\text{Ozone at Temp + 1}}{\text{Ozone at Temp}}\right)}\\ 1.05 &\approx \dfrac{\text{Ozone at Temp + 1}}{\text{Ozone at Temp}} \end{align}\]
For every one degree increase in temperature, ozone is expected to be higher, on average, by a factor of \(\text{exp}(0.049) = 1.05\), holding all else constant.
We multiply instead of add.
Alternatively…
\[\begin{align} \widehat{log(Ozone)} &= -0.262 + 0.003~SolarR - 0.062~Wind + 0.049~Temp\\ \widehat{Ozone} &= e^{-0.262 + 0.003~SolarR - 0.062~Wind + 0.049~Temp} \end{align}\]
\[\begin{align} \dfrac{\text{Ozone at Temp + 1}}{\text{Ozone at Temp}} = \dfrac{e^{-0.262 + 0.003~SolarR - 0.062~Wind + 0.049~(Temp+1)}}{e^{-0.262 + 0.003~SolarR - 0.062~Wind + 0.049~Temp}} = \dfrac{e^{0.049~(Temp+1)}}{e^{0.049~Temp}} = e^{0.049} \end{align}\]
For every one degree increase in temperature, ozone is expected to be higher, on average, by a factor of \(\text{exp}(0.049) = 1.05\). Again, we multiply instead of add.
model_log_ozone %>%
tidy() %>%
select(term, estimate) %>%
mutate(estimate = round(exp(estimate), 3))
## # A tibble: 4 x 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 0.769
## 2 Solar.R 1.00
## 3 Wind 0.94
## 4 Temp 1.05
In some cases the value of the response variable might be 0, and
log(0)
## [1] -Inf
The trick is to add a very small number to the value of the response variable for these cases so that the log
function can still be applied:
log(0 + 0.00001)
## [1] -11.51293