Main ideas
- Learn how to build and interpret multiple regression models
- Learn how to assess the conditions for linear regression
We’ll use the tidyverse, broom, and car packages.
library(tidyverse)
library(broom)
library(car)
Response variable: Variable whose behavior or variation you are trying to understand, on the y-axis. Also called the dependent variable.
Explanatory variable: Other variables that you want to use to explain the variation in the response, on the x-axis. Also called independent variables, predictors, or features.
Predicted value: Output of the model function.
Residual: Difference between the observed value and the predicted value; it shows how far each case is from its predicted value.
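As a small illustration of these last two terms, here is a sketch using R’s built-in cars data (dist plays the role of the response and speed the explanatory variable):
toy_fit <- lm(dist ~ speed, data = cars)
head(fitted(toy_fit)) # predicted values of the response
head(resid(toy_fit))  # residuals: observed response minus predicted value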
We’re interested in \(\beta_0\) (population parameter for the intercept) and \(\beta_1\) (population parameter for the slope) in the following model:
\[ y = \beta_0 + \beta_1~x + \epsilon \]
Unfortunately, we can’t observe these population parameters directly, so we use the sample data to estimate them.
\[ \hat{y} = b_0 + b_1~x \]
The regression line minimizes the sum of squared residuals.
sports_car_prices <- read.csv("data/sportscars.csv")
model_age <- lm(price ~ age, data = sports_car_prices)
tidy(model_age)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 53246. 3322. 16.0 5.70e-23
## 2 age -2149. 466. -4.62 2.22e- 5
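As a quick check of the claim that the regression line minimizes the sum of squared residuals, we can compare the fitted line’s sum of squares against a line with a slightly different slope (a sketch; the perturbation of 100 is arbitrary):
ss_fitted <- sum(resid(model_age)^2)
b <- coef(model_age)
ss_other <- sum((sports_car_prices$price - (b[1] + (b[2] + 100) * sports_car_prices$age))^2)
ss_fitted < ss_other # TRUE: no other intercept/slope pair does better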
But is age the only variable that predicts price?
\[y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k +\epsilon\]
\[\hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k\]
Does the relationship between price and age depend on type of car?
m_main <- lm(price ~ age + car, data = sports_car_prices)
tidy(m_main)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 44310. 2768. 16.0 9.34e-23
## 2 age -2487. 348. -7.16 1.75e- 9
## 3 carPorsche 21648. 3089. 7.01 3.09e- 9
Linear model:
\[\widehat{Price} = 44310 - 2487~Age + 21648~carPorsche\]
Plug in 0 for carPorsche to get the linear model for Jaguars.
Plug in 1 for carPorsche to get the linear model for Porsches.
Jaguar:
\[\begin{align} \widehat{Price} &= 44310 - 2487~Age + 21648 \times 0\\ &= 44310 - 2487~Age\\ \end{align}\]
Porsche:
\[\begin{align} \widehat{Price} &= 44310 - 2487~Age + 21648 \times 1\\ &= 65958 - 2487~Age\\ \end{align}\]
m_main %>%
tidy() %>%
select(term, estimate)
## # A tibble: 3 x 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 44310.
## 2 age -2487.
## 3 carPorsche 21648.
All else held constant, for each additional year of a car’s age, the price of the car is predicted to decrease, on average, by $2,487.
All else held constant, Porsches are predicted, on average, to have a price that is $21,648 greater than Jaguars.
Jaguars that have an age of 0 are predicted, on average, to have a price of $44,310.
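We can check these interpretations with predict() (a sketch; the new cars below are hypothetical and made up purely for illustration):
new_cars <- data.frame(
  age = c(5, 5, 0),
  car = c("Jaguar", "Porsche", "Jaguar")
)
predict(m_main, newdata = new_cars)
# The second prediction exceeds the first by about 21648 (the Porsche effect),
# and the third equals the intercept, about 44310 (a new Jaguar).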
Question: Are we imposing any structure here? Can you think of a way to improve the model fit?
The model we specified assumes Jaguars and Porsches have the same slope and different intercepts.
What is the most appropriate model for these data?
ggplot(
data = sports_car_prices,
aes(x = age, y = price, color = car)
) +
labs(x = "Age (years)", y = "Price (USD)", color = "Car Make") +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
m_int <- lm(price ~ age * car, data = sports_car_prices)
tidy(m_int)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 56988. 4724. 12.1 3.32e-17
## 2 age -5040. 861. -5.85 2.63e- 7
## 3 carPorsche 6387. 5567. 1.15 2.56e- 1
## 4 age:carPorsche 2969. 929. 3.20 2.28e- 3
\[\widehat{Price} = 56988 - 5040~Age + 6387~carPorsche + 2969~Age \times carPorsche\]
Plug in 0 for carPorsche to get the linear model for Jaguars.
Plug in 1 for carPorsche to get the linear model for Porsches.
Jaguar:
\[\begin{align} \widehat{Price} &= 56988 - 5040~Age + 6387~carPorsche + 2969~Age \times carPorsche \\ &= 56988 - 5040~Age + 6387 \times 0 + 2969~Age \times 0\\ &= 56988 - 5040~Age \end{align}\]
Porsche:
\[\begin{align} \widehat{Price} &= 56988 - 5040~Age + 6387~carPorsche + 2969~Age \times carPorsche \\ &= 56988 - 5040~Age + 6387 \times 1 + 2969~Age \times 1\\ &= 63375 - 2071~Age \end{align}\]
Jaguar: \[\widehat{Price} = 56988 - 5040~Age\]
Porsche: \[\widehat{Price} = 63375 - 2071~Age\]
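One way to recover these two lines from the fitted model is to predict at two ages per make (a sketch; ages 0 and 1 are arbitrary evaluation points):
grid <- data.frame(
  age = rep(c(0, 1), times = 2),
  car = rep(c("Jaguar", "Porsche"), each = 2)
)
grid$pred <- predict(m_int, newdata = grid)
grid
# For each make, the prediction at age 0 is that line's intercept, and the
# difference between the age 1 and age 0 predictions is its slope
# (about -5040 for Jaguars and about -2071 for Porsches).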
Let’s obtain \(R^2\) for our simple model with just age as an explanatory variable.
glance(model_age) %>%
pull(r.squared)
## [1] 0.2686457
Roughly 27% of the variability in price of used sports cars can be explained by age.
Now, let’s look at the \(R^2\) for our other two models.
glance(m_main) %>%
pull(r.squared)
## [1] 0.6071375
glance(m_int) %>%
pull(r.squared)
## [1] 0.6677881
The model with both age and make has an \(R^2\) of about 61%, and the model with the interaction term has an even higher \(R^2\) of about 67%.
Using \(R^2\) for model selection in models with multiple explanatory variables is not a good idea as \(R^2\) increases when any variable is added to the model.
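A quick demonstration of this point (a sketch; noise is a hypothetical column of pure random numbers, so any increase in \(R^2\) it produces is spurious):
set.seed(1)
with_noise <- sports_car_prices %>%
  mutate(noise = rnorm(n()))
glance(lm(price ~ age + car + noise, data = with_noise)) %>%
  pull(r.squared)
# Slightly higher than the R-squared of m_main, even though noise is meaningless.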
We can write explained variation using the following ratio of sums of squares:
\[R^2 = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \right)\]
where \(SS_{Error}\) is the sum of squared residuals and \(SS_{Total}\) is the total sum of squares.
\[R^2_{adj} = 1 - \left(\frac{ SS_{Error} }{ SS_{Total} } \times \frac{n - 1}{n - k - 1} \right)\]
Here \(n\) is the number of observations and \(k\) is the number of predictors in the model.
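To see how the pieces fit together, here is a sketch computing \(R^2\) and adjusted \(R^2\) by hand for m_main; the results should match the glance() output below.
n <- nrow(sports_car_prices)
k <- 2 # predictors in m_main: age and the carPorsche indicator
ss_error <- sum(resid(m_main)^2)
ss_total <- sum((sports_car_prices$price - mean(sports_car_prices$price))^2)
1 - ss_error / ss_total                           # R-squared
1 - (ss_error / ss_total) * (n - 1) / (n - k - 1) # adjusted R-squared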
Let’s get the adjusted \(R^2\) for both models.
glance(m_main) %>%
pull(adj.r.squared)
## [1] 0.5933529
glance(m_int) %>%
pull(adj.r.squared)
## [1] 0.649991
Linearity: The relationship between response and predictor(s) is linear
Independence: The residuals are independent
Normality: The residuals are nearly normally distributed
Equal Variance: The residuals have constant variance
Additionally, for multiple regression, the predictors should not be too correlated with each other.
tidy(m_int) %>%
select(term, estimate) %>%
mutate(estimate = round(estimate, 3))
## # A tibble: 4 x 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 56988.
## 2 age -5040.
## 3 carPorsche 6387.
## 4 age:carPorsche 2969.
We can use augment() to augment the data with model results.
.fitted: Predicted value of the response variable.
.resid: Residuals.
m_int_aug <- augment(m_int)
glimpse(m_int_aug)
## Rows: 60
## Columns: 9
## $ price <int> 69400, 56900, 49900, 47400, 42900, 36900, 83000, 72900, 699…
## $ age <int> 3, 3, 2, 4, 4, 6, 0, 0, 2, 0, 2, 2, 4, 3, 10, 11, 4, 4, 10,…
## $ car <chr> "Porsche", "Porsche", "Porsche", "Porsche", "Porsche", "Por…
## $ .fitted <dbl> 57162.92, 57162.92, 59233.63, 55092.22, 55092.22, 50950.81,…
## $ .resid <dbl> 12237.0778, -262.9222, -9333.6270, -7692.2173, -12192.2173,…
## $ .std.resid <dbl> 1.13837111, -0.02445870, -0.87165630, -0.71356517, -1.13100…
## $ .hat <dbl> 0.04358564, 0.04358564, 0.05099453, 0.03817915, 0.03817915,…
## $ .sigma <dbl> 10962.24, 11091.27, 11015.83, 11040.79, 10963.92, 10922.64,…
## $ .cooksd <dbl> 1.476403e-02, 6.815599e-06, 1.020670e-02, 5.052884e-03, 1.2…
We will use the fitted values and residuals to check the conditions by constructing diagnostic plots.
ggplot(
data = sports_car_prices,
aes(
x = age,
y = price
)
) +
geom_point()
ggplot(
data = m_int_aug,
aes(
x = 1:nrow(sports_car_prices),
y = .resid
)
) +
geom_point() +
labs(x = "Index", y = "Residual")
ggplot(m_int_aug, mapping = aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) +
labs(x = "Predicted Price", y = "Residuals")
ggplot(m_int_aug, mapping = aes(x = .resid)) +
geom_histogram(bins = 15) +
labs(x = "Residuals")
ggplot(m_int_aug, mapping = aes(sample = .resid)) +
stat_qq() +
stat_qq_line()
You don’t want the predictors in a multiple regression model to be too correlated with each other. When they are, you have multicollinearity.
One way to diagnose multicollinearity is with variance inflation factors (VIFs). There’s no strict threshold, but a VIF of 5 is sometimes used as a cutoff.
Let’s see if we have multicollinearity in our first model.
vif(m_main)
## age car
## 1.01964 1.01964
Now, let’s check the interaction model.
vif(m_int)
## age car age:car
## 7.268869 3.847798 11.244300
model_age_mileage <- lm(price ~ age + mileage, data = sports_car_prices)
tidy(model_age_mileage)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 62950. 3176. 19.8 3.03e-27
## 2 age 516. 601. 0.859 3.94e- 1
## 3 mileage -0.695 0.122 -5.68 4.75e- 7
glance(model_age_mileage) %>%
pull(adj.r.squared)
## [1] 0.5167014
vif(model_age_mileage)
## age mileage
## 2.562603 2.562603
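For context, a predictor’s VIF equals \(1 / (1 - R^2)\) from regressing that predictor on the remaining predictors; here is a sketch verifying the value reported for age.
r2_age <- glance(lm(age ~ mileage, data = sports_car_prices))$r.squared
1 / (1 - r2_age) # should match vif(model_age_mileage)["age"]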
Now, turn to the dataset in nycairquality.csv. This file contains daily air quality measurements in New York from May to September 1973, collected by the New York State Department of Conservation and the National Weather Service (Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983). Graphical Methods for Data Analysis. Belmont, CA: Wadsworth).
Ozone: ozone (ppb)
Solar.R: solar radiation (langleys)
Wind: wind (mph)
Temp: temperature (degrees F)
airquality <- read.csv("data/airquality.csv")
model_ozone <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
tidy(model_ozone)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -64.3 23.1 -2.79 0.00623
## 2 Solar.R 0.0598 0.0232 2.58 0.0112
## 3 Wind -3.33 0.654 -5.09 0.00000152
## 4 Temp 1.65 0.254 6.52 0.00000000242
model_ozone_aug <- augment(model_ozone)
ggplot(
data = model_ozone_aug,
aes(x = 1:nrow(airquality),
y = .resid)
) +
geom_point() +
labs(x = "Index", y = "Residuals")
ggplot(model_ozone_aug,
mapping = aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.50) +
geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2)
ggplot(model_ozone_aug, mapping = aes(x = .resid)) +
geom_histogram(bins = 15) +
labs(x = "Residuals")
ggplot(model_ozone_aug, mapping = aes(sample = .resid)) +
stat_qq() +
stat_qq_line()