Main ideas
- Learn how to build and interpret multiple regression models
- Learn how to assess the conditions for linear regression
We’ll use the tidyverse, broom, and car packages.
library(tidyverse)
library(broom)
library(car)
Response variable: Variable whose behavior or variation you are trying to understand, on the y-axis. Also called the dependent variable.
Explanatory variable: Other variables that you want to use to explain the variation in the response, on the x-axis. Also called independent variables, predictors, or features.
Predicted value: Output of the model function.
Residual: Difference between the observed value and the predicted value; it shows how far each case is from its predicted value.
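As a small illustration of these last two terms, here is a sketch using R’s built-in cars data (dist plays the role of the response and speed the explanatory variable):
toy_fit <- lm(dist ~ speed, data = cars)
head(fitted(toy_fit)) # predicted values of the response
head(resid(toy_fit))  # residuals: observed response minus predicted value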
We’re interested in \(\beta_0\) (population parameter for the intercept) and \(\beta_1\) (population parameter for the slope) in the following model:
\[ y = \beta_0 + \beta_1~x + \epsilon \]
Unfortunately, we can’t observe these population parameters directly, so we use the sample data to estimate them.
\[ \hat{y} = b_0 + b_1~x \]
The regression line minimizes the sum of squared residuals.
sports_car_prices <- read.csv("data/sportscars.csv")
model_age <- lm(price ~ age, data = sports_car_prices)
tidy(model_age)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 53246. 3322. 16.0 5.70e-23
## 2 age -2149. 466. -4.62 2.22e- 5
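As a quick check of the claim that the regression line minimizes the sum of squared residuals, we can compare the fitted line’s sum of squares against a line with a slightly different slope (a sketch; the perturbation of 100 is arbitrary):
ss_fitted <- sum(resid(model_age)^2)
b <- coef(model_age)
ss_other <- sum((sports_car_prices$price - (b[1] + (b[2] + 100) * sports_car_prices$age))^2)
ss_fitted < ss_other # TRUE: no other intercept/slope pair does better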
But is age the only variable that predicts price?
\[y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k +\epsilon\]
\[\hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k\]
Does the relationship between price and age depend on type of car?
m_main <- lm(price ~ age + car, data = sports_car_prices)
tidy(m_main)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 44310. 2768. 16.0 9.34e-23
## 2 age -2487. 348. -7.16 1.75e- 9
## 3 carPorsche 21648. 3089. 7.01 3.09e- 9
Linear model:
\[\widehat{Price} = 44310 - 2487~Age + 21648~carPorsche\]
Plug in 0 for carPorsche to get the linear model for Jaguars.
Plug in 1 for carPorsche to get the linear model for Porsches.
Jaguar:
\[\begin{align} \widehat{Price} &= 44310 - 2487~Age + 21648 \times 0\\ &= 44310 - 2487~Age\\ \end{align}\]
Porsche:
\[\begin{align} \widehat{Price} &= 44310 - 2487~Age + 21648 \times 1\\ &= 65958 - 2487~Age\\ \end{align}\]
m_main %>%
tidy() %>%
select(term, estimate)
## # A tibble: 3 x 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 44310.
## 2 age -2487.
## 3 carPorsche 21648.
All else held constant, for each additional year of a car’s age, the price of the car is predicted to decrease, on average, by $2,487.
All else held constant, Porsches are predicted, on average, to have a price that is $21,648 greater than Jaguars.
Jaguars that have an age of 0 are predicted, on average, to have a price of $44,310.
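We can check these interpretations with predict() (a sketch; the new cars below are hypothetical and made up purely for illustration):
new_cars <- data.frame(
  age = c(5, 5, 0),
  car = c("Jaguar", "Porsche", "Jaguar")
)
predict(m_main, newdata = new_cars)
# The second prediction exceeds the first by about 21648 (the Porsche effect),
# and the third equals the intercept, about 44310 (a new Jaguar).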
Question: Are we imposing any structure here? Can you think of a way to improve the model fit?
The model we specified assumes Jaguars and Porsches have the same slope and different intercepts.
What is the most appropriate model for these data?
ggplot(
data = sports_car_prices,
aes(x = age, y = price, color = car)
) +
labs(x = "Age (years)", y = "Price (USD)", color = "Car Make") +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
m_int <- lm(price ~ age * car, data = sports_car_prices)
tidy(m_int)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 56988. 4724. 12.1 3.32e-17
## 2 age -5040. 861. -5.85 2.63e- 7
## 3 carPorsche 6387. 5567. 1.15 2.56e- 1
## 4 age:carPorsche 2969. 929. 3.20 2.28e- 3
\[\widehat{Price} = 56988 - 5040~Age + 6387~carPorsche + 2969~Age \times carPorsche\]
Plug in 0 for carPorsche to get the linear model for Jaguars.
Plug in 1 for carPorsche to get the linear model for Porsches.
Jaguar:
\[\begin{align} \widehat{Price} &= 56988 - 5040~Age + 6387~carPorsche + 2969~Age \times carPorsche \\ &= 56988 - 5040~Age + 6387 \times 0 + 2969~Age \times 0\\ &= 56988 - 5040~Age \end{align}\]
Porsche:
\[\begin{align} \widehat{Price} &= 56988 - 5040~Age + 6387~carPorsche + 2969~Age \times carPorsche \\ &= 56988 - 5040~Age + 6387 \times 1 + 2969~Age \times 1\\ &= 63375 - 2071~Age \end{align}\]
Jaguar: \[\widehat{Price} = 56988 - 5040~Age\]
Porsche: \[\widehat{Price} = 63375 - 2071~Age\]
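One way to recover these two lines from the fitted model is to predict at two ages per make (a sketch; ages 0 and 1 are arbitrary evaluation points):
grid <- data.frame(
  age = rep(c(0, 1), times = 2),
  car = rep(c("Jaguar", "Porsche"), each = 2)
)
grid$pred <- predict(m_int, newdata = grid)
grid
# For each make, the prediction at age 0 is that line's intercept, and the
# difference between the age 1 and age 0 predictions is its slope
# (about -5040 for Jaguars and about -2071 for Porsches).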
Let’s obtain \(R^2\) for our simple model with just age as an explanatory variable.
glance(model_age) %>%
pull(r.squared)
## [1] 0.2686457
Roughly 27% of the variability in price of used sports cars can be explained by age.
Now, let’s look at the \(R^2\) for our other two models.
glance(m_main) %>%
pull(r.squared)
## [1] 0.6071375
glance(m_int) %>%
pull(r.squared)
## [1] 0.6677881
The model with both age and make has an \(R^2\) of about 61%, and the model with the interaction term has an even higher \(R^2\) of about 67%.
Using \(R^2\) for model selection in models with multiple explanatory variables is not a good idea as \(R^2\) increases when any variable is added to the model.
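A quick demonstration of this point (a sketch; noise is a hypothetical column of pure random numbers, so any increase in \(R^2\) it produces is spurious):
set.seed(1)
with_noise <- sports_car_prices %>%
  mutate(noise = rnorm(n()))
glance(lm(price ~ age + car + noise, data = with_noise)) %>%
  pull(r.squared)
# Slightly higher than the R-squared of m_main, even though noise is meaningless.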
We can write explained variation using the following ratio of sums of squares:
\[R^2 = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \right)\]
where \(SS_{Error}\) is the sum of squared residuals and \(SS_{Total}\) is the total sum of squares.
\[R^2_{adj} = 1 - \left(\frac{ SS_{Error} }{ SS_{Total} } \times \frac{n - 1}{n - k - 1} \right)\]
Here \(n\) is the number of observations and \(k\) is the number of predictors in the model.
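To see how the pieces fit together, here is a sketch computing \(R^2\) and adjusted \(R^2\) by hand for m_main; the results should match the glance() output below.
n <- nrow(sports_car_prices)
k <- 2 # predictors in m_main: age and the carPorsche indicator
ss_error <- sum(resid(m_main)^2)
ss_total <- sum((sports_car_prices$price - mean(sports_car_prices$price))^2)
1 - ss_error / ss_total                           # R-squared
1 - (ss_error / ss_total) * (n - 1) / (n - k - 1) # adjusted R-squared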
Let’s get the adjusted \(R^2\) for both models.
glance(m_main) %>%
pull(adj.r.squared)
## [1] 0.5933529
glance(m_int) %>%
pull(adj.r.squared)
## [1] 0.649991
Linearity: The relationship between response and predictor(s) is linear
Independence: The residuals are independent
Normality: The residuals are nearly normally distributed
Equal Variance: The residuals have constant variance
Additionally, for multiple regression, the predictors should not be too correlated with each other.
tidy(m_int) %>%
select(term, estimate) %>%
mutate(estimate = round(estimate, 3))
## # A tibble: 4 x 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 56988.
## 2 age -5040.
## 3 carPorsche 6387.
## 4 age:carPorsche 2969.
We can use augment() to augment the data with model results.
.fitted: Predicted value of the response variable.
.resid: Residuals.
m_int_aug <- augment(m_int)
glimpse(m_int_aug)
## Rows: 60
## Columns: 9
## $ price <int> 69400, 56900, 49900, 47400, 42900, 36900, 83000, 72900, 699…
## $ age <int> 3, 3, 2, 4, 4, 6, 0, 0, 2, 0, 2, 2, 4, 3, 10, 11, 4, 4, 10,…
## $ car <chr> "Porsche", "Porsche", "Porsche", "Porsche", "Porsche", "Por…
## $ .fitted <dbl> 57162.92, 57162.92, 59233.63, 55092.22, 55092.22, 50950.81,…
## $ .resid <dbl> 12237.0778, -262.9222, -9333.6270, -7692.2173, -12192.2173,…
## $ .std.resid <dbl> 1.13837111, -0.02445870, -0.87165630, -0.71356517, -1.13100…
## $ .hat <dbl> 0.04358564, 0.04358564, 0.05099453, 0.03817915, 0.03817915,…
## $ .sigma <dbl> 10962.24, 11091.27, 11015.83, 11040.79, 10963.92, 10922.64,…
## $ .cooksd <dbl> 1.476403e-02, 6.815599e-06, 1.020670e-02, 5.052884e-03, 1.2…
We will use the fitted values and residuals to check the conditions by constructing diagnostic plots.
ggplot(
data = sports_car_prices,
aes(
x = age,
y = price
)
) +
geom_point()
ggplot(
data = m_int_aug,
aes(
x = 1:nrow(sports_car_prices),
y = .resid
)
) +
geom_point() +
labs(x = "Index", y = "Residual")
ggplot(m_int_aug, mapping = aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) +
labs(x = "Predicted Price", y = "Residuals")
ggplot(m_int_aug, mapping = aes(x = .resid)) +
geom_histogram(bins = 15) +
labs(x = "Residuals")
ggplot(m_int_aug, mapping = aes(sample = .resid)) +
stat_qq() +
stat_qq_line()
You don’t want the predictors in a multiple regression model to be too correlated with each other. When they are, you have multicollinearity.
One way to diagnose multicollinearity is with variance inflation factors (VIFs). There’s no strict threshold, but a VIF of 5 is sometimes used as a cutoff.
Let’s see if we have multicollinearity in our first model.
vif(m_main)
## age car
## 1.01964 1.01964
Now, let’s check the interaction model.
vif(m_int)
## age car age:car
## 7.268869 3.847798 11.244300
model_age_mileage <- lm(price ~ age + mileage, data = sports_car_prices)
tidy(model_age_mileage)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 62950. 3176. 19.8 3.03e-27
## 2 age 516. 601. 0.859 3.94e- 1
## 3 mileage -0.695 0.122 -5.68 4.75e- 7
glance(model_age_mileage) %>%
pull(adj.r.squared)
## [1] 0.5167014
vif(model_age_mileage)
## age mileage
## 2.562603 2.562603
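For context, a predictor’s VIF equals \(1 / (1 - R^2)\) from regressing that predictor on the remaining predictors; here is a sketch verifying the value reported for age.
r2_age <- glance(lm(age ~ mileage, data = sports_car_prices))$r.squared
1 / (1 - r2_age) # should match vif(model_age_mileage)["age"]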
Now, turn to the dataset in nycairquality.csv. This file contains daily air quality measurements in New York from May to September 1973, collected by the New York State Department of Conservation and the National Weather Service (Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983). Graphical Methods for Data Analysis. Belmont, CA: Wadsworth).
Ozone: ozone (ppb)
Solar.R: solar radiation (langleys)
Wind: wind (mph)
Temp: temperature (degrees F)
airquality <- read.csv("data/airquality.csv")
model_ozone <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
tidy(model_ozone)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -64.3 23.1 -2.79 0.00623
## 2 Solar.R 0.0598 0.0232 2.58 0.0112
## 3 Wind -3.33 0.654 -5.09 0.00000152
## 4 Temp 1.65 0.254 6.52 0.00000000242
model_ozone_aug <- augment(model_ozone)
ggplot(
data = model_ozone_aug,
aes(x = 1:nrow(airquality),
y = .resid)
) +
geom_point() +
labs(x = "Index", y = "Residuals")
ggplot(model_ozone_aug,
mapping = aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.50) +
geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2)
ggplot(model_ozone_aug, mapping = aes(x = .resid)) +
geom_histogram(bins = 15) +
labs(x = "Residuals")
ggplot(model_ozone_aug, mapping = aes(sample = .resid)) +
stat_qq() +
stat_qq_line()