Packages

We’ll use the tidyverse, broom, and car packages.

library(tidyverse)
library(broom)
library(car)

The linear model with a single predictor

We’re interested in \(\beta_0\) (population parameter for the intercept) and \(\beta_1\) (population parameter for the slope) in the following model:

\[ y = \beta_0 + \beta_1~x + \epsilon \]

Unfortunately, we can't observe these population parameters directly, so we use sample data to estimate them:

\[ \hat{y} = b_0 + b_1~x \]

Least squares regression

The regression line minimizes the sum of squared residuals.
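
To make this concrete, here is a minimal sketch that computes the least squares estimates by hand and checks them against lm(). The vectors x and y are small made-up examples, not the course data.

# Least squares by hand on small made-up vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
b1 <- cov(x, y) / var(x)      # slope: sample covariance over sample variance
b0 <- mean(y) - b1 * mean(x)  # intercept: the line passes through (mean(x), mean(y))
c(b0, b1)
coef(lm(y ~ x))               # should match the closed-form estimates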

Data

We'll use data on the price and age of 60 used sports cars (Porsches and Jaguars), starting with a simple regression of price on age.

sports_car_prices <- read.csv("data/sportscars.csv")
model_age <- lm(price ~ age, data = sports_car_prices)
tidy(model_age)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   53246.     3322.     16.0  5.70e-23
## 2 age           -2149.      466.     -4.62 2.22e- 5

But is age the only variable that predicts price?

The linear model with multiple predictors

\[y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k +\epsilon\]

\[\hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k\]

Price vs. age and type of car

Does the relationship between price and age depend on type of car?

Modeling with main effects

m_main <- lm(price ~ age + car, data = sports_car_prices)
tidy(m_main)
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   44310.     2768.     16.0  9.34e-23
## 2 age           -2487.      348.     -7.16 1.75e- 9
## 3 carPorsche    21648.     3089.      7.01 3.09e- 9

Linear model:

\[\widehat{Price} = 44310 - 2487~Age + 21648~carPorsche\]

For Jaguars (carPorsche = 0):

\[\begin{align} \widehat{Price} &= 44310 - 2487~Age + 21648 \times 0\\ &= 44310 - 2487~Age\\ \end{align}\]

For Porsches (carPorsche = 1):

\[\begin{align} \widehat{Price} &= 44310 - 2487~Age + 21648 \times 1\\ &= 65958 - 2487~Age\\ \end{align}\]
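
As a quick check, we can ask predict() for the fitted price of a brand-new car of each make; the results should match the two intercepts above. This sketch assumes the baseline level of car in the data is "Jaguar", consistent with the carPorsche dummy in the output.

# Predicted price at age 0 for each make (should equal 44310 and 65958)
new_cars <- data.frame(age = c(0, 0), car = c("Jaguar", "Porsche"))
predict(m_main, newdata = new_cars)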

Interpretation of main effects

Main effects, numerical and categorical predictors

m_main %>%
  tidy() %>%
  select(term, estimate)
## # A tibble: 3 x 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)   44310.
## 2 age           -2487.
## 3 carPorsche    21648.

Question: Are we imposing any structure here? Can you think of a way to improve the model fit?

The model we specified assumes Jaguars and Porsches have the same slope and different intercepts.

What is the most appropriate model for these data?

  1. Same slope and the same intercept for Jaguars and Porsches.
  2. Same slope and different intercepts for Jaguars and Porsches.
  3. Different slopes and different intercepts for Jaguars and Porsches.

Interacting explanatory variables

Modeling with interaction effects

ggplot(
  data = sports_car_prices,
  aes(x = age, y = price, color = car)
) +
  labs(x = "Age (years)", y = "Price (USD)", color = "Car Make") +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

m_int <- lm(price ~ age * car, data = sports_car_prices)
tidy(m_int)
## # A tibble: 4 x 5
##   term           estimate std.error statistic  p.value
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      56988.     4724.     12.1  3.32e-17
## 2 age              -5040.      861.     -5.85 2.63e- 7
## 3 carPorsche        6387.     5567.      1.15 2.56e- 1
## 4 age:carPorsche    2969.      929.      3.20 2.28e- 3

\[\widehat{Price} = 56988 - 5040~Age + 6387~carPorsche + 2969~Age \times carPorsche\]

Interpretation of interaction effects

\[\widehat{Price} = 56988 - 5040~Age + 6387~carPorsche + 2969~Age \times carPorsche\]

For Jaguars (carPorsche = 0):

\[\begin{align} \widehat{Price} &= 56988 - 5040~Age + 6387~carPorsche + 2969~Age \times carPorsche \\ &= 56988 - 5040~Age + 6387 \times 0 + 2969~Age \times 0\\ &= 56988 - 5040~Age \end{align}\]

For Porsches (carPorsche = 1):

\[\begin{align} \widehat{Price} &= 56988 - 5040~Age + 6387~carPorsche + 2969~Age \times carPorsche \\ &= 56988 - 5040~Age + 6387 \times 1 + 2969~Age \times 1\\ &= 63375 - 2071~Age \end{align}\]

To summarize, the two fitted lines are:

Jaguar:

\[\widehat{Price} = 56988 - 5040~Age\]

Porsche:

\[\widehat{Price} = 63375 - 2071~Age\]
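
We can recover the Porsche line directly from the coefficients of m_int; this is just the arithmetic on the estimates shown above.

# Porsche intercept and slope from the interaction model
coefs <- coef(m_int)
coefs["(Intercept)"] + coefs["carPorsche"]  # 56988 + 6387 = 63375
coefs["age"] + coefs["age:carPorsche"]      # -5040 + 2969 = -2071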

Adjusted \(R^2\)

Let’s obtain \(R^2\) for our simple model with just age as an explanatory variable.

glance(model_age) %>%
  pull(r.squared)
## [1] 0.2686457

Roughly 27% of the variability in price of used sports cars can be explained by age.

Now, let’s look at the \(R^2\) for our other two models.

glance(m_main) %>%
  pull(r.squared)
## [1] 0.6071375
glance(m_int) %>%
  pull(r.squared)
## [1] 0.6677881

We can write explained variation using the following ratio of sums of squares:

\[R^2 = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \right)\]

where \(SS_{Error}\) is the sum of squared residuals and \(SS_{Total}\) is the total sum of squares.
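
As a sketch of this formula applied to the interaction model, we can compute \(R^2\) from the residuals directly and compare it with the glance() output above.

# R^2 by hand for the interaction model
ss_error <- sum(resid(m_int)^2)
ss_total <- sum((sports_car_prices$price - mean(sports_car_prices$price))^2)
1 - ss_error / ss_total  # should match glance(m_int) %>% pull(r.squared)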

Adjusted \(R^2\)

\[R^2_{adj} = 1 - \left(\frac{ SS_{Error} }{ SS_{Total} } \times \frac{n - 1}{n - k - 1} \right)\]

Here \(n\) is the number of observations and \(k\) is the number of predictors in the model.
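
Continuing the sketch above (reusing ss_error and ss_total), the adjustment for the interaction model uses \(n = 60\) and \(k = 3\):

# Adjusted R^2 by hand for the interaction model
n <- nrow(sports_car_prices)  # 60 observations
k <- 3                        # age, carPorsche, age:carPorsche
1 - (ss_error / ss_total) * (n - 1) / (n - k - 1)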

Now let's get the adjusted \(R^2\) for both models with glance().

glance(m_main) %>%
  pull(adj.r.squared)
## [1] 0.5933529
glance(m_int) %>%
  pull(adj.r.squared)
## [1] 0.649991

Regression diagnostics and conditions

Conditions

Before using the model for inference, we check four conditions on the residuals: linearity, independence, equal variance, and normality.

Tidy model output

tidy(m_int) %>%
  select(term, estimate) %>%
  mutate(estimate = round(estimate, 3))
## # A tibble: 4 x 2
##   term           estimate
##   <chr>             <dbl>
## 1 (Intercept)      56988.
## 2 age              -5040.
## 3 carPorsche        6387.
## 4 age:carPorsche    2969.

Augment data with model results

m_int_aug <- augment(m_int)
glimpse(m_int_aug)
## Rows: 60
## Columns: 9
## $ price      <int> 69400, 56900, 49900, 47400, 42900, 36900, 83000, 72900, 699…
## $ age        <int> 3, 3, 2, 4, 4, 6, 0, 0, 2, 0, 2, 2, 4, 3, 10, 11, 4, 4, 10,…
## $ car        <chr> "Porsche", "Porsche", "Porsche", "Porsche", "Porsche", "Por…
## $ .fitted    <dbl> 57162.92, 57162.92, 59233.63, 55092.22, 55092.22, 50950.81,…
## $ .resid     <dbl> 12237.0778, -262.9222, -9333.6270, -7692.2173, -12192.2173,…
## $ .std.resid <dbl> 1.13837111, -0.02445870, -0.87165630, -0.71356517, -1.13100…
## $ .hat       <dbl> 0.04358564, 0.04358564, 0.05099453, 0.03817915, 0.03817915,…
## $ .sigma     <dbl> 10962.24, 11091.27, 11015.83, 11040.79, 10963.92, 10922.64,…
## $ .cooksd    <dbl> 1.476403e-02, 6.815599e-06, 1.020670e-02, 5.052884e-03, 1.2…

We will use the fitted values and residuals to check the conditions by constructing diagnostic plots.

Linearity

ggplot(
  data = sports_car_prices,
  aes(
    x = age,
    y = price
  )
) +
  geom_point()

Independence

Residuals in order of collection

ggplot(
  data = m_int_aug,
  aes(
    x = 1:nrow(m_int_aug),
    y = .resid
  )
) +
  geom_point() +
  labs(x = "Index", y = "Residual")

Equal variance and linearity

Residuals vs fitted plot

ggplot(m_int_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linewidth = 2, color = "red", linetype = 2) +
  labs(x = "Predicted Price", y = "Residuals")

Normality

Histogram of residuals

ggplot(m_int_aug, mapping = aes(x = .resid)) +
  geom_histogram(bins = 15) +
  labs(x = "Residuals")

Normal Q-Q Plot

ggplot(m_int_aug, mapping = aes(sample = .resid)) +
  stat_qq() + 
  stat_qq_line()

Multicollinearity

You don't want the predictors in a multiple regression model to be too highly correlated with each other. When they are, you have multicollinearity.

One way to diagnose multicollinearity is with variance inflation factors (VIFs). There's no universal threshold, but a VIF above 5 is sometimes used as a rule of thumb.

Let's see if we have multicollinearity in our first model.

vif(m_main)
##     age     car 
## 1.01964 1.01964
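
To see what vif() is reporting, here is a sketch of the VIF for age computed from its definition, \(VIF_j = 1 / (1 - R^2_j)\), where \(R^2_j\) comes from regressing predictor \(j\) on the other predictors.

# VIF for age by hand: regress age on the other predictor
r2_age <- summary(lm(age ~ car, data = sports_car_prices))$r.squared
1 / (1 - r2_age)  # should match vif(m_main)["age"]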

Now, let's check the interaction model.

vif(m_int)
##       age       car   age:car 
##  7.268869  3.847798 11.244300

These large VIFs are expected rather than alarming: an interaction term is, by construction, highly correlated with its component main effects, so VIFs are less informative for models with interactions.

Practice

  1. Run and interpret a multiple regression with both age and mileage as predictors. Are both of these statistically significant predictors of the price of a car?
model_age_mileage <- lm(price ~ age + mileage, data = sports_car_prices)
tidy(model_age_mileage)
## # A tibble: 3 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 62950.     3176.       19.8   3.03e-27
## 2 age           516.      601.        0.859 3.94e- 1
## 3 mileage        -0.695     0.122    -5.68  4.75e- 7
  2. Find and interpret the adjusted \(R^2\) for this model.
glance(model_age_mileage) %>%
  pull(adj.r.squared)
## [1] 0.5167014
  3. Examine the extent to which there is multicollinearity in this model.
vif(model_age_mileage)
##      age  mileage 
## 2.562603 2.562603

Now, turn to the dataset in airquality.csv. This file contains daily air quality measurements in New York from May to September 1973, collected by the New York State Department of Conservation and the National Weather Service (Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. (1983). Graphical Methods for Data Analysis. Belmont, CA: Wadsworth).

airquality <- read.csv("data/airquality.csv")
  4. Run and interpret a model with ozone in parts per billion as the response variable and solar radiation, wind, and temperature as the explanatory variables.
model_ozone <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
tidy(model_ozone)
## # A tibble: 4 x 5
##   term        estimate std.error statistic       p.value
##   <chr>          <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept) -64.3      23.1        -2.79 0.00623      
## 2 Solar.R       0.0598    0.0232      2.58 0.0112       
## 3 Wind         -3.33      0.654      -5.09 0.00000152   
## 4 Temp          1.65      0.254       6.52 0.00000000242
  5. Use augment() to obtain the residuals and assess independence with a plot.
model_ozone_aug <- augment(model_ozone)

ggplot(
  data = model_ozone_aug,
  aes(x = 1:nrow(model_ozone_aug),
      y = .resid)
) + 
  geom_point() + 
  labs(x = "Index", y = "Residuals")

  6. Next, make a plot comparing predicted values to residuals to assess the equal variance and linearity conditions.
ggplot(model_ozone_aug,
       mapping = aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.50) +
  geom_hline(yintercept = 0, linewidth = 2, color = "red", linetype = 2)

  7. Finally, make a histogram and a Q-Q plot to assess the normality condition.
ggplot(model_ozone_aug, mapping = aes(x = .resid)) +
  geom_histogram(bins = 15) +
  labs(x = "Residuals")

ggplot(model_ozone_aug, mapping = aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()