Main ideas
- Review and expand on regression ideas.
- Introduce inference for linear regression.
library(tidyverse)
library(broom)
We’ll continue our work with the sports_car_prices data.
sports_car_prices <- read_csv("data/sportscars.csv")
Response variable: Variable whose behavior or variation you are trying to understand, on the y-axis. Also called the dependent variable.
Explanatory variable: Other variables that you want to use to explain the variation in the response, on the x-axis. Also called independent variables, predictors, or features.
Predicted value: Output of the model function.
Residuals: Show how far each case is from its predicted value.
We’re interested in \(\beta_0\) (population parameter for the intercept) and \(\beta_1\) (population parameter for the slope) in the following model:
\[ {y} = \beta_0 + \beta_1~x + \epsilon \]
Unfortunately, we can’t get these values, so we use the sample data to estimate them.
\[ \hat{y} = b_0 + b_1~x \]
The regression line minimizes the sum of squared residuals.
Residuals: Residuals are the observed values minus the fitted values, i.e., \(e_i = y_i - \hat{y}_i\).
The regression line minimizes \(\sum_{i = 1}^n e_i^2\).
Equivalently, minimizing \(\sum_{i = 1}^n [y_i - (b_0 + b_1~x_i)]^2\).
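As a quick sketch, we can compare the sum of squared residuals of the least-squares line to a couple of nearby candidate lines (the sse() helper below is just for illustration):
# Sum of squared residuals for a candidate line with intercept b0 and slope b1
sse <- function(b0, b1) {
  sum((sports_car_prices$price - (b0 + b1 * sports_car_prices$mileage))^2)
}
b <- coef(lm(price ~ mileage, data = sports_car_prices))
c(
  least_squares    = sse(b[1], b[2]),        # the fitted line
  steeper_slope    = sse(b[1], b[2] - 0.1),  # perturb the slope
  higher_intercept = sse(b[1] + 5000, b[2])  # perturb the intercept
)
The least-squares line gives the smallest of the three values.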
Let’s return to our mileage model from Tuesday.
model_mileage <- lm(price ~ mileage, data = sports_car_prices)
tidy(model_mileage)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 62928.    3169.        19.9  1.50e-27
## 2 mileage        -0.613    0.0762    -8.04 5.26e-11
\[\widehat{Price} = 62928 - 0.613~Mileage\]
The strength of the fit of a linear model is commonly evaluated using \(R^2\).
It tells us what percentage of the variability in the response variable is explained by the model. The remainder of the variability is unexplained.
\(R^2\) is sometimes called the coefficient of determination.
\(R^2\) can never decrease when we add variables to the model, even if the new variables are unrelated to the response.
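For instance, as a minimal sketch, adding a completely made-up noise predictor does not lower \(R^2\):
set.seed(20)
with_noise <- sports_car_prices %>%
  mutate(noise = rnorm(n()))  # purely random, unrelated to price
glance(lm(price ~ mileage, data = with_noise))$r.squared
glance(lm(price ~ mileage + noise, data = with_noise))$r.squared
The second value is at least as large as the first, even though noise carries no real information about price.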
glance(model_mileage)
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.527         0.519 12887.      64.6 5.26e-11     1  -652. 1310. 1316.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(model_mileage) %>%
select(r.squared)
## # A tibble: 1 x 1
##   r.squared
##       <dbl>
## 1     0.527
Approximately 52.7% of the variability in the price of used sports cars can be explained by mileage.
ggplot(data = sports_car_prices,
aes(x = mileage, y = price)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
sports_car_prices %>%
summarize(cor = cor(price, mileage)) %>%
mutate(r2 = cor^2)
## # A tibble: 1 x 2
##      cor    r2
##    <dbl> <dbl>
## 1 -0.726 0.527
We can write explained variation using the following ratio of sums of squares:
\[ R^2 = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \right) \]
where \(SS_{Error}\) is the sum of squared residuals and \(SS_{Total} = \sum_{i = 1}^n (y_i - \bar{y})^2\) is the total variability in the response variable. But remember, adding explanatory variables can never decrease \(R^2\).
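As a sketch of this formula in action, we can rebuild \(R^2\) for the mileage model from the residuals, using broom’s augment() to get fitted values and residuals:
model_aug <- augment(model_mileage)   # adds .fitted and .resid columns
ss_error <- sum(model_aug$.resid^2)                            # sum of squared residuals
ss_total <- sum((model_aug$price - mean(model_aug$price))^2)   # total variability in price
1 - ss_error / ss_total
This matches the r.squared value of 0.527 reported by glance().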
Next class, when we talk about multiple regression, we will discuss another measure of model fit called adjusted \(R^2\) that is preferable for models with many explanatory variables.
Recall the population model and the estimated regression line:
\[ {y} = \beta_0 + \beta_1~x + \epsilon \]
\[ \hat{y} = b_0 + b_1~x \]
As with other sample statistics (mean, proportion, etc.), there is variability in our estimates of the slope and intercept.
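One way to see that variability (a resampling sketch, not the inferential approach we use below) is to refit the model on bootstrap resamples of the data and watch the slope estimate move around:
set.seed(1234)
boot_slopes <- map_dbl(1:1000, ~ {
  boot_sample <- slice_sample(sports_car_prices,
                              n = nrow(sports_car_prices),
                              replace = TRUE)
  coef(lm(price ~ mileage, data = boot_sample))[["mileage"]]
})
sd(boot_slopes)   # roughly comparable to the std.error reported by tidy()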
Do we have convincing evidence that the true linear model has a non-zero slope?
What is a confidence interval for \(\beta_1\)?
Let’s return to our cars model:
ggplot(
sports_car_prices,
aes(x = mileage, y = price)
) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "Mileage", y = "Price", title = "Price versus Mileage",
subtitle = "sports cars from cars.com"
)
\[\text{point estimate} \pm \text{critical value} \times \text{SE}\]
\[b_1 \pm t^*_{n-2} \times SE_{b_1}\]
tidy(model_mileage)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 62928.    3169.        19.9  1.50e-27
## 2 mileage        -0.613    0.0762    -8.04 5.26e-11
A 95% confidence interval for \(\beta_1\) is given by
tidy(model_mileage, conf.int = TRUE, conf.level = 0.95) %>%
filter(term == "mileage") %>%
select(starts_with("conf"))
## # A tibble: 1 x 2
##   conf.low conf.high
##      <dbl>     <dbl>
## 1   -0.765    -0.460
Alternatively, you can use the model output directly to construct a 95% confidence interval.
# 95% CI by hand: estimate +/- t* x SE, with n - 2 = 58 degrees of freedom
-0.613 + c(-1, 1) * qt(0.975, 58) * 0.0762
## [1] -0.7655309 -0.4604691
But the tidy() / confint_tidy() methods from broom are preferred.
We are 95% confident that for every additional ten thousand miles a used sports car has been driven, the price is expected to decrease, on average, between $4,600 and $7,650.
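The slope and its interval are in dollars per mile, so the dollar amounts in that interpretation come from rescaling the interval endpoints by 10,000 miles:
c(-0.765, -0.460) * 10000   # CI endpoints, per ten thousand miles driven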
Is there convincing evidence, based on our sample data, that mileage is associated with price?
We can set this up as a hypothesis test, with the hypotheses below.
\(H_o: \beta_1 = 0\). There is no linear relationship between price and mileage; the slope is 0.
\(H_a: \beta_1 \neq 0\). There is a linear relationship between price and mileage; the slope is not 0.
We only reject \(H_o\) in favor of \(H_a\) if the data provide strong evidence that the true slope parameter is different from zero.
\[T = \frac{b_1 - 0}{SE_{b_1}} \sim t_{n-2}\]
tidy(model_mileage)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 62928.    3169.        19.9  1.50e-27
## 2 mileage        -0.613    0.0762    -8.04 5.26e-11
Each \(p\)-value in the tidy() output is the two-sided \(p\)-value associated with the observed statistic and the hypotheses \(H_o: \beta_k = 0\) and \(H_a: \beta_k \neq 0\).
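As a sanity check, we can reproduce the test statistic and two-sided \(p\)-value for the mileage slope from the estimate and standard error, using \(n - 2 = 58\) degrees of freedom:
t_stat <- -0.613 / 0.0762                         # (b1 - 0) / SE_b1
t_stat
2 * pt(abs(t_stat), df = 58, lower.tail = FALSE)  # two-sided p-value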
Based on the above result, we reject \(H_o\) in favor of \(H_a\). We have enough evidence to suggest that there is an association between price and mileage.
A silviculturist is interested in finding the volume of black cherry trees in order to determine timber yield. This is difficult to measure, so the researcher uses regression to predict volume (cubic feet) based on girth (inches) for a random sample of 31 trees that have been cut down.
Girth is easy to measure, so the hope is that girth gives a reasonable prediction of volume. Data taken from OpenIntro Statistics (4th edition).
trees <- read_csv("data/trees.csv")
Plot Volume versus Girth.
ggplot(trees,
aes(x = Girth, y = Volume)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Girth (in)", y = "Volume (cubic feet)",
title = "Volume versus girth",
subtitle = "for 31 cherry trees")
Fit a linear model of Volume based on Girth and display the model output in tidy format.
model_girth <- lm(Volume ~ Girth, data = trees)
tidy(model_girth)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -36.9      3.37      -11.0 7.62e-12
## 2 Girth           5.07     0.247      20.5 8.64e-19
\[\widehat{Volume} = -36.9 + 5.07~Girth\]
tidy(model_girth, conf.int = TRUE, conf.level = 0.95)
## # A tibble: 2 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   -36.9      3.37      -11.0 7.62e-12   -43.8     -30.1
## 2 Girth           5.07     0.247      20.5 8.64e-19     4.56      5.57
We are 95% confident that for every one inch increase in girth, volume is expected to increase, on average, between 4.56 and 5.57 cubic feet.
We reject the null hypothesis that \(\beta_1 = 0\) in favor of the alternative that \(\beta_1 \neq 0\). We have enough evidence to suggest that there is an association between girth and volume.
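Since the silviculturist’s goal is prediction, here is a sketch of how the fitted model could be used for a new tree (the 16-inch girth is just an illustrative value):
new_tree <- tibble(Girth = 16)
predict(model_girth, newdata = new_tree)                           # predicted volume, cubic feet
predict(model_girth, newdata = new_tree, interval = "prediction")  # with a 95% prediction interval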