Main ideas
- Review and expand on regression ideas.
- Introduce inference for linear regression.
library(tidyverse)
library(broom)
We’ll continue our work with the sports_car_prices data.
sports_car_prices <- read_csv("data/sportscars.csv")
Response variable: Variable whose behavior or variation you are trying to understand, on the y-axis. Also called the dependent variable.
Explanatory variable: Other variables that you want to use to explain the variation in the response, on the x-axis. Also called independent variables, predictors, or features.
Predicted value: Output of the model function.
Residuals: Show how far each case is from its predicted value.
We’re interested in \(\beta_0\) (population parameter for the intercept) and \(\beta_1\) (population parameter for the slope) in the following model:
\[ {y} = \beta_0 + \beta_1~x + \epsilon \]
Unfortunately, we can’t get these values, so we use the sample data to estimate them.
\[ \hat{y} = b_0 + b_1~x \]
The regression line minimizes the sum of squared residuals.
Residuals: Residuals are the observed values minus the fitted values, i.e., \(e_i = y_i - \hat{y}_i\).
The regression line minimizes \(\sum_{i = 1}^n e_i^2\).
Equivalently, minimizing \(\sum_{i = 1}^n [y_i - (b_0 + b_1~x_i)]^2\).
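As a quick sketch, we can compare the sum of squared residuals of the least-squares line to a couple of nearby candidate lines (the sse() helper below is just for illustration):
# Sum of squared residuals for a candidate line with intercept b0 and slope b1
sse <- function(b0, b1) {
  sum((sports_car_prices$price - (b0 + b1 * sports_car_prices$mileage))^2)
}
b <- coef(lm(price ~ mileage, data = sports_car_prices))
c(
  least_squares    = sse(b[1], b[2]),        # the fitted line
  steeper_slope    = sse(b[1], b[2] - 0.1),  # perturb the slope
  higher_intercept = sse(b[1] + 5000, b[2])  # perturb the intercept
)
The least-squares line gives the smallest of the three values.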
Let’s return to our mileage model from Tuesday.
model_mileage <- lm(price ~ mileage, data = sports_car_prices)
tidy(model_mileage)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 62928.    3169.        19.9  1.50e-27
## 2 mileage        -0.613    0.0762    -8.04 5.26e-11
\[\widehat{Price} = 62928 - 0.613~Mileage\]
The strength of the fit of a linear model is commonly evaluated using \(R^2\).
It tells us what percentage of the variability in the response variable is explained by the model. The remainder of the variability is unexplained.
\(R^2\) is sometimes called the coefficient of determination.
\(R^2\) can never decrease when we add variables to the model, even if the new variables are unrelated to the response.
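For instance, as a minimal sketch, adding a completely made-up noise predictor does not lower \(R^2\):
set.seed(20)
with_noise <- sports_car_prices %>%
  mutate(noise = rnorm(n()))  # purely random, unrelated to price
glance(lm(price ~ mileage, data = with_noise))$r.squared
glance(lm(price ~ mileage + noise, data = with_noise))$r.squared
The second value is at least as large as the first, even though noise carries no real information about price.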
glance(model_mileage)
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.527         0.519 12887.      64.6 5.26e-11     1  -652. 1310. 1316.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(model_mileage) %>%
select(r.squared)
## # A tibble: 1 x 1
##   r.squared
##       <dbl>
## 1     0.527
Approximately 52.7% of the variability in the price of used sports cars can be explained by mileage.
ggplot(data = sports_car_prices,
aes(x = mileage, y = price)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
sports_car_prices %>%
summarize(cor = cor(price, mileage)) %>%
mutate(r2 = cor^2)
## # A tibble: 1 x 2
##      cor    r2
##    <dbl> <dbl>
## 1 -0.726 0.527
We can write explained variation using the following ratio of sums of squares:
\[ R^2 = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \right) \]
where \(SS_{Error}\) is the sum of squared residuals and \(SS_{Total} = \sum_{i = 1}^n (y_i - \bar{y})^2\) is the total variability in the response variable. But remember, adding explanatory variables can never decrease \(R^2\).
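As a sketch of this formula in action, we can rebuild \(R^2\) for the mileage model from the residuals, using broom’s augment() to get fitted values and residuals:
model_aug <- augment(model_mileage)   # adds .fitted and .resid columns
ss_error <- sum(model_aug$.resid^2)                            # sum of squared residuals
ss_total <- sum((model_aug$price - mean(model_aug$price))^2)   # total variability in price
1 - ss_error / ss_total
This matches the r.squared value of 0.527 reported by glance().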
Next class, when we talk about multiple regression, we will discuss another measure of model fit called adjusted \(R^2\) that is preferable for models with many explanatory variables.
Recall the population model and the estimated regression line:
\[ {y} = \beta_0 + \beta_1~x + \epsilon \]
\[ \hat{y} = b_0 + b_1~x \]
As with other sample statistics (mean, proportion, etc.), there is variability in our estimates of the slope and intercept.
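One way to see that variability (a resampling sketch, not the inferential approach we use below) is to refit the model on bootstrap resamples of the data and watch the slope estimate move around:
set.seed(1234)
boot_slopes <- map_dbl(1:1000, ~ {
  boot_sample <- slice_sample(sports_car_prices,
                              n = nrow(sports_car_prices),
                              replace = TRUE)
  coef(lm(price ~ mileage, data = boot_sample))[["mileage"]]
})
sd(boot_slopes)   # roughly comparable to the std.error reported by tidy()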
Do we have convincing evidence that the true linear model has a non-zero slope?
What is a confidence interval for \(\beta_1\)?
Let’s return to our cars model:
ggplot(
sports_car_prices,
aes(x = mileage, y = price)
) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "Mileage", y = "Price", title = "Price versus Mileage",
subtitle = "sports cars from cars.com"
)
\[\text{point estimate} \pm \text{critical value} \times \text{SE}\]
\[b_1 \pm t^*_{n-2} \times SE_{b_1}\]
tidy(model_mileage)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 62928.    3169.        19.9  1.50e-27
## 2 mileage        -0.613    0.0762    -8.04 5.26e-11
A 95% confidence interval for \(\beta_1\) is given by
tidy(model_mileage, conf.int = TRUE, conf.level = 0.95) %>%
filter(term == "mileage") %>%
select(starts_with("conf"))
## # A tibble: 1 x 2
##   conf.low conf.high
##      <dbl>     <dbl>
## 1   -0.765    -0.460
Alternatively, you can use the model output directly to construct a 95% confidence interval.
# 95% CI by hand: estimate +/- t* x SE, with n - 2 = 58 degrees of freedom
-0.613 + c(-1, 1) * qt(0.975, 58) * 0.0762
## [1] -0.7655309 -0.4604691
But the tidy() / confint_tidy() methods from broom are preferred.
We are 95% confident that for every additional ten thousand miles a used sports car has been driven, the price is expected to decrease, on average, between $4,600 and $7,650.
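The slope and its interval are in dollars per mile, so the dollar amounts in that interpretation come from rescaling the interval endpoints by 10,000 miles:
c(-0.765, -0.460) * 10000   # CI endpoints, per ten thousand miles driven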
Is there convincing evidence, based on our sample data, that mileage is associated with price?
We can set this up as a hypothesis test, with the hypotheses below.
\(H_o: \beta_1 = 0\). There is no linear relationship between price and mileage; the slope is 0.
\(H_a: \beta_1 \neq 0\). There is a linear relationship between price and mileage; the slope is not 0.
We only reject \(H_o\) in favor of \(H_a\) if the data provide strong evidence that the true slope parameter is different from zero.
\[T = \frac{b_1 - 0}{SE_{b_1}} \sim t_{n-2}\]
tidy(model_mileage)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 62928.    3169.        19.9  1.50e-27
## 2 mileage        -0.613    0.0762    -8.04 5.26e-11
Each \(p\)-value in the tidy() output is the two-sided \(p\)-value associated with the observed statistic and the hypotheses \(H_o: \beta_k = 0\) and \(H_a: \beta_k \neq 0\).
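As a sanity check, we can reproduce the test statistic and two-sided \(p\)-value for the mileage slope from the estimate and standard error, using \(n - 2 = 58\) degrees of freedom:
t_stat <- -0.613 / 0.0762                         # (b1 - 0) / SE_b1
t_stat
2 * pt(abs(t_stat), df = 58, lower.tail = FALSE)  # two-sided p-value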
Based on the above result, we reject \(H_o\) in favor of \(H_a\). We have enough evidence to suggest that there is an association between price and mileage.
A silviculturist is interested in finding the volume of black cherry trees in order to determine timber yield. This is difficult to measure, so the researcher uses regression to predict volume (cubic feet) based on girth (inches) for a random sample of 31 trees that have been cut down.
Girth is easy to measure, so the hope is that girth gives a reasonable prediction of volume. Data taken from OpenIntro Statistics (4th edition).
trees <- read_csv("data/trees.csv")
Plot Volume versus Girth.
ggplot(trees,
aes(x = Girth, y = Volume)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Girth (in)", y = "Volume (cubic feet)",
title = "Volume versus girth",
subtitle = "for 31 cherry trees")
Fit a linear model of Volume based on Girth and display the model output in tidy format.
model_girth <- lm(Volume ~ Girth, data = trees)
tidy(model_girth)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -36.9      3.37      -11.0 7.62e-12
## 2 Girth           5.07     0.247      20.5 8.64e-19
\[\widehat{Volume} = -36.9 + 5.07~Girth\]
tidy(model_girth, conf.int = TRUE, conf.level = 0.95)
## # A tibble: 2 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   -36.9      3.37      -11.0 7.62e-12   -43.8     -30.1
## 2 Girth           5.07     0.247      20.5 8.64e-19     4.56      5.57
We are 95% confident that for every one inch increase in girth, volume is expected to increase, on average, between 4.56 and 5.57 cubic feet.
We reject the null hypothesis that \(\beta_1 = 0\) in favor of the alternative that \(\beta_1 \neq 0\). We have enough evidence to suggest that there is an association between girth and volume.
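Since the silviculturist’s goal is prediction, here is a sketch of how the fitted model could be used for a new tree (the 16-inch girth is just an illustrative value):
new_tree <- tibble(Girth = 16)
predict(model_girth, newdata = new_tree)                           # predicted volume, cubic feet
predict(model_girth, newdata = new_tree, interval = "prediction")  # with a 95% prediction interval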