Main ideas
- Introduce statistical models and learn what they are used for
- Learn the language of linear models
In addition to the tidyverse, we will use broom, which tidies up regression output.
library(tidyverse)
library(broom)
The data we will examine is in sportscars.csv, which contains prices for Porsche and Jaguar cars for sale on cars.com, taken from the Stat2Data package.

- car: car make (Jaguar or Porsche)
- price: car price in USD
- age: age of the car in years
- mileage: previous miles driven

sports_car_prices <- read_csv("data/sportscars.csv")
glimpse(sports_car_prices)
## Rows: 60
## Columns: 4
## $ mileage <dbl> 21500, 43000, 19900, 36000, 44000, 49800, 1300, 670, 13400, 97…
## $ price <dbl> 69400, 56900, 49900, 47400, 42900, 36900, 83000, 72900, 69900,…
## $ age <dbl> 3, 3, 2, 4, 4, 6, 0, 0, 2, 0, 2, 2, 4, 3, 10, 11, 4, 4, 10, 3,…
## $ car <chr> "Porsche", "Porsche", "Porsche", "Porsche", "Porsche", "Porsch…
Let’s do a brief exploratory data analysis, focusing on price and mileage.
ggplot(
sports_car_prices,
aes(x = mileage)
) +
geom_histogram(bins = 20) +
labs(x = "Mileage", y = "Count", title = "Distribution of Miles Driven")
ggplot(
sports_car_prices,
aes(x = price)
) +
geom_histogram(bins = 20) +
labs(x = "Price", y = "Count", title = "Distribution of Price")
ggplot(
sports_car_prices,
aes(x = mileage, y = price)
) +
geom_point() +
labs(
x = "Mileage", y = "Price", title = "Price versus Mileage",
subtitle = "sports cars from cars.com"
)
What do you notice about price, mileage, and the relationship between price and mileage?
Unsurprisingly, increases in mileage of used sports cars are associated with decreases in price.
Models as functions. We can represent relationships between variables as functions. A mathematical function is a relationship between one or more inputs and an output: plug in the inputs and receive back the output. For example:
\[y = 7 +3 x\]
x <- 5
y <- 7 + 3 * x
y
## [1] 22
x <- 12
y <- 7 + 3 * x
y
## [1] 43
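We can package the same rule as an R function, which makes the plug-in-inputs, get-back-output view explicit. A small illustration (the function name f is ours, not part of the original code):

# the model y = 7 + 3x as a function: input x, output y
f <- function(x) 7 + 3 * x
f(5)
## [1] 22
f(12)
## [1] 43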
ggplot(
sports_car_prices,
aes(x = mileage, y = price)
) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "Mileage", y = "Price", title = "Price versus Age",
subtitle = "sports cars from cars.com"
)
What is method = "lm"
doing in the code chunk above?
Adding a line to the scatterplot.
- Response variable: Variable whose behavior or variation you are trying to understand, on the y-axis. Also called the dependent variable.
- Explanatory variable: Other variables that you want to use to explain the variation in the response, on the x-axis. Also called independent variables, predictors, or features.
- Predicted value: Output of the model function.
- Residuals: The differences between observed and predicted values; they show how far each case is from its predicted value.
Question: What does a negative residual mean? Which cars have negative residuals, those above or below the line?
Does the relationship between mileage and price vary depending on the car type?
ggplot(
data = sports_car_prices,
aes(x = mileage, y = price, color = car)
) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
We will talk about multiple explanatory variables in the coming weeks.
Statistics is the explanation of variation in the context of what remains unexplained.
The scatter around the line suggests that there are other factors accounting for variability in car sales price or that randomness plays a big role.
Adding meaningful explanatory variables can sometimes reduce this variation.
- Understand relationships: Characterize the relationship between y and x via slopes for numerical explanatory variables or differences for categorical explanatory variables.
- Assess differences: Employ our inference tools in a linear regression context (later).
- Make predictions: Plug in x, get the predicted y.
Here, we will be using the broom package, which follows tidyverse principles, to tidy up our regression output:

- tidy(): Constructs a tidy data frame summarizing the model’s statistical findings
- glance(): Constructs a concise one-row summary of the model
- augment(): Adds columns (e.g. predictions, residuals) to the original data that was modeled

What is the effect of an additional mile driven on a car’s sales price?
model_mileage <- lm(price ~ mileage, data = sports_car_prices)
tidy(model_mileage)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 62928. 3169. 19.9 1.50e-27
## 2 mileage -0.613 0.0762 -8.04 5.26e-11
\[\widehat{Price} = 62928 - 0.613~Mileage\]
Slope: For every additional mile a car is driven, the price is expected to decrease, on average, by $0.613. Alternatively, and more usefully, for every additional ten thousand miles a car is driven, the price is expected to decrease, on average, by $6,130.
Intercept: Cars that have a mileage of 0 are expected to have a sales price of 62928 dollars.
The interpretation of the intercept doesn’t make a lot of sense.
We’re interested in the \(\beta_0\) (population parameter for the intercept) and the \(\beta_1\) (population parameter for the slope) in the following model:
\[ {y} = \beta_0 + \beta_1~x + \epsilon \]
Unfortunately, we can’t observe these population parameters directly, so we use the sample data to estimate them:
\[ \hat{y} = b_0 + b_1~x \]
The regression line minimizes the sum of squared residuals.
Residuals: the data minus the fit, \(e_i = y_i - \hat{y}_i\)
The regression line minimizes \(\sum_{i = 1}^n e_i^2\).
Equivalently, minimizing \(\sum_{i = 1}^n [y_i - (b_0 + b_1~x_i)]^2\).
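As a concrete check, we can compute the residual sum of squares for the mileage model fit above; resid() extracts the \(e_i\) from an lm object:

# residual sum of squares for model_mileage: the sum of e_i^2
e <- resid(model_mileage)
sum(e^2)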
Question: Why do we minimize the squares of the residuals?
Let’s visualize the residuals on the plot below.
ggplot(
sports_car_prices,
aes(x = mileage, y = price)
) +
geom_point() +
labs(
x = "Mileage", y = "Price", title = "Price versus Age",
subtitle = "sports cars from cars.com"
)
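The chunk above plots only the raw points. Here is a minimal sketch of one way to overlay the residuals, using augment() from broom to obtain the fitted values (.fitted) and drawing each residual as a vertical segment:

augment(model_mileage) %>%
  ggplot(aes(x = mileage, y = price)) +
  geom_smooth(method = "lm", se = FALSE) +
  # each residual is the vertical distance from a point to the line
  geom_segment(aes(xend = mileage, yend = .fitted), color = "gray") +
  geom_point()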
Check out the applet here to see this process in action.
A few properties of the least squares line:

- The estimate for the slope, \(b_1\), has the same sign as the correlation between the two variables.
- The regression line goes through the center of mass point, the coordinates corresponding to average \(x\) and average \(y\): \((\bar{x}, \bar{y})\).
- The sum of the residuals is zero, \(\sum_{i = 1}^n e_i = 0\).
- The residuals and \(x\) values are uncorrelated.
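These properties are easy to verify numerically for model_mileage; a quick sketch (any tiny nonzero values you see are floating-point error):

e <- resid(model_mileage)
sum(e)                             # essentially zero
cor(e, sports_car_prices$mileage)  # essentially zero
# the line passes through (mean(mileage), mean(price))
predict(model_mileage, newdata = tibble(mileage = mean(sports_car_prices$mileage)))
mean(sports_car_prices$price)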
Let’s say we want to see if Jaguars have a higher selling price than Porsches. Our car variable has two categories: Porsche and Jaguar. Let’s run a model using car to predict price.
model_car <- lm(price ~ car, data = sports_car_prices)
tidy(model_car)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 31957. 2954. 10.8 1.56e-15
## 2 carPorsche 18580. 4178. 4.45 4.00e- 5
\[\widehat{Price} = 31957 + 18580~Car\]
Plug in 1 (Porsche):
\[\widehat{Price} = 31957 + 18580 \times 1 = 50537\]
Plug in 0 (Jaguar):
\[\widehat{Price} = 31957 + 18580 \times 0 = 31957\]
For a categorical variable with many levels, the levels are coded as dummy variables. Each coefficient describes the expected difference in the response comparing a particular level to the baseline level (the level coded as 0).
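A helpful way to see this: with a single categorical predictor, the fitted values are just the group means, so the intercept (31957) should match the mean Jaguar price, and the intercept plus the slope (31957 + 18580 = 50537) should match the mean Porsche price. A quick check:

# group means should match the fitted values from model_car
sports_car_prices %>%
  group_by(car) %>%
  summarize(mean_price = mean(price))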
Use model_mileage to make predictions.
On average, how expensive are cars with 50,000 miles? Examine the plot to make a prediction.
tidy(model_mileage)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 62928. 3169. 19.9 1.50e-27
## 2 mileage -0.613 0.0762 -8.04 5.26e-11
62928 + -0.613 * 50000
## [1] 32278
On average, we expect cars that have 50,000 miles to sell for $32,278. Note that this is only what we expect on average; we shouldn’t be surprised by some variability around this prediction. We’ll discuss this later.
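Rather than copying coefficients by hand, we can get the same prediction with predict(); a minimal sketch:

# newdata needs a column named after the predictor
predict(model_mileage, newdata = tibble(mileage = 50000))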
On average, how expensive are cars with 100,000 miles? Examine the plot to make a prediction.
62928 + -0.613 * 100000
## [1] 1628
This prediction is an extrapolation well beyond the range of the data we have, and there is no guarantee that the linear relationship continues. This prediction is not reliable.
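One quick way to see how far 100,000 miles sits outside the data is to check the observed range of the predictor:

# predictions far beyond this range are extrapolations
range(sports_car_prices$mileage)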
- tidy(): Constructs a data frame that summarizes the model’s statistical findings. We’ve talked about coefficient estimates and standard errors, but it also displays test statistics and p-values (more on these in a few weeks!).
- augment(): Adds columns to the original data that was modeled. This includes predictions and residuals.
- glance(): Constructs a concise one-row summary of the model, computed once for the entire model.

# untidy
model_mileage
summary(model_mileage)
# tidy
tidy(model_mileage)
tidy(model_mileage) %>%
select(term, estimate) %>%
mutate(estimate = round(estimate, 3))
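For completeness, here is what the other two broom verbs look like applied to this model (output not shown):

# one-row model summary: R-squared, sigma, and more
glance(model_mileage)
# the original data plus columns such as .fitted and .resid
augment(model_mileage)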
Use ggplot() to visualize the relationship between price and age and describe the relationship.

ggplot(
sports_car_prices,
aes(x = age, y = price)
) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Fit a linear model with price as the response and age as a predictor. Report the output in tidy format and write out the model equation.

model_age <- lm(price ~ age, data = sports_car_prices)
tidy(model_age)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 53246. 3322. 16.0 5.70e-23
## 2 age -2149. 466. -4.62 2.22e- 5
\[\widehat{Price} = 53246 - 2149~\text{Age}\]
For every one year increase in a car’s age, the price is expected to decrease, on average, by $2,149.
Cars with an age of 0 are expected to cost, on average, $53,246.
ggplot(
sports_car_prices,
aes(x = age, y = price, color = car)
) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)