Main ideas
- Introduce statistical models and learn what they are used for
- Learn the language of linear models
In addition to the tidyverse, we will use broom, which tidies up regression output.
library(tidyverse)
library(broom)
The data we will examine is in sportscars.csv, which contains prices for Porsche and Jaguar cars for sale on cars.com, taken from the Stat2Data package.

- car: car make (Jaguar or Porsche)
- price: car price in USD
- age: age of the car in years
- mileage: previous miles driven

sports_car_prices <- read_csv("data/sportscars.csv")
glimpse(sports_car_prices)
## Rows: 60
## Columns: 4
## $ mileage <dbl> 21500, 43000, 19900, 36000, 44000, 49800, 1300, 670, 13400, 97…
## $ price <dbl> 69400, 56900, 49900, 47400, 42900, 36900, 83000, 72900, 69900,…
## $ age <dbl> 3, 3, 2, 4, 4, 6, 0, 0, 2, 0, 2, 2, 4, 3, 10, 11, 4, 4, 10, 3,…
## $ car <chr> "Porsche", "Porsche", "Porsche", "Porsche", "Porsche", "Porsch…
Let’s do a brief exploratory data analysis, focusing on price and mileage.
ggplot(
sports_car_prices,
aes(x = mileage)
) +
geom_histogram(bins = 20) +
labs(x = "Mileage", y = "Count", title = "Distribution of Miles Driven")
ggplot(
sports_car_prices,
aes(x = price)
) +
geom_histogram(bins = 20) +
labs(x = "Price", y = "Count", title = "Distribution of Price")
ggplot(
sports_car_prices,
aes(x = mileage, y = price)
) +
geom_point() +
labs(
x = "Mileage", y = "Price", title = "Price versus Mileage",
subtitle = "sports cars from cars.com"
)
What do you notice about price, mileage, and the relationship between price and mileage?
Unsurprisingly, increases in mileage of used sports cars are associated with decreases in price.
Models as functions. We can represent relationships between variables as functions. A mathematical function is a relationship between one or more inputs and an output: plug in the inputs and receive back the output. For example:
\[y = 7 +3 x\]
x <- 5
y <- 7 + 3 * x
y
## [1] 22
x <- 12
y <- 7 + 3 * x
y
## [1] 43
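We can package the same rule as an R function, which makes the plug-in-inputs, get-back-output view explicit. A small illustration (the function name f is ours, not part of the original code):

# the model y = 7 + 3x as a function: input x, output y
f <- function(x) 7 + 3 * x
f(5)
## [1] 22
f(12)
## [1] 43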
ggplot(
sports_car_prices,
aes(x = mileage, y = price)
) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "Mileage", y = "Price", title = "Price versus Age",
subtitle = "sports cars from cars.com"
)
What is method = "lm"
doing in the code chunk above?
Adding a line to the scatterplot.
- Response variable: Variable whose behavior or variation you are trying to understand, on the y-axis. Also called the dependent variable.
- Explanatory variable: Other variables that you want to use to explain the variation in the response, on the x-axis. Also called independent variables, predictors, or features.
- Predicted value: Output of the model function.
- Residuals: The differences between observed and predicted values; they show how far each case is from its predicted value.
Question: What does a negative residual mean? Which cars have negative residuals, those above or below the line?
Does the relationship between mileage and price vary depending on the car type?
ggplot(
data = sports_car_prices,
aes(x = mileage, y = price, color = car)
) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
We will talk about multiple explanatory variables in the coming weeks.
Statistics is the explanation of variation in the context of what remains unexplained.
The scatter around the line suggests that there are other factors accounting for variability in car sales price or that randomness plays a big role.
Adding meaningful explanatory variables can sometimes reduce this variation.
- Understand relationships: Characterize the relationship between y and x via slopes for numerical explanatory variables or differences for categorical explanatory variables.
- Assess differences: Employ our inference tools in a linear regression context (later).
- Make predictions: Plug in x, get the predicted y.
Here, we will be using the broom package, which follows tidyverse principles, to tidy up our regression output:

- tidy(): Constructs a tidy data frame summarizing the model’s statistical findings
- glance(): Constructs a concise one-row summary of the model
- augment(): Adds columns (e.g. predictions, residuals) to the original data that was modeled

What is the effect of an additional mile driven on a car’s sales price?
model_mileage <- lm(price ~ mileage, data = sports_car_prices)
tidy(model_mileage)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 62928. 3169. 19.9 1.50e-27
## 2 mileage -0.613 0.0762 -8.04 5.26e-11
\[\widehat{Price} = 62928 - 0.613~Mileage\]
Slope: For every additional mile a car is driven, the price is expected to decrease, on average, by $0.613. Alternatively, and more usefully, for every additional ten thousand miles a car is driven, the price is expected to decrease, on average, by $6,130.
Intercept: Cars that have a mileage of 0 are expected to have a sales price of 62928 dollars.
The interpretation of the intercept doesn’t make a lot of sense.
We’re interested in the \(\beta_0\) (population parameter for the intercept) and the \(\beta_1\) (population parameter for the slope) in the following model:
\[ {y} = \beta_0 + \beta_1~x + \epsilon \]
Unfortunately, we can’t observe these population parameters directly, so we use the sample data to estimate them:
\[ \hat{y} = b_0 + b_1~x \]
The regression line minimizes the sum of squared residuals.
Residuals: the data minus the fit, \(e_i = y_i - \hat{y}_i\)
The regression line minimizes \(\sum_{i = 1}^n e_i^2\).
Equivalently, minimizing \(\sum_{i = 1}^n [y_i - (b_0 + b_1~x_i)]^2\).
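As a concrete check, we can compute the residual sum of squares for the mileage model fit above; resid() extracts the \(e_i\) from an lm object:

# residual sum of squares for model_mileage: the sum of e_i^2
e <- resid(model_mileage)
sum(e^2)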
Question: Why do we minimize the squares of the residuals?
Let’s visualize the residuals on the plot below.
ggplot(
sports_car_prices,
aes(x = mileage, y = price)
) +
geom_point() +
labs(
x = "Mileage", y = "Price", title = "Price versus Age",
subtitle = "sports cars from cars.com"
)
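The chunk above plots only the raw points. Here is a minimal sketch of one way to overlay the residuals, using augment() from broom to obtain the fitted values (.fitted) and drawing each residual as a vertical segment:

augment(model_mileage) %>%
  ggplot(aes(x = mileage, y = price)) +
  geom_smooth(method = "lm", se = FALSE) +
  # each residual is the vertical distance from a point to the line
  geom_segment(aes(xend = mileage, yend = .fitted), color = "gray") +
  geom_point()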
Check out the applet here to see this process in action.
A few properties of the least squares line:

- The estimate for the slope, \(b_1\), has the same sign as the correlation between the two variables.
- The regression line goes through the center of mass point, the coordinates corresponding to average \(x\) and average \(y\): \((\bar{x}, \bar{y})\).
- The sum of the residuals is zero, \(\sum_{i = 1}^n e_i = 0\).
- The residuals and \(x\) values are uncorrelated.
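These properties are easy to verify numerically for model_mileage; a quick sketch (any tiny nonzero values you see are floating-point error):

e <- resid(model_mileage)
sum(e)                             # essentially zero
cor(e, sports_car_prices$mileage)  # essentially zero
# the line passes through (mean(mileage), mean(price))
predict(model_mileage, newdata = tibble(mileage = mean(sports_car_prices$mileage)))
mean(sports_car_prices$price)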
Let’s say we want to see if Jaguars have a higher selling price than Porsches. Our car variable has two categories: Porsche and Jaguar. Let’s run a model using car to predict price.
model_car <- lm(price ~ car, data = sports_car_prices)
tidy(model_car)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 31957. 2954. 10.8 1.56e-15
## 2 carPorsche 18580. 4178. 4.45 4.00e- 5
\[\widehat{Price} = 31957 + 18580~Car\]
Plug in 1 (Porsche):
\[\widehat{Price} = 31957 + 18580 \times 1 = 50537\]
Plug in 0 (Jaguar):
\[\widehat{Price} = 31957 + 18580 \times 0 = 31957\]
For a categorical variable with many levels, the levels are coded as dummy variables. Each coefficient describes the expected difference in the response comparing a particular level to the baseline level (the level coded as 0).
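A helpful way to see this: with a single categorical predictor, the fitted values are just the group means, so the intercept (31957) should match the mean Jaguar price, and the intercept plus the slope (31957 + 18580 = 50537) should match the mean Porsche price. A quick check:

# group means should match the fitted values from model_car
sports_car_prices %>%
  group_by(car) %>%
  summarize(mean_price = mean(price))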
Use model_mileage to make predictions.
On average, how expensive are cars with 50,000 miles? Examine the plot to make a prediction.
tidy(model_mileage)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 62928. 3169. 19.9 1.50e-27
## 2 mileage -0.613 0.0762 -8.04 5.26e-11
62928 + -0.613 * 50000
## [1] 32278
On average, we expect cars that have 50,000 miles to sell for $32,278. Note that this is only what we expect on average; we shouldn’t be surprised by some variability around this prediction. We’ll discuss this later.
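Rather than copying coefficients by hand, we can get the same prediction with predict(); a minimal sketch:

# newdata needs a column named after the predictor
predict(model_mileage, newdata = tibble(mileage = 50000))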
On average, how expensive are cars with 100,000 miles? Examine the plot to make a prediction.
62928 + -0.613 * 100000
## [1] 1628
This prediction is an extrapolation well beyond the range of the data we have, and there is no guarantee that the linear relationship continues. This prediction is not reliable.
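One quick way to see how far 100,000 miles sits outside the data is to check the observed range of the predictor:

# predictions far beyond this range are extrapolations
range(sports_car_prices$mileage)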
- tidy(): Constructs a data frame that summarizes the model’s statistical findings. We’ve talked about coefficient estimates and standard errors, but it also displays test statistics and p-values (more on these in a few weeks!).
- augment(): Adds columns to the original data that was modeled. This includes predictions and residuals.
- glance(): Constructs a concise one-row summary of the model, computed once for the entire model.

# untidy
model_mileage
summary(model_mileage)
# tidy
tidy(model_mileage)
tidy(model_mileage) %>%
select(term, estimate) %>%
mutate(estimate = round(estimate, 3))
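For completeness, here is what the other two broom verbs look like applied to this model (output not shown):

# one-row model summary: R-squared, sigma, and more
glance(model_mileage)
# the original data plus columns such as .fitted and .resid
augment(model_mileage)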
Use ggplot() to visualize the relationship between price and age and describe the relationship.

ggplot(
sports_car_prices,
aes(x = age, y = price)
) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Fit a linear model with price as the response and age as a predictor. Report the output in tidy format and write out the model equation.

model_age <- lm(price ~ age, data = sports_car_prices)
tidy(model_age)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 53246. 3322. 16.0 5.70e-23
## 2 age -2149. 466. -4.62 2.22e- 5
\[\widehat{Price} = 53246 - 2149~\text{Age}\]
For every one year increase in a car’s age, the price is expected to decrease, on average, by $2,149.
Cars with an age of 0 are expected to cost, on average, $53,246.
ggplot(
sports_car_prices,
aes(x = age, y = price, color = car)
) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)