Main ideas
- Discuss model selection
- Introduce logistic regression
Adjusted \(R^2\) is only one model selection criterion. There are many others - AIC and BIC are two commonly used ones.
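As a quick illustration (a sketch using the built-in mtcars data rather than the data analyzed below), base R's AIC() and BIC() functions report these criteria for models fit to the same data; lower values indicate better fit.
mod_small <- lm(mpg ~ wt, data = mtcars)
mod_large <- lm(mpg ~ wt + hp + cyl, data = mtcars)
# both functions accept several fitted models and return one row per model
AIC(mod_small, mod_large)
BIC(mod_small, mod_large)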
Multiple regression allows us to relate a numerical response variable to one or more numerical or categorical predictors. We can use multiple regression models to understand relationships, assess differences, and make predictions.
But what about a situation where the response of interest is categorical and binary?
On April 15, 1912, the famous ocean liner Titanic sank in the North Atlantic after striking an iceberg on her maiden voyage. The dataset titanic.csv contains the survival status and other attributes of individuals aboard the Titanic:
- survived: survival status (0 = died, 1 = survived)
- pclass: passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
- name: name of individual
- sex: sex (male or female)
- age: age in years
- fare: passenger fare in British pounds

We are interested in investigating the variables that contribute to passenger survival. Do women and children really come first?
Let’s load our data and then look at it.
library(tidyverse)
library(broom)

titanic <- read_csv("data/titanic.csv")
glimpse(titanic)
## Rows: 887
## Columns: 6
## $ pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ name <chr> "Mr. Owen Harris Braund", "Mrs. John Bradley (Florence Briggs…
## $ sex <chr> "male", "female", "female", "female", "male", "male", "male",…
## $ age <dbl> 22, 38, 26, 35, 35, 27, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,…
## $ fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
titanic %>%
  mutate(Survival = if_else(survived == 1, "Survived", "Died")) %>%
  ggplot(aes(x = sex, fill = Survival)) +
  geom_bar(position = "fill") +
  labs(x = "Sex", y = "")
titanic %>%
  mutate(Survival = if_else(survived == 1, "Survived", "Died")) %>%
  ggplot(aes(x = Survival, y = age)) +
  geom_boxplot() +
  labs(y = "Age")
Recall the multiple linear regression model and its fitted equation:
\[ y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon \]
\[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]
Denote by \(p\) the probability of survival and consider the model below.
\[ p = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon \]
Can you see any problems with this approach?
lm_survival <- lm(survived ~ age + sex, data = titanic)
tidy(lm_survival)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.752 0.0356 21.1 2.88e-80
## 2 age -0.000343 0.000979 -0.350 7.26e- 1
## 3 sexmale -0.551 0.0289 -19.1 3.50e-68
Let’s plot this fitted model on top of the data.
ggplot(
  titanic,
  aes(x = age, y = survived, color = sex)
) +
  geom_jitter(width = 0.05, height = 0.05, alpha = 0.5) +
  # fitted line for females: intercept 0.752, slope -0.000343
  geom_abline(intercept = 0.752, slope = -0.000343, color = "#F57670", lwd = 1) +
  # fitted line for males: intercept 0.752 - 0.551 = 0.201, same slope
  geom_abline(intercept = 0.201, slope = -0.000343, color = "#1FBEC3", lwd = 1)
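As a side note, rather than typing the intercepts and slope in by hand, we could pull them from the fitted model. A small sketch (the object names below are ours):
coefs <- coef(lm_survival)
female_intercept <- coefs["(Intercept)"]                   # 0.752
male_intercept <- coefs["(Intercept)"] + coefs["sexmale"]  # 0.752 - 0.551
age_slope <- coefs["age"]                                  # -0.000343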
Let’s look at the residuals against the predicted values.
lm_survival_aug <- augment(lm_survival)
ggplot(
  data = lm_survival_aug,
  aes(x = .fitted, y = .resid)
) +
  geom_point() +
  labs(x = "Predicted", y = "Residuals") +
  geom_hline(yintercept = 0, lwd = 2, lty = 2, col = "red")
ggplot(
  lm_survival_aug,
  aes(x = age, y = .fitted)
) +
  geom_point() +
  geom_hline(yintercept = c(0, 1), lwd = 2, lty = 2, col = "red") +
  labs(x = "Age", y = "Predicted")
This isn’t helpful! We need to develop a new tool.
Odds are sometimes expressed as X : Y and read X to Y.
They are the ratio of successes to failures, where values larger than 1 favor a success and values smaller than 1 favor a failure.
If \(P(A) = 1/2\), what are the odds of \(A\)? \((1/2)/(1/2) = 1\).
If \(P(B) = 1/3\), what are the odds of \(B\)? \((1/3)/(2/3) = 0.5\).
An odds ratio is a ratio of odds.
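To make these calculations concrete, here is a small sketch in R; the helper function odds() is ours, not from any package.
# convert a probability to an odds (successes to failures)
odds <- function(p) p / (1 - p)
odds(1 / 2) # 1
odds(1 / 3) # 0.5
# an odds ratio: the odds of A relative to the odds of B
odds(1 / 2) / odds(1 / 3) # 2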
The logit function converts a probability to the corresponding log-odds:
\[\text{logit}(p) = \text{log}\left(\frac{p}{1-p}\right)\]
The logit takes a value of \(p\) between 0 and 1 and outputs a value between \(-\infty\) and \(\infty\).
The inverse logit (logistic) takes a value between \(-\infty\) and \(\infty\) and outputs a value between 0 and 1.
\[\text{inverse logit}(x) = \frac{e^x}{1+e^x} = \frac{1}{1 + e^{-x}}\]
There is a one-to-one relationship between probabilities and log-odds. If we create a model using the log-odds we can “work backwards” using the logistic function to obtain probabilities between 0 and 1.
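In R, qlogis() computes the logit and plogis() the inverse logit (logistic), so we can check this round trip directly:
p <- 0.73
log_odds <- qlogis(p) # log(p / (1 - p)), about 0.995
plogis(log_odds)      # recovers the original probability, 0.73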
\[\text{log}\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_{k}\]
Use the inverse logit to find the expression for \(p\).
\[p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_{k}}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_{k}}}\]
We can use the logistic regression model to obtain predicted probabilities of success for a binary response variable.
We fit the model in R using the glm function, specifying family = "binomial".
logit_mod <- glm(survived ~ sex + age,
  data = titanic,
  family = "binomial"
)
tidy(logit_mod)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.11 0.208 5.34 9.05e- 8
## 2 sexmale -2.50 0.168 -14.9 3.24e-50
## 3 age -0.00206 0.00586 -0.351 7.25e- 1
And use augment to find predicted log-odds.
pred_log_odds <- augment(logit_mod)
pred_log_odds
## # A tibble: 887 x 9
## survived sex age .fitted .resid .std.resid .hat .sigma .cooksd
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 male 22 -1.43 -0.655 -0.655 0.00212 1.02 0.000170
## 2 1 female 38 1.04 0.779 0.781 0.00389 1.02 0.000464
## 3 1 female 26 1.06 0.771 0.772 0.00320 1.02 0.000372
## 4 1 female 35 1.04 0.777 0.779 0.00354 1.02 0.000419
## 5 0 male 35 -1.46 -0.647 -0.647 0.00186 1.02 0.000145
## 6 0 male 27 -1.44 -0.652 -0.652 0.00181 1.02 0.000143
## 7 0 male 54 -1.50 -0.635 -0.637 0.00461 1.02 0.000347
## 8 0 male 2 -1.39 -0.667 -0.669 0.00617 1.02 0.000518
## 9 1 female 27 1.06 0.772 0.773 0.00319 1.02 0.000371
## 10 1 female 14 1.09 0.763 0.765 0.00440 1.02 0.000500
## # … with 877 more rows
The fitted model is
\[\text{log}\left(\frac{\hat{p}}{1-\hat{p}}\right) = 1.11 - 2.50~sex - 0.00206~age\]
\[\hat{p} = \frac{e^{1.11 - 2.50~sex - 0.00206~age}}{1+e^{1.11 - 2.50~sex - 0.00206~age}}\]
where \(sex = 1\) for male passengers and \(sex = 0\) for female passengers, and \(\hat{p}\) is the predicted probability of survival.
Holding sex constant, for every additional year of age, we expect the log-odds of survival to decrease by approximately 0.002.
Holding age constant, we expect the log-odds of survival for males to be 2.50 less than the log-odds of survival for females.
\[\frac{\hat{p}}{1-\hat{p}} = e^{1.11 - 2.50~sex - 0.00206~age}\]
Holding sex constant, for every one-year increase in age, the odds of survival are expected to be multiplied by \(e^{-0.00206} = 0.998\).
Holding age constant, the odds of survival for males are \(e^{-2.50} = 0.082\) times the odds of survival for females.
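If you prefer to read the estimates directly on the odds-ratio scale, tidy() can exponentiate the coefficients for you (a quick sketch, assuming a reasonably recent version of broom):
# estimates and confidence intervals reported as odds ratios
tidy(logit_mod, exponentiate = TRUE, conf.int = TRUE)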
Let’s say you’re interested in the probability of survival for someone of a specific age and sex. We can use the fitted model to compute it.
pred_data <- tibble(
  age = 32,
  sex = "male"
)
augment(logit_mod, newdata = pred_data) %>%
  mutate(p = exp(.fitted) / (1 + exp(.fitted)))
## # A tibble: 1 x 4
## age sex .fitted p
## <dbl> <chr> <dbl> <dbl>
## 1 32 male -1.45 0.190
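Equivalently, augment() can return the prediction on the probability scale directly (a sketch; the type.predict argument is passed through to predict()):
augment(logit_mod, newdata = pred_data, type.predict = "response")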
glance(logit_mod)
## # A tibble: 1 x 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 1183. 886 -458. 922. 936. 916. 884 887
Using glance, you can see a variety of measures of model fit.
model_2 <- glm(survived ~ sex + age + fare,
  data = titanic,
  family = "binomial"
)
tidy(model_2)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.838 0.215 3.89 9.87e- 5
## 2 sexmale -2.39 0.171 -14.0 2.47e-44
## 3 age -0.00740 0.00604 -1.23 2.20e- 1
## 4 fare 0.0116 0.00234 4.96 7.23e- 7
Holding all else constant, the odds of survival for males are \(e^{-2.39} = 0.0916\) times the odds of survival for females.
Holding all else constant, for every one-year increase in age, the odds of survival are expected to be multiplied by \(e^{-0.00740} = 0.993\).
Holding all else constant, for every one-pound increase in fare, the odds of survival are expected to be multiplied by \(e^{0.0116} = 1.012\).
Remember: \(e\) raised to a fitted slope is an odds ratio, the multiplicative change in the odds of survival associated with a one-unit increase in that predictor, holding the other predictors constant.
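Since glance reports AIC and BIC, we can also use it to compare this model with the earlier one that used only sex and age; lower values indicate better fit (the model labels below are ours):
bind_rows(
  glance(logit_mod) %>% mutate(model = "sex + age"),
  glance(model_2) %>% mutate(model = "sex + age + fare")
) %>%
  select(model, AIC, BIC, deviance)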
Now let’s compare predictions for two 40-year-old male passengers, one who paid a fare of 100 pounds and one who paid 500 pounds.
tibble(
  age = c(40, 40),
  sex = c("male", "male"),
  fare = c(100, 500)
) %>%
  augment(model_2, newdata = .) %>%
  mutate(p = exp(.fitted) / (1 + exp(.fitted)))
## # A tibble: 2 x 5
## age sex fare .fitted p
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 40 male 100 -0.692 0.334
## 2 40 male 500 3.94 0.981
Finally, let’s plot the predicted probability of survival across fares from 0 to 500 pounds for a male and a female passenger of average age.
mean_age <- titanic %>%
  summarize(mean_age = mean(age)) %>%
  pull(mean_age)
tibble(
  age = rep(mean_age, times = 1002),
  sex = rep(c("male", "female"), each = 501),
  fare = rep(0:500, times = 2)
) %>%
  augment(model_2, newdata = .) %>%
  mutate(p = exp(.fitted) / (1 + exp(.fitted))) %>%
  ggplot(aes(x = fare, y = p, color = sex)) +
  geom_line() +
  theme_bw()
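The rep() calls build the prediction grid by hand; a sketch of an alternative, assuming tidyr 1.0 or later, is to let expand_grid() create every combination of the three inputs for us:
expand_grid(
  age = mean_age,
  sex = c("male", "female"),
  fare = 0:500
) %>%
  augment(model_2, newdata = .) %>%
  mutate(p = exp(.fitted) / (1 + exp(.fitted)))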