Main ideas
- Discuss model selection
- Introduce logistic regression
Adjusted \(R^2\) is only one model selection criterion. There are many others - AIC and BIC are two commonly used ones.
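As a quick illustration (a sketch using the built-in mtcars data rather than the data analyzed below), base R's AIC() and BIC() functions report these criteria for models fit to the same data; lower values indicate better fit.
mod_small <- lm(mpg ~ wt, data = mtcars)
mod_large <- lm(mpg ~ wt + hp + cyl, data = mtcars)
# both functions accept several fitted models and return one row per model
AIC(mod_small, mod_large)
BIC(mod_small, mod_large)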
Multiple regression allows us to relate a numerical response variable to one or more numerical or categorical predictors. We can use multiple regression models to understand relationships, assess differences, and make predictions.
But what about a situation where the response of interest is categorical and binary?
On April 15, 1912, the famous ocean liner Titanic sank in the North Atlantic after striking an iceberg on her maiden voyage. The dataset titanic.csv contains the survival status and other attributes of individuals aboard the Titanic:
- survived: survival status (0 = died, 1 = survived)
- pclass: passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
- name: name of individual
- sex: sex (male or female)
- age: age in years
- fare: passenger fare in British pounds

We are interested in investigating the variables that contribute to passenger survival. Do women and children really come first?
Let’s load our data and then look at it.
library(tidyverse)
library(broom)

titanic <- read_csv("data/titanic.csv")
glimpse(titanic)
## Rows: 887
## Columns: 6
## $ pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ name <chr> "Mr. Owen Harris Braund", "Mrs. John Bradley (Florence Briggs…
## $ sex <chr> "male", "female", "female", "female", "male", "male", "male",…
## $ age <dbl> 22, 38, 26, 35, 35, 27, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,…
## $ fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
titanic %>%
  mutate(Survival = if_else(survived == 1, "Survived", "Died")) %>%
  ggplot(aes(x = sex, fill = Survival)) +
  geom_bar(position = "fill") +
  labs(x = "Sex", y = "")
titanic %>%
  mutate(Survival = if_else(survived == 1, "Survived", "Died")) %>%
  ggplot(aes(x = Survival, y = age)) +
  geom_boxplot() +
  labs(y = "Age")
Recall the multiple linear regression model and its fitted equation:
\[ y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon \]
\[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]
Denote by \(p\) the probability of survival and consider the model below.
\[ p = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon \]
Can you see any problems with this approach?
lm_survival <- lm(survived ~ age + sex, data = titanic)
tidy(lm_survival)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.752 0.0356 21.1 2.88e-80
## 2 age -0.000343 0.000979 -0.350 7.26e- 1
## 3 sexmale -0.551 0.0289 -19.1 3.50e-68
Let’s plot this fitted model on top of the data.
ggplot(
  titanic,
  aes(x = age, y = survived, color = sex)
) +
  geom_jitter(width = 0.05, height = 0.05, alpha = 0.5) +
  # fitted line for females: intercept 0.752, slope -0.000343
  geom_abline(intercept = 0.752, slope = -0.000343, color = "#F57670", lwd = 1) +
  # fitted line for males: intercept 0.752 - 0.551 = 0.201, same slope
  geom_abline(intercept = 0.201, slope = -0.000343, color = "#1FBEC3", lwd = 1)
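As a side note, rather than typing the intercepts and slope in by hand, we could pull them from the fitted model. A small sketch (the object names below are ours):
coefs <- coef(lm_survival)
female_intercept <- coefs["(Intercept)"]                   # 0.752
male_intercept <- coefs["(Intercept)"] + coefs["sexmale"]  # 0.752 - 0.551
age_slope <- coefs["age"]                                  # -0.000343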
Let’s look at the residuals against the predicted values.
lm_survival_aug <- augment(lm_survival)
ggplot(
  data = lm_survival_aug,
  aes(x = .fitted, y = .resid)
) +
  geom_point() +
  labs(x = "Predicted", y = "Residuals") +
  geom_hline(yintercept = 0, lwd = 2, lty = 2, col = "red")
ggplot(
  lm_survival_aug,
  aes(x = age, y = .fitted)
) +
  geom_point() +
  geom_hline(yintercept = c(0, 1), lwd = 2, lty = 2, col = "red") +
  labs(x = "Age", y = "Predicted")
This isn’t helpful! We need to develop a new tool.
Odds are sometimes expressed as X : Y and read X to Y.
They are the ratio of successes to failures, where values larger than 1 favor a success and values smaller than 1 favor a failure.
If \(P(A) = 1/2\), what are the odds of \(A\)? \((1/2)/(1/2) = 1\).
If \(P(B) = 1/3\), what are the odds of \(B\)? \((1/3)/(2/3) = 0.5\).
An odds ratio is a ratio of odds.
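To make these calculations concrete, here is a small sketch in R; the helper function odds() is ours, not from any package.
# convert a probability to an odds (successes to failures)
odds <- function(p) p / (1 - p)
odds(1 / 2) # 1
odds(1 / 3) # 0.5
# an odds ratio: the odds of A relative to the odds of B
odds(1 / 2) / odds(1 / 3) # 2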
The logit function converts a probability to the corresponding log-odds:
\[\text{logit}(p) = \text{log}\left(\frac{p}{1-p}\right)\]
The logit takes a value of \(p\) between 0 and 1 and outputs a value between \(-\infty\) and \(\infty\).
The inverse logit (logistic) takes a value between \(-\infty\) and \(\infty\) and outputs a value between 0 and 1.
\[\text{inverse logit}(x) = \frac{e^x}{1+e^x} = \frac{1}{1 + e^{-x}}\]
There is a one-to-one relationship between probabilities and log-odds. If we create a model using the log-odds we can “work backwards” using the logistic function to obtain probabilities between 0 and 1.
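In R, qlogis() computes the logit and plogis() the inverse logit (logistic), so we can check this round trip directly:
p <- 0.73
log_odds <- qlogis(p) # log(p / (1 - p)), about 0.995
plogis(log_odds)      # recovers the original probability, 0.73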
\[\text{log}\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_{k}\]
Use the inverse logit to find the expression for \(p\).
\[p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_{k}}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_{k}}}\]
We can use the logistic regression model to obtain predicted probabilities of success for a binary response variable.
We fit the model in R using the glm function, specifying family = "binomial".
logit_mod <- glm(survived ~ sex + age,
  data = titanic,
  family = "binomial"
)
tidy(logit_mod)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.11 0.208 5.34 9.05e- 8
## 2 sexmale -2.50 0.168 -14.9 3.24e-50
## 3 age -0.00206 0.00586 -0.351 7.25e- 1
And use augment to find predicted log-odds.
pred_log_odds <- augment(logit_mod)
pred_log_odds
## # A tibble: 887 x 9
## survived sex age .fitted .resid .std.resid .hat .sigma .cooksd
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 male 22 -1.43 -0.655 -0.655 0.00212 1.02 0.000170
## 2 1 female 38 1.04 0.779 0.781 0.00389 1.02 0.000464
## 3 1 female 26 1.06 0.771 0.772 0.00320 1.02 0.000372
## 4 1 female 35 1.04 0.777 0.779 0.00354 1.02 0.000419
## 5 0 male 35 -1.46 -0.647 -0.647 0.00186 1.02 0.000145
## 6 0 male 27 -1.44 -0.652 -0.652 0.00181 1.02 0.000143
## 7 0 male 54 -1.50 -0.635 -0.637 0.00461 1.02 0.000347
## 8 0 male 2 -1.39 -0.667 -0.669 0.00617 1.02 0.000518
## 9 1 female 27 1.06 0.772 0.773 0.00319 1.02 0.000371
## 10 1 female 14 1.09 0.763 0.765 0.00440 1.02 0.000500
## # … with 877 more rows
The fitted model is
\[\text{log}\left(\frac{\hat{p}}{1-\hat{p}}\right) = 1.11 - 2.50~sex - 0.00206~age\]
\[\hat{p} = \frac{e^{1.11 - 2.50~sex - 0.00206~age}}{1+e^{1.11 - 2.50~sex - 0.00206~age}}\]
where \(sex = 1\) for male passengers and \(sex = 0\) for female passengers, and \(\hat{p}\) is the predicted probability of survival.
Holding sex constant, for every additional year of age, we expect the log-odds of survival to decrease by approximately 0.002.
Holding age constant, we expect the log-odds of survival for males to be 2.50 less than the log-odds of survival for females.
\[\frac{\hat{p}}{1-\hat{p}} = e^{1.11 - 2.50~sex - 0.00206~age}\]
Holding sex constant, for every one-year increase in age, the odds of survival are expected to be multiplied by \(e^{-0.00206} = 0.998\).
Holding age constant, the odds of survival for males are \(e^{-2.50} = 0.082\) times the odds of survival for females.
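If you prefer to read the estimates directly on the odds-ratio scale, tidy() can exponentiate the coefficients for you (a quick sketch, assuming a reasonably recent version of broom):
# estimates and confidence intervals reported as odds ratios
tidy(logit_mod, exponentiate = TRUE, conf.int = TRUE)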
Let’s say you’re interested in the probability of survival for someone of a specific age and sex. We can use the fitted model to compute it.
pred_data <- tibble(
  age = 32,
  sex = "male"
)
augment(logit_mod, newdata = pred_data) %>%
  mutate(p = exp(.fitted) / (1 + exp(.fitted)))
## # A tibble: 1 x 4
## age sex .fitted p
## <dbl> <chr> <dbl> <dbl>
## 1 32 male -1.45 0.190
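Equivalently, augment() can return the prediction on the probability scale directly (a sketch; the type.predict argument is passed through to predict()):
augment(logit_mod, newdata = pred_data, type.predict = "response")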
glance(logit_mod)
## # A tibble: 1 x 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 1183. 886 -458. 922. 936. 916. 884 887
Using glance, you can see a variety of measures of model fit.
model_2 <- glm(survived ~ sex + age + fare,
  data = titanic,
  family = "binomial"
)
tidy(model_2)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.838 0.215 3.89 9.87e- 5
## 2 sexmale -2.39 0.171 -14.0 2.47e-44
## 3 age -0.00740 0.00604 -1.23 2.20e- 1
## 4 fare 0.0116 0.00234 4.96 7.23e- 7
Holding all else constant, the odds of survival for males are \(e^{-2.39} = 0.0916\) times the odds of survival for females.
Holding all else constant, for every one-year increase in age, the odds of survival are expected to be multiplied by \(e^{-0.00740} = 0.993\).
Holding all else constant, for every one-pound increase in fare, the odds of survival are expected to be multiplied by \(e^{0.0116} = 1.012\).
Remember: \(e\) raised to a fitted slope is an odds ratio, the multiplicative change in the odds of survival associated with a one-unit increase in that predictor, holding the other predictors constant.
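Since glance reports AIC and BIC, we can also use it to compare this model with the earlier one that used only sex and age; lower values indicate better fit (the model labels below are ours):
bind_rows(
  glance(logit_mod) %>% mutate(model = "sex + age"),
  glance(model_2) %>% mutate(model = "sex + age + fare")
) %>%
  select(model, AIC, BIC, deviance)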
Now let’s compare predictions for two 40-year-old male passengers, one who paid a fare of 100 pounds and one who paid 500 pounds.
tibble(
  age = c(40, 40),
  sex = c("male", "male"),
  fare = c(100, 500)
) %>%
  augment(model_2, newdata = .) %>%
  mutate(p = exp(.fitted) / (1 + exp(.fitted)))
## # A tibble: 2 x 5
## age sex fare .fitted p
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 40 male 100 -0.692 0.334
## 2 40 male 500 3.94 0.981
Finally, let’s plot the predicted probability of survival across fares from 0 to 500 pounds for a male and a female passenger of average age.
mean_age <- titanic %>%
  summarize(mean_age = mean(age)) %>%
  pull(mean_age)
tibble(
  age = rep(mean_age, times = 1002),
  sex = rep(c("male", "female"), each = 501),
  fare = rep(0:500, times = 2)
) %>%
  augment(model_2, newdata = .) %>%
  mutate(p = exp(.fitted) / (1 + exp(.fitted))) %>%
  ggplot(aes(x = fare, y = p, color = sex)) +
  geom_line() +
  theme_bw()
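The rep() calls build the prediction grid by hand; a sketch of an alternative, assuming tidyr 1.0 or later, is to let expand_grid() create every combination of the three inputs for us:
expand_grid(
  age = mean_age,
  sex = c("male", "female"),
  fare = 0:500
) %>%
  augment(model_2, newdata = .) %>%
  mutate(p = exp(.fitted) / (1 + exp(.fitted)))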