Simulation-based inference: confidence intervals

Song of the Day

Symphony No. 4 -David Maslanka

Main ideas

Understand sample statistic variability
Bootstrap idea and how to use it
Generating bootstrap confidence intervals
Correctly interpret confidence intervals

Coming Up

Homework #03 due Thursday 3-04.

Packages

library(tidyverse)
library(infer)

Data

Consider 20 1-bedroom apartments that were randomly selected on Craigslist Manhattan from apartments listed as “by owner”.

manhattan <- read_csv("data/manhattan.csv")

Consider 3-10 day survival of mice randomized to various neutron doses and streptomycin therapy or saline control.

mice <- read_table("http://users.stat.ufl.edu/~winner/data/micerad.dat",
                   col_names = FALSE) %>% 
  rename(dose = X1, treat = X2, died = X3)

Variability of sample statistics

Suppose the following represents the population of IQ scores. There are only 1000 individuals in our population.

population <- tibble(iq = rnorm(n = 1000, mean = 100, sd = 10))

population %>% 
  ggplot(aes(x = iq)) +
  geom_histogram(binwidth = 5, fill = "grey90", color = "blue") +
  labs(x = "IQ scores") +
  theme_bw()

Take a sample with slice_sample() and compute the mean.

population %>% 
  slice_sample(n = 30) %>% 
  summarize(mean_iq = mean(iq))

## # A tibble: 1 x 1
##   mean_iq
##     <dbl>
## 1    98.1

Did everyone get the same value? If fact, each time you sample (run the above code chunk) you will most likely get a slightly different mean.

The mean is not unique to this phenomenon. There will be variability for all of the sample statistics we’ll be using to produce confidence intervals.

If we want to construct a confidence interval for a population mean (or any other parameter), we need to come up with a plausible range of values around our observed sample mean (statistic). This range will depend on how precise and how accurate our sample mean (statistic) is as an estimate of the population mean (parameter). Quantifying this requires a measurement of how much we would expect the sample mean (statistic) to vary from sample to sample.

Confidence intervals via bootstrap simulation

To ensure reproducibility for your knitted document, set the random number generation seed.

set.seed(1902370)

A confidence interval for the population mean - $\mu$

Create a 95% bootstrap confidence interval for the population mean price of 1-bedroom apartments in Manhattan.

First, identify the following:

Population: one bedroom Manhattan apartments listed on Craigslist “by owner”
Parameter of interest: The true, unknown population mean price of one bedroom apartments in Manhattan
Sample: 20 randomly selected apartments from Craigslist

To create our confidence interval, we’ll follow the four-step bootstrap scheme outlined in the slides using functions from infer.

Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample.
Calculate the bootstrap statistic - a statistic such as mean, median, proportion, slope, etc. computed from the bootstrap samples.
Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics.
Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution.

`infer`

First, specify() the variable of interest, in this case rent.

manhattan %>% 
  specify(response = rent)

## Response: rent (numeric)
## # A tibble: 20 x 1
##     rent
##    <dbl>
##  1  3850
##  2  3800
##  3  2350
##  4  3200
##  5  2150
##  6  3267
##  7  2495
##  8  2349
##  9  3950
## 10  1795
## 11  2145
## 12  2300
## 13  1775
## 14  2000
## 15  2175
## 16  2350
## 17  2550
## 18  4195
## 19  1470
## 20  2350

Second, generate() a fixed number of bootstrap samples. Here we’ll generate 10,000 bootstrap samples. Generally, the smaller your sample size, the more bootstrap samples you’ll want to generate.

manhattan %>% 
  specify(response = rent) %>% 
  generate(reps = 10000, type = "bootstrap")

## Response: rent (numeric)
## # A tibble: 200,000 x 2
## # Groups:   replicate [10,000]
##    replicate  rent
##        <int> <dbl>
##  1         1  2350
##  2         1  2150
##  3         1  3950
##  4         1  3850
##  5         1  1470
##  6         1  1795
##  7         1  2145
##  8         1  2150
##  9         1  3850
## 10         1  3800
## # … with 199,990 more rows

Finally, calculate() the relevant statistic for each bootstrap sample.

boot_means <- manhattan %>% 
  specify(response = rent) %>% 
  generate(reps = 10000, type = "bootstrap") %>% 
  calculate(stat = "mean")

boot_means

## # A tibble: 10,000 x 2
##    replicate  stat
##  *     <int> <dbl>
##  1         1 2869.
##  2         2 2749.
##  3         3 2466.
##  4         4 2738.
##  5         5 2725.
##  6         6 2615.
##  7         7 2761.
##  8         8 2788.
##  9         9 2256.
## 10        10 2844.
## # … with 9,990 more rows

Typically, you will do this all in one code chunk as is given in the chunk named calculate. It is broken up here to better understand the steps.

Visualize boot_means with ggplot().

boot_means %>% 
  ggplot(aes(x = stat)) +
  geom_histogram(binwidth = 50)

Finally, compute the lower and upper bounds to form our interval estimate. For a 95% confidence interval, these bounds occur at the 2.5% and 97.5% quantiles.

boot_means %>% 
  summarize(
    lb = quantile(stat, 0.025),
    ub = quantile(stat, 0.975)
  )

## # A tibble: 1 x 2
##      lb    ub
##   <dbl> <dbl>
## 1 2299. 2977.

Package infer also has a convenient function to do this: get_ci(). Use whichever method you prefer.

get_ci(boot_means, level = 0.95)

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    2299.    2977.

We are 95% confident that the true mean price of one bedroom apartments for sale by owner in Manhattan is ($2,299, $2,977). If we took many samples from the population and constructed a 95% confidence interval for each sample, we expect approximately 95% of the intervals to contain the true population mean.

Practice

Use infer functions to create a 98% bootstrap confidence interval for the population median price of 1-bedroom apartments in Manhattan. Do this in a single code chunk. Provide an interpretation of your result.

manhattan %>% 
  specify(response = rent) %>% 
  generate(reps = 10000, type = "bootstrap") %>% 
  calculate(stat = "median") %>% 
  get_ci(level = 0.98)

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     2150    3234.

We are 98% confident that the true median price of one bedroom apartments for sale by owner in Manhattan is ($2,150, $3,234). If we took many samples from the population and constructed a 98% confidence interval for each sample, we expect approximately 98% of the intervals to contain the true population median.

A confidence interval for the population proportion - $p$

Consider the mice radiation dataset. Suppose we want to compute a 95% confidence interval for the proportion of mice that died while not on the treatment, regardless of the dose.

boot_prop <- mice %>% 
  filter(treat == 0) %>% 
  mutate(died = if_else(died == 1, "yes", "no")) %>% 
  specify(response = died, success = "yes") %>% 
  generate(reps = 10000, type = "bootstrap") %>% 
  calculate(stat = "prop")

ci_mice_died <- get_ci(boot_prop, level = .95)
ci_mice_died

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.690    0.797

We are 95% confident that the true proportion of mice that died while not on the treatment is between 0.69 and 0.797. If we were to repeatedly sample from the original population and construct many 95% confidence intervals, we would expect 95% of the intervals to contain the true population proportion.

visualize(boot_prop) +
  shade_ci(ci_mice_died)

Practice

Consider the mice radiation results. Suppose we want to compute a 90% confidence interval for the proportion of mice that lived while on the treatment, regardless of the dose.

boot_prop_live <- mice %>% 
  filter(treat == 1) %>% 
  mutate(died = if_else(died == 0, "no", "yes")) %>% 
  specify(response = died, success = "no") %>% 
  generate(reps = 10000, type = "bootstrap") %>% 
  calculate(stat = "prop")

ci_mice_live <- get_ci(boot_prop_live, level = 0.90)
ci_mice_live

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.552    0.651

visualise(boot_prop_live) +
  shade_ci(ci_mice_live)

We are 90% confident that the true proportion of mice that lived while on the treatment is between 0.552 and 0.651. If we were to repeatedly sample from the original population and construct many 90% confidence intervals, we would expect 90% of the intervals to contain the true population proportion.

References

Source: C.W. Hammond, et al. (1955). “The Effect of Streptomycin Therapy on Mice Irradiated with Fast Neutrons”, Radiation Research, Vol2,#4, pp.354-360

Simulation-based inference: confidence intervals

3-04-2020

Song of the Day

Main ideas

Coming Up

Packages

Data

Variability of sample statistics

Confidence intervals via bootstrap simulation

A confidence interval for the population mean - \(\mu\)

`infer`

Practice

A confidence interval for the population proportion - \(p\)

Practice

References