Main ideas
- Understand sample statistic variability
- Bootstrap idea and how to use it
- Generating bootstrap confidence intervals
- Correctly interpret confidence intervals
library(tidyverse)
library(infer)
Consider 20 1-bedroom apartments that were randomly selected on Craigslist Manhattan from apartments listed as “by owner”.
manhattan <- read_csv("data/manhattan.csv")
Consider 3-10 day survival of mice randomized to various neutron doses and streptomycin therapy or saline control.
mice <- read_table("http://users.stat.ufl.edu/~winner/data/micerad.dat",
col_names = FALSE) %>%
rename(dose = X1, treat = X2, died = X3)
Suppose the following represents the population of IQ scores. There are only 1000 individuals in our population.
population <- tibble(iq = rnorm(n = 1000, mean = 100, sd = 10))
population %>%
ggplot(aes(x = iq)) +
geom_histogram(binwidth = 5, fill = "grey90", color = "blue") +
labs(x = "IQ scores") +
theme_bw()
Take a sample with slice_sample()
and compute the mean.
population %>%
slice_sample(n = 30) %>%
summarize(mean_iq = mean(iq))
## # A tibble: 1 x 1
## mean_iq
## <dbl>
## 1 98.1
Did everyone get the same value? If fact, each time you sample (run the above code chunk) you will most likely get a slightly different mean.
The mean is not unique to this phenomenon. There will be variability for all of the sample statistics we’ll be using to produce confidence intervals.
If we want to construct a confidence interval for a population mean (or any other parameter), we need to come up with a plausible range of values around our observed sample mean (statistic). This range will depend on how precise and how accurate our sample mean (statistic) is as an estimate of the population mean (parameter). Quantifying this requires a measurement of how much we would expect the sample mean (statistic) to vary from sample to sample.
To ensure reproducibility for your knitted document, set the random number generation seed.
set.seed(1902370)
Create a 95% bootstrap confidence interval for the population mean price of 1-bedroom apartments in Manhattan.
First, identify the following:
To create our confidence interval, we’ll follow the four-step bootstrap scheme outlined in the slides using functions from infer
.
Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample.
Calculate the bootstrap statistic - a statistic such as mean, median, proportion, slope, etc. computed from the bootstrap samples.
Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics.
Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution.
infer
First, specify()
the variable of interest, in this case rent
.
manhattan %>%
specify(response = rent)
## Response: rent (numeric)
## # A tibble: 20 x 1
## rent
## <dbl>
## 1 3850
## 2 3800
## 3 2350
## 4 3200
## 5 2150
## 6 3267
## 7 2495
## 8 2349
## 9 3950
## 10 1795
## 11 2145
## 12 2300
## 13 1775
## 14 2000
## 15 2175
## 16 2350
## 17 2550
## 18 4195
## 19 1470
## 20 2350
Second, generate()
a fixed number of bootstrap samples. Here we’ll generate 10,000 bootstrap samples. Generally, the smaller your sample size, the more bootstrap samples you’ll want to generate.
manhattan %>%
specify(response = rent) %>%
generate(reps = 10000, type = "bootstrap")
## Response: rent (numeric)
## # A tibble: 200,000 x 2
## # Groups: replicate [10,000]
## replicate rent
## <int> <dbl>
## 1 1 2350
## 2 1 2150
## 3 1 3950
## 4 1 3850
## 5 1 1470
## 6 1 1795
## 7 1 2145
## 8 1 2150
## 9 1 3850
## 10 1 3800
## # … with 199,990 more rows
Finally, calculate()
the relevant statistic for each bootstrap sample.
boot_means <- manhattan %>%
specify(response = rent) %>%
generate(reps = 10000, type = "bootstrap") %>%
calculate(stat = "mean")
boot_means
## # A tibble: 10,000 x 2
## replicate stat
## * <int> <dbl>
## 1 1 2869.
## 2 2 2749.
## 3 3 2466.
## 4 4 2738.
## 5 5 2725.
## 6 6 2615.
## 7 7 2761.
## 8 8 2788.
## 9 9 2256.
## 10 10 2844.
## # … with 9,990 more rows
Typically, you will do this all in one code chunk as is given in the chunk named calculate
. It is broken up here to better understand the steps.
Visualize boot_means
with ggplot()
.
boot_means %>%
ggplot(aes(x = stat)) +
geom_histogram(binwidth = 50)
Finally, compute the lower and upper bounds to form our interval estimate. For a 95% confidence interval, these bounds occur at the 2.5% and 97.5% quantiles.
boot_means %>%
summarize(
lb = quantile(stat, 0.025),
ub = quantile(stat, 0.975)
)
## # A tibble: 1 x 2
## lb ub
## <dbl> <dbl>
## 1 2299. 2977.
Package infer
also has a convenient function to do this: get_ci()
. Use whichever method you prefer.
get_ci(boot_means, level = 0.95)
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 2299. 2977.
We are 95% confident that the true mean price of one bedroom apartments for sale by owner in Manhattan is ($2,299, $2,977). If we took many samples from the population and constructed a 95% confidence interval for each sample, we expect approximately 95% of the intervals to contain the true population mean.
Use infer
functions to create a 98% bootstrap confidence interval for the population median price of 1-bedroom apartments in Manhattan. Do this in a single code chunk. Provide an interpretation of your result.
manhattan %>%
specify(response = rent) %>%
generate(reps = 10000, type = "bootstrap") %>%
calculate(stat = "median") %>%
get_ci(level = 0.98)
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 2150 3234.
We are 98% confident that the true median price of one bedroom apartments for sale by owner in Manhattan is ($2,150, $3,234). If we took many samples from the population and constructed a 98% confidence interval for each sample, we expect approximately 98% of the intervals to contain the true population median.
Consider the mice radiation dataset. Suppose we want to compute a 95% confidence interval for the proportion of mice that died while not on the treatment, regardless of the dose.
boot_prop <- mice %>%
filter(treat == 0) %>%
mutate(died = if_else(died == 1, "yes", "no")) %>%
specify(response = died, success = "yes") %>%
generate(reps = 10000, type = "bootstrap") %>%
calculate(stat = "prop")
ci_mice_died <- get_ci(boot_prop, level = .95)
ci_mice_died
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.690 0.797
We are 95% confident that the true proportion of mice that died while not on the treatment is between 0.69 and 0.797. If we were to repeatedly sample from the original population and construct many 95% confidence intervals, we would expect 95% of the intervals to contain the true population proportion.
visualize(boot_prop) +
shade_ci(ci_mice_died)
Consider the mice radiation results. Suppose we want to compute a 90% confidence interval for the proportion of mice that lived while on the treatment, regardless of the dose.
boot_prop_live <- mice %>%
filter(treat == 1) %>%
mutate(died = if_else(died == 0, "no", "yes")) %>%
specify(response = died, success = "no") %>%
generate(reps = 10000, type = "bootstrap") %>%
calculate(stat = "prop")
ci_mice_live <- get_ci(boot_prop_live, level = 0.90)
ci_mice_live
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.552 0.651
visualise(boot_prop_live) +
shade_ci(ci_mice_live)
We are 90% confident that the true proportion of mice that lived while on the treatment is between 0.552 and 0.651. If we were to repeatedly sample from the original population and construct many 90% confidence intervals, we would expect 90% of the intervals to contain the true population proportion.