Main ideas
- Understand the CLT and how to use it
- Create confidence intervals for a population mean using a CLT-based approach
- Create confidence intervals for a population proportion using a CLT-based approach
library(tidyverse)
library(infer)
In the examples and practice sections, we’ll work with a subset of data from the General Social Survey.
gss_2010 <- read_csv("data/gss_2010.csv")
Recall that for a population with a well-defined mean \(\mu\) and standard deviation \(\sigma\), these three properties hold for the distribution of the sample average \(\bar{X}\), assuming certain conditions are met:
- The distribution of \(\bar{X}\) is centered at the population mean \(\mu\).
- The standard deviation of \(\bar{X}\) (its standard error) is \(\sigma/\sqrt{n}\), so it shrinks as the sample size grows.
- For large enough samples, the distribution of \(\bar{X}\) is approximately normal (see the simulation sketch below).
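To see these properties in action, here is a minimal simulation sketch (the population, sample size, and number of repetitions are arbitrary choices): even when the population is clearly skewed, the sample means pile up in an approximately normal shape around \(\mu\).
# Simulation sketch of the CLT: repeatedly sample from a skewed population
# and look at the distribution of the resulting sample means
population <- rexp(100000, rate = 1)   # a right-skewed population
sample_means <- replicate(2000, mean(sample(population, size = 50)))

ggplot(tibble(x_bar = sample_means), aes(x = x_bar)) +
  geom_histogram(bins = 30) +
  labs(x = "Sample mean", y = "Count")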
Knowing the distribution of the sample statistic \(\bar{X}\) can help us
- estimate a population parameter as point estimate \(\pm\) margin of error, where the margin of error combines a measure of how confident we want to be with the sample statistic’s variability, and
- test a claim about a population parameter by evaluating how likely it is to obtain the observed sample statistic when assuming that the null hypothesis is true, since this probability depends on the sampling distribution’s variability.
If the necessary conditions are met, we can also use inference methods based on the CLT. The CLT tells us that \(\bar{X}\) approximately follows the distribution \(N\left(\mu, \sigma/\sqrt{n}\right)\). That is,
\[Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)\]
Visualize some normal densities
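Here is one way to sketch a few normal densities with ggplot2 (the particular means and standard deviations are arbitrary choices for illustration):
# Overlay a few normal densities with different centers and spreads
tibble(x = seq(-6, 6, length.out = 500)) %>%
  mutate(`N(0, 1)` = dnorm(x, mean = 0, sd = 1),
         `N(0, 2)` = dnorm(x, mean = 0, sd = 2),
         `N(2, 1)` = dnorm(x, mean = 2, sd = 1)) %>%
  pivot_longer(-x, names_to = "distribution", values_to = "density") %>%
  ggplot(aes(x = x, y = density, color = distribution)) +
  geom_line() +
  labs(x = "x", y = "Density")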
While we can (and will) use the CLT result to do inference, in practice, we never know the true value of \(\sigma\), and so we estimate it from our data with \(s\) (sample standard deviation). The quantity \(T\) has a t-distribution with \(n-1\) degrees of freedom:
\[ T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}\]
- The \(t\)-distribution is unimodal, symmetric, and centered at 0.
- It has thicker tails than the normal distribution (to make up for the additional variability introduced by using \(s\) instead of \(\sigma\)).
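The plot discussed next can be sketched along these lines, overlaying \(t\)-densities for a few degrees of freedom on the standard normal density (the specific degrees of freedom are arbitrary choices):
# Compare t-densities at several degrees of freedom to the standard normal
tibble(x = seq(-4, 4, length.out = 500)) %>%
  mutate(`t, df = 2` = dt(x, df = 2),
         `t, df = 5` = dt(x, df = 5),
         `t, df = 30` = dt(x, df = 30),
         `N(0, 1)` = dnorm(x)) %>%
  pivot_longer(-x, names_to = "distribution", values_to = "density") %>%
  ggplot(aes(x = x, y = density, color = distribution)) +
  geom_line() +
  labs(x = "x", y = "Density")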
What do you notice in the plot?
The \(t\)-distributions are unimodal, symmetric, and centered at 0. They have thicker tails than the normal distribution, but as the degrees of freedom increase, the \(t\)-distribution more closely resembles the normal distribution.
Recall that in our bootstrap simulation-based approach to creating confidence intervals, the last step was to calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution. Rather than work with the bootstrap distribution, we can work directly with the theoretical sampling distribution of the sample statistic, which the CLT gives us.
To find cutoffs (quantiles) from the normal and t distributions, we can use the functions qnorm() and qt(), respectively.
qnorm(p = 0.975, mean = 0, sd = 1)
## [1] 1.959964
qnorm(0.975)
## [1] 1.959964
qt(p = 0.975, df = 5)
## [1] 2.570582
qt(p = 0.975, df = 10)
## [1] 2.228139
qt(p = 0.975, df = 1000)
## [1] 1.962339
The GSS asks “After an average work day, about how many hours do you have to relax or pursue activities that you enjoy?”. Compute a 95% confidence interval for the mean hours of relaxation time per day after work using a CLT-based approach.
First, we’ll check out our sample data and compute some summary statistics.
hrs_relax_stats <- gss_2010 %>%
  filter(!is.na(hrsrelax)) %>%           # drop missing responses
  summarize(x_bar = mean(hrsrelax),      # sample mean
            s = sd(hrsrelax),            # sample standard deviation
            n = n())                     # sample size
hrs_relax_stats
## # A tibble: 1 x 3
## x_bar s n
## <dbl> <dbl> <int>
## 1 3.68 2.63 1154
Let’s grab these three statistics as vectors to make it easier to compute our confidence interval.
n <- hrs_relax_stats$n
x_bar <- hrs_relax_stats$x_bar
s <- hrs_relax_stats$s
Our confidence interval formula is given by
\[\mbox{point estimate} \pm t^* \times \mbox{SE},\]
where our point estimate will be the sample mean, \(t^*\) is the critical value from the t-distribution corresponding to the desired confidence level, and the standard error is a function of the sample standard deviation and the sample size.
\[\bar{x} \pm t^* \times \frac{s}{\sqrt{n}}\]
(t_crit <- qt(p = 0.975, df = n - 1))
## [1] 1.962024
x_bar + c(-1, 1) * t_crit * (s / sqrt(n))
## [1] 3.528364 3.832122
We are 95% confident that the true mean number of hours American adults have after work to relax or pursue activities they enjoy is between 3.53 and 3.83 hours.
The infer package has a function to do these calculations in one step. The function t_test() is a tidier version of the built-in R function t.test().
t_test(gss_2010, response = hrsrelax, conf_level = 0.95)
## # A tibble: 1 x 6
## statistic t_df p_value alternative lower_ci upper_ci
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 47.5 1153 5.37e-274 two.sided 3.53 3.83
For now, focus on the last two variables, lower_ci and upper_ci. Next time we’ll discuss the first four in our lecture on hypothesis testing.
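If all we need are the interval bounds, we can keep just the columns ending in ci, the same way we do with prop_test() later on:
# Keep only the confidence interval bounds from the t_test() output
t_test(gss_2010, response = hrsrelax, conf_level = 0.95) %>%
  select(ends_with("ci"))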
What assumptions must we make for this inference procedure to be valid?
- Independence: the sample must be random, and the sample size must be less than 10% of the population.
- Sample size: the sample size must be greater than 30 (see the quick check below).
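As a quick sketch, we can verify the sample size condition for the hrsrelax analysis in code; independence has to be argued from the GSS sampling design rather than computed.
# The hrsrelax sample size easily clears the 30 threshold
gss_2010 %>%
  filter(!is.na(hrsrelax)) %>%
  summarize(n = n(), sample_size_ok = n() > 30)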
The built-in dataset quakes gives information on seismic events near Fiji since 1964.
Take a random sample of 40 events from quakes. You can use dplyr’s slice_sample(). Save this result as an object named quakes_40.
quakes_40 <- quakes %>%
  slice_sample(n = 40)
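One practical note: slice_sample() draws a random sample, so your numbers will differ from the output shown below unless you set a seed before sampling. The seed value in this sketch is arbitrary and is not the one behind the results shown here.
# Run before slice_sample() to make the random sample reproducible
set.seed(1234)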
Compute the sample mean, standard deviation, and sample size of the depths in quakes_40.
quakes_40 %>%
  summarize(x_bar = mean(depth), s = sd(depth), n = n())
## x_bar s n
## 1 317.225 217.1511 40
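As a sketch, we can also compute this 90% interval directly from the formula introduced earlier; it should agree with the t_test() output that follows.
# Formula-based 90% CI for the mean depth, using the statistics from quakes_40
quakes_40_stats <- quakes_40 %>%
  summarize(x_bar = mean(depth), s = sd(depth), n = n())

t_crit_90 <- qt(p = 0.95, df = quakes_40_stats$n - 1)
quakes_40_stats$x_bar + c(-1, 1) * t_crit_90 * (quakes_40_stats$s / sqrt(quakes_40_stats$n))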
t_test(quakes_40, response = depth, conf_level = 0.90)
## # A tibble: 1 x 6
## statistic t_df p_value alternative lower_ci upper_ci
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 9.24 39 2.29e-11 two.sided 259. 375.
We are 90% confident that the true mean depth of seismic events off Fiji is between 259 and 375 km.
quakes consists of all of the seismic activity that has occurred near Fiji. Does your 90% confidence interval cover the population parameter? My 90% confidence interval contains the population parameter, 311.371 km.
As we discovered in class, in the long run, over repeated samples, 90% of intervals calculated using the same method will contain the true population parameter and 10% of intervals will not.
We don’t know if our particular interval contains the truth or not.
quakes %>%
  summarize(mu = mean(depth))
## mu
## 1 311.371
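Here is a minimal sketch of that long-run idea, treating quakes as the population and repeating the sampling many times (the number of repetitions is an arbitrary choice):
# Draw many samples of size 40, build a 90% CI from each, and record
# whether the interval covers the true mean depth
mu <- mean(quakes$depth)

covered <- replicate(1000, {
  d <- sample(quakes$depth, size = 40)
  ci <- mean(d) + c(-1, 1) * qt(p = 0.95, df = 39) * sd(d) / sqrt(40)
  ci[1] <= mu & mu <= ci[2]
})

mean(covered)  # should be close to 0.90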
Our sample proportion \(\hat{p}\) is the most plausible value of the population proportion, \(p\), so it makes sense to build a confidence interval around this point estimate. The standard error provides a guide for how large we should make the confidence interval.
The standard error represents the standard deviation of the point estimate, and when the Central Limit Theorem conditions are satisfied, the point estimate closely follows a normal distribution. The CLT tells us that \(\hat{p}\) approximately has the distribution \(N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)\).
To ensure our sample is “large” enough for proportions, we must verify the success-failure condition: \(n\hat{p} \ge 10\) and \(n(1-\hat{p}) \ge 10\), i.e., at least 10 successes and at least 10 failures in the sample.
A confidence interval for \(p\) is given by
\[\hat{p} \pm z^* \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\] where \(z^*\) corresponds to the confidence level selected. Since we don’t know \(p\) we make a substitution using \(\hat{p}\) in our SE.
The GSS asks “Are you better off today than you were four years ago?”. Compute a 95% confidence interval for the proportion of Americans that are better off today than four years ago. Use a CLT-based approach.
First, we’ll check the success-failure condition.
gss_2010 %>%
  count(better)
## # A tibble: 2 x 2
## better n
## * <dbl> <int>
## 1 0 1395
## 2 1 649
With 649 successes and 1395 failures, both counts are well above 10, so the success-failure condition is met. We’re also assuming these observations are independent.
Let’s compute our 95% confidence interval.
gss_2010 %>%
  mutate(better = ifelse(better == 1, "better", "worse")) %>%
  prop_test(response = better, conf_level = 0.95) %>%
  select(ends_with("ci"))
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.297 0.338
Redo the above analysis using the confidence interval formula directly, but this time create a 90% confidence interval.
p_hat <- gss_2010 %>%
  count(better) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(better == 1) %>%
  pull(p_hat)
n <- nrow(gss_2010)
p_hat + c(-1, 1) * qnorm(p = 0.95) * sqrt(p_hat * (1 - p_hat) / n)
## [1] 0.3005785 0.3344509
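As a cross-check, we can ask prop_test() for a 90% interval as well; the bounds may differ slightly from the hand calculation because prop_test() constructs the interval a bit differently than the formula above.
# Same 90% interval via prop_test()
gss_2010 %>%
  mutate(better = ifelse(better == 1, "better", "worse")) %>%
  prop_test(response = better, conf_level = 0.90) %>%
  select(ends_with("ci"))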