Main ideas
- Understand the CLT and how to use it
- Create confidence intervals for a population mean using a CLT-based approach
- Create confidence intervals for a population proportion using a CLT-based approach
library(tidyverse)
library(infer)
In the examples and practice sections, we’ll work with a subset of data from the General Social Survey.
gss_2010 <- read_csv("data/gss_2010.csv")
Recall that for a population with a well-defined mean \(\mu\) and standard deviation \(\sigma\), these three properties hold for the distribution of the sample average \(\bar{X}\), assuming certain conditions are met:
- The distribution of \(\bar{X}\) is centered at the population mean \(\mu\).
- The standard deviation of \(\bar{X}\) (its standard error) is \(\sigma/\sqrt{n}\), so it shrinks as the sample size grows.
- For large enough samples, the distribution of \(\bar{X}\) is approximately normal (see the simulation sketch below).
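To see these properties in action, here is a minimal simulation sketch (the population, sample size, and number of repetitions are arbitrary choices): even when the population is clearly skewed, the sample means pile up in an approximately normal shape around \(\mu\).
# Simulation sketch of the CLT: repeatedly sample from a skewed population
# and look at the distribution of the resulting sample means
population <- rexp(100000, rate = 1)   # a right-skewed population
sample_means <- replicate(2000, mean(sample(population, size = 50)))

ggplot(tibble(x_bar = sample_means), aes(x = x_bar)) +
  geom_histogram(bins = 30) +
  labs(x = "Sample mean", y = "Count")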
Knowing the distribution of the sample statistic \(\bar{X}\) can help us
- estimate a population parameter as point estimate \(\pm\) margin of error, where the margin of error combines a measure of how confident we want to be with the sample statistic’s variability, and
- test a claim about a population parameter by evaluating how likely it is to obtain the observed sample statistic when assuming that the null hypothesis is true, since this probability depends on the sampling distribution’s variability.
If the necessary conditions are met, we can also use inference methods based on the CLT. The CLT tells us that \(\bar{X}\) approximately follows the distribution \(N\left(\mu, \sigma/\sqrt{n}\right)\). That is,
\[Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)\]
Visualize some normal densities
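Here is one way to sketch a few normal densities with ggplot2 (the particular means and standard deviations are arbitrary choices for illustration):
# Overlay a few normal densities with different centers and spreads
tibble(x = seq(-6, 6, length.out = 500)) %>%
  mutate(`N(0, 1)` = dnorm(x, mean = 0, sd = 1),
         `N(0, 2)` = dnorm(x, mean = 0, sd = 2),
         `N(2, 1)` = dnorm(x, mean = 2, sd = 1)) %>%
  pivot_longer(-x, names_to = "distribution", values_to = "density") %>%
  ggplot(aes(x = x, y = density, color = distribution)) +
  geom_line() +
  labs(x = "x", y = "Density")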
While we can (and will) use the CLT result to do inference, in practice, we never know the true value of \(\sigma\), and so we estimate it from our data with \(s\) (sample standard deviation). The quantity \(T\) has a t-distribution with \(n-1\) degrees of freedom:
\[ T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}\]
- The \(t\)-distribution is unimodal, symmetric, and centered at 0.
- It has thicker tails than the normal distribution (to make up for the additional variability introduced by using \(s\) instead of \(\sigma\)).
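The plot discussed next can be sketched along these lines, overlaying \(t\)-densities for a few degrees of freedom on the standard normal density (the specific degrees of freedom are arbitrary choices):
# Compare t-densities at several degrees of freedom to the standard normal
tibble(x = seq(-4, 4, length.out = 500)) %>%
  mutate(`t, df = 2` = dt(x, df = 2),
         `t, df = 5` = dt(x, df = 5),
         `t, df = 30` = dt(x, df = 30),
         `N(0, 1)` = dnorm(x)) %>%
  pivot_longer(-x, names_to = "distribution", values_to = "density") %>%
  ggplot(aes(x = x, y = density, color = distribution)) +
  geom_line() +
  labs(x = "x", y = "Density")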
What do you notice in the plot?
The \(t\)-distributions are unimodal, symmetric, and centered at 0. They have thicker tails than the normal distribution, but as the degrees of freedom increase, the \(t\)-distribution more closely resembles the normal distribution.
Recall that in our bootstrap simulation-based approach to creating confidence intervals, the last step was to calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution. Rather than work with the bootstrap distribution, we can work directly with the theoretical sampling distribution of the sample statistic, which the CLT gives us.
To find cutoffs (quantiles) from the normal and t distributions, we can use the functions qnorm() and qt(), respectively.
qnorm(p = 0.975, mean = 0, sd = 1)
## [1] 1.959964
qnorm(0.975)
## [1] 1.959964
qt(p = 0.975, df = 5)
## [1] 2.570582
qt(p = 0.975, df = 10)
## [1] 2.228139
qt(p = 0.975, df = 1000)
## [1] 1.962339
The GSS asks “After an average work day, about how many hours do you have to relax or pursue activities that you enjoy?”. Compute a 95% confidence interval for the mean hours of relaxation time per day after work using a CLT-based approach.
First, we’ll check out our sample data and compute some summary statistics.
hrs_relax_stats <- gss_2010 %>%
  filter(!is.na(hrsrelax)) %>%           # drop missing responses
  summarize(x_bar = mean(hrsrelax),      # sample mean
            s = sd(hrsrelax),            # sample standard deviation
            n = n())                     # sample size
hrs_relax_stats
## # A tibble: 1 x 3
## x_bar s n
## <dbl> <dbl> <int>
## 1 3.68 2.63 1154
Let’s grab these three statistics as vectors to make it easier to compute our confidence interval.
n <- hrs_relax_stats$n
x_bar <- hrs_relax_stats$x_bar
s <- hrs_relax_stats$s
Our confidence interval formula is given by
\[\mbox{point estimate} \pm t^* \times \mbox{SE},\]
where our point estimate will be the sample mean, \(t^*\) is the critical value from the t-distribution corresponding to the desired confidence level, and the standard error is a function of the sample standard deviation and the sample size.
\[\bar{x} \pm t^* \times \frac{s}{\sqrt{n}}\]
(t_crit <- qt(p = 0.975, df = n - 1))
## [1] 1.962024
x_bar + c(-1, 1) * t_crit * (s / sqrt(n))
## [1] 3.528364 3.832122
We are 95% confident that the true mean number of hours American adults have after work to relax or pursue activities they enjoy is between 3.53 and 3.83 hours.
The infer package has a function to do these calculations in one step. The function t_test() is a tidier version of the built-in R function t.test().
t_test(gss_2010, response = hrsrelax, conf_level = 0.95)
## # A tibble: 1 x 6
## statistic t_df p_value alternative lower_ci upper_ci
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 47.5 1153 5.37e-274 two.sided 3.53 3.83
For now, focus on the last two variables, lower_ci and upper_ci. Next time we’ll discuss the first four in our lecture on hypothesis testing.
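If all we need are the interval bounds, we can keep just the columns ending in ci, the same way we do with prop_test() later on:
# Keep only the confidence interval bounds from the t_test() output
t_test(gss_2010, response = hrsrelax, conf_level = 0.95) %>%
  select(ends_with("ci"))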
What assumptions must we make for this inference procedure to be valid?
- Independence: the sample must be random, and the sample size must be less than 10% of the population.
- Sample size: the sample size must be greater than 30 (see the quick check below).
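As a quick sketch, we can verify the sample size condition for the hrsrelax analysis in code; independence has to be argued from the GSS sampling design rather than computed.
# The hrsrelax sample size easily clears the 30 threshold
gss_2010 %>%
  filter(!is.na(hrsrelax)) %>%
  summarize(n = n(), sample_size_ok = n() > 30)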
The built-in dataset quakes gives information on seismic events near Fiji since 1964.
Take a random sample of 40 events from quakes. You can use dplyr’s slice_sample(). Save this result as an object named quakes_40.
quakes_40 <- quakes %>%
  slice_sample(n = 40)
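One practical note: slice_sample() draws a random sample, so your numbers will differ from the output shown below unless you set a seed before sampling. The seed value in this sketch is arbitrary and is not the one behind the results shown here.
# Run before slice_sample() to make the random sample reproducible
set.seed(1234)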
Compute the sample mean, standard deviation, and sample size of the depths in quakes_40.
quakes_40 %>%
  summarize(x_bar = mean(depth), s = sd(depth), n = n())
## x_bar s n
## 1 317.225 217.1511 40
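As a sketch, we can also compute this 90% interval directly from the formula introduced earlier; it should agree with the t_test() output that follows.
# Formula-based 90% CI for the mean depth, using the statistics from quakes_40
quakes_40_stats <- quakes_40 %>%
  summarize(x_bar = mean(depth), s = sd(depth), n = n())

t_crit_90 <- qt(p = 0.95, df = quakes_40_stats$n - 1)
quakes_40_stats$x_bar + c(-1, 1) * t_crit_90 * (quakes_40_stats$s / sqrt(quakes_40_stats$n))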
t_test(quakes_40, response = depth, conf_level = 0.90)
## # A tibble: 1 x 6
## statistic t_df p_value alternative lower_ci upper_ci
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 9.24 39 2.29e-11 two.sided 259. 375.
We are 90% confident that the true mean depth of seismic events off Fiji is between 259 and 375 km.
quakes consists of all of the seismic activity that has occurred near Fiji. Does your 90% confidence interval cover the population parameter? My 90% confidence interval contains the population parameter, 311.371 km.
As we discovered in class, in the long run, over repeated samples, 90% of intervals calculated using the same method will contain the true population parameter and 10% of intervals will not.
We don’t know if our particular interval contains the truth or not.
quakes %>%
  summarize(mu = mean(depth))
## mu
## 1 311.371
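Here is a minimal sketch of that long-run idea, treating quakes as the population and repeating the sampling many times (the number of repetitions is an arbitrary choice):
# Draw many samples of size 40, build a 90% CI from each, and record
# whether the interval covers the true mean depth
mu <- mean(quakes$depth)

covered <- replicate(1000, {
  d <- sample(quakes$depth, size = 40)
  ci <- mean(d) + c(-1, 1) * qt(p = 0.95, df = 39) * sd(d) / sqrt(40)
  ci[1] <= mu & mu <= ci[2]
})

mean(covered)  # should be close to 0.90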
Our sample proportion \(\hat{p}\) is the most plausible value of the population proportion, \(p\), so it makes sense to build a confidence interval around this point estimate. The standard error provides a guide for how large we should make the confidence interval.
The standard error represents the standard deviation of the point estimate, and when the Central Limit Theorem conditions are satisfied, the point estimate closely follows a normal distribution. The CLT tells us that \(\hat{p}\) approximately has the distribution \(N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)\).
To ensure our sample is “large” enough for proportions, we must verify the success-failure condition: \(n\hat{p} \ge 10\) and \(n(1-\hat{p}) \ge 10\), i.e., at least 10 successes and at least 10 failures in the sample.
A confidence interval for \(p\) is given by
\[\hat{p} \pm z^* \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\] where \(z^*\) corresponds to the confidence level selected. Since we don’t know \(p\) we make a substitution using \(\hat{p}\) in our SE.
The GSS asks “Are you better off today than you were four years ago?”. Compute a 95% confidence interval for the proportion of Americans that are better off today than four years ago. Use a CLT-based approach.
First, we’ll check the success-failure condition.
gss_2010 %>%
  count(better)
## # A tibble: 2 x 2
## better n
## * <dbl> <int>
## 1 0 1395
## 2 1 649
With 649 successes and 1395 failures, both counts are well above 10, so the success-failure condition is met. We’re also assuming these observations are independent.
Let’s compute our 95% confidence interval.
gss_2010 %>%
  mutate(better = ifelse(better == 1, "better", "worse")) %>%
  prop_test(response = better, conf_level = 0.95) %>%
  select(ends_with("ci"))
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.297 0.338
Redo the above analysis using the confidence interval formula directly, but this time create a 90% confidence interval.
p_hat <- gss_2010 %>%
  count(better) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(better == 1) %>%
  pull(p_hat)
n <- nrow(gss_2010)
p_hat + c(-1, 1) * qnorm(p = 0.95) * sqrt(p_hat * (1 - p_hat) / n)
## [1] 0.3005785 0.3344509
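As a cross-check, we can ask prop_test() for a 90% interval as well; the bounds may differ slightly from the hand calculation because prop_test() constructs the interval a bit differently than the formula above.
# Same 90% interval via prop_test()
gss_2010 %>%
  mutate(better = ifelse(better == 1, "better", "worse")) %>%
  prop_test(response = better, conf_level = 0.90) %>%
  select(ends_with("ci"))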