class: center, middle, inverse, title-slide

# CLT-based inference: confidence intervals

---

class: center, middle, inverse

# Sample statistics and sampling distributions

---

## Variability of sample statistics

- Each sample from the population yields a slightly different sample statistic (sample mean, sample proportion, etc.)
- The variability of these sample statistics is measured by the .vocab[standard error]
- Previously we quantified this value via simulation
- Today we'll discuss some of the theory underlying .vocab[sampling distributions], particularly as they relate to *sample means*.

---

## Recall

Statistical inference is the process of generalizing from a sample to make conclusions about a population. As part of this process, we quantify the variability of our sample statistic.

We are interested in population parameters, which we do not observe. Instead, we must calculate statistics from our sample in order to learn about them.

---

## Sampling distribution of the mean

Suppose we're interested in the resting heart rate of students at Duke, and are able to do the following:

--

1. Take a random sample of size `\(n\)` from this population, and calculate the mean resting heart rate in this sample, `\(\bar{X}_1\)`

--

2. Put the sample back, take a second random sample of size `\(n\)`, and calculate the mean resting heart rate from this new sample, `\(\bar{X}_2\)`

--

3. Put the sample back, take a third random sample of size `\(n\)`, and calculate the mean resting heart rate from this sample, too...

--

...and so on.

--

After repeating this many times, we have a dataset that has the sample averages from the population: `\(\bar{X}_1\)`, `\(\bar{X}_2\)`, `\(\cdots\)`, `\(\bar{X}_K\)` (assuming we took `\(K\)` total samples).

---

## Sampling distribution of the mean

.question[
Can we say anything about the distribution of these sample means?
]

*(Keep in mind, we don't know what the underlying distribution of mean resting heart rate looks like in Duke students!)*

--

<font class = "vocab">As it turns out, yes we can!</font>

---

## The Central Limit Theorem

For a population with a well-defined mean `\(\mu\)` and standard deviation `\(\sigma\)`, these three properties hold for the distribution of the sample average `\(\bar{X}\)`, assuming certain conditions hold:

1. The mean of the sampling distribution is identical to the population mean `\(\mu\)`,

2. The standard deviation of the distribution of the sample averages is `\(\sigma/\sqrt{n}\)`, or the **standard error** (SE) of the mean, and

3. For `\(n\)` large enough (in the limit, as `\(n \to \infty\)`), the shape of the sampling distribution of means is approximately *normal* (Gaussian).

---

## What is the normal (Gaussian) distribution?

The normal distribution is unimodal and symmetric and is described by its *density function*:

If a random variable `\(X\)` follows the normal distribution, then

`$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2} \right\}$$`

where `\(\mu\)` is the mean and `\(\sigma^2\)` is the variance. We often write `\(N(\mu, \sigma^2)\)` to describe this distribution.

---

## The normal distribution (graphically)

We will talk about probability densities and using them to define probabilities during next week's lecture, but for now, just know that the normal distribution is the familiar "bell curve":

<img src="15-clt-ci_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## But we didn't know anything about the underlying distribution!

The central limit theorem tells us that sample averages are approximately normally distributed if we have enough data. This is true even if our original variables are not normally distributed.
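As a quick sketch of this fact (using an exponential population purely for illustration, since it is strongly skewed):

```r
# Sketch: sample means from a skewed (exponential) population.
# The population itself is far from normal, yet the distribution
# of the 1000 sample means is roughly bell-shaped.
set.seed(1)  # illustrative seed
means <- replicate(1000, mean(rexp(n = 50, rate = 1)))
c(mean = mean(means), se = sd(means))
# mean should be near 1 (the population mean);
# se should be near 1 / sqrt(50), i.e. about 0.14
```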
--

[**Check out this interactive demonstration!**](http://onlinestatbook.com/stat_sim/sampling_dist/index.html)

---

## Conditions

What are the conditions we need for the CLT to hold?

--

- **Independence:** The sampled observations must be independent. This is difficult to check, but the following are useful guidelines:
    - the sample must be random
    - if sampling without replacement, the sample size must be less than 10% of the population size

--

- **Sample size / distribution:**
    - if data are numerical, usually `\(n > 30\)` is considered a large enough sample, but if the underlying population distribution is extremely skewed, more might be needed
    - if we know for sure that the underlying data are normal, then the distribution of sample averages will also be exactly normal, regardless of the sample size
    - if data are categorical, at least 10 successes and 10 failures are needed

---

## Let's run our own simulation

**The underlying population** (we never observe this!)

```r
rs_pop <- tibble(x = rbeta(100000, 1, 5) * 100)
```

<img src="15-clt-ci_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

**The true population parameters**

```r
rs_pop %>%
  summarize(mu = mean(x), sigma = sd(x))
```

```
#> # A tibble: 1 x 2
#>      mu sigma
#>   <dbl> <dbl>
#> 1  16.7  14.1
```

---

## Sampling from the population - 1

```r
samp_1 <- rs_pop %>%
  sample_n(size = 50, replace = TRUE)
```

--

```r
samp_1 %>%
  summarize(x_bar = mean(x))
```

```
#> # A tibble: 1 x 1
#>   x_bar
#>   <dbl>
#> 1  14.4
```

---

## Sampling from the population - 2

```r
samp_2 <- rs_pop %>%
  sample_n(size = 50, replace = TRUE)
```

--

```r
samp_2 %>%
  summarize(x_bar = mean(x))
```

```
#> # A tibble: 1 x 1
#>   x_bar
#>   <dbl>
#> 1  16.2
```

---

## Sampling from the population - 3

```r
samp_3 <- rs_pop %>%
  sample_n(size = 50, replace = TRUE)
```

--

```r
samp_3 %>%
  summarize(x_bar = mean(x))
```

```
#> # A tibble: 1 x 1
#>   x_bar
#>   <dbl>
#> 1  18.4
```

--

Keep repeating...
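---

## Sketch: automating the repetition

Rather than copying the sampling code by hand, the "keep repeating" step can be sketched in base R with `replicate()` (the slides that follow use `infer::rep_sample_n()` for the same idea):

```r
# Recreate the population from the earlier slide, then repeat the
# sampling many times with replicate() (base R; no extra packages)
set.seed(42)                      # illustrative seed
pop <- rbeta(100000, 1, 5) * 100  # same population as rs_pop$x
xbars <- replicate(1000, mean(sample(pop, size = 50, replace = TRUE)))
c(mean = mean(xbars), se = sd(xbars))
# mean should land near mu (about 16.7);
# se should land near sigma / sqrt(50) (about 2)
```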
---

## Sampling distribution

.tiny[
```r
sampling <- rs_pop %>%
  rep_sample_n(size = 50, replace = TRUE, reps = 1000) %>%
  group_by(replicate) %>%
  summarize(xbar = mean(x))
```
]

<img src="15-clt-ci_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

## Sampling distribution quantities

```
#> # A tibble: 1 x 2
#>    mean    se
#>   <dbl> <dbl>
#> 1  16.7  1.99
```

---

## Comparison

.question[
How do the shapes, centers, and spreads of these distributions compare?
]

<img src="15-clt-ci_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

## Recap

- If certain assumptions are satisfied, **regardless of the shape of the population distribution**, the sampling distribution of the mean follows an approximately normal distribution.
- The center of the sampling distribution is at the center of the population distribution.
- The sampling distribution is less variable than the population distribution (and we can quantify by how much).

--

.question[
What is the standard error, and how are the standard error and sample size related? What does that say about how the spread of the sampling distribution changes as `\(n\)` increases?
]

--

How can we use these new results to construct confidence intervals?

---

## Let's use the CLT to create confidence intervals

Click the link below to create the repository for lecture notes #15.

- [https://classroom.github.com/a/q6pfmkmx](https://classroom.github.com/a/q6pfmkmx)
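---

## A CLT-based interval, sketched

As a preview, here is a minimal sketch of a 95% CLT-based confidence interval for a mean, computed from one simulated sample (the seed and sample are illustrative, so your numbers will differ):

```r
# Sketch: a 95% CLT-based confidence interval for the mean,
# from one simulated sample (illustrative; not from the slides)
set.seed(7)                        # illustrative seed
x <- rbeta(50, 1, 5) * 100         # one sample of size n = 50
n <- length(x)
se <- sd(x) / sqrt(n)              # estimated standard error of the mean
z_star <- qnorm(0.975)             # ~1.96, the 95% normal critical value
mean(x) + c(-1, 1) * z_star * se   # lower and upper bounds
```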