Each sample from the population yields a slightly different sample statistic (sample mean, sample proportion, etc.).
The variability of these sample statistics is measured by the standard error.
Previously, we estimated the standard error via simulation.
Today we'll discuss some of the theory underlying sampling distributions, particularly as they relate to sample means.
Statistical inference is the process of generalizing from a sample to make conclusions about a population. As part of this process, we quantify the variability of our sample statistic.
We are interested in population parameters, which we do not observe. Instead, we must calculate statistics from our sample in order to learn about them.
Suppose we’re interested in the resting heart rate of students at Duke, and are able to do the following:
Take a random sample of size n from this population, and calculate the mean resting heart rate in this sample, X̄1
Put the sample back, take a second random sample of size n, and calculate the mean resting heart rate from this new sample, X̄2
Put the sample back, take a third random sample of size n, and calculate the mean resting heart rate from this sample, too...
...and so on.
After repeating this many times, we have a dataset of sample averages from the population: X̄1, X̄2, ⋯, X̄K (assuming we took K total samples).
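The repeated-sampling procedure above can be sketched in R. The slides don't include the heart-rate data, so this uses a hypothetical normal population (mean 70 bpm, sd 10 bpm — assumed values for illustration only):

```r
set.seed(20)
# Hypothetical population of resting heart rates (assumed, not the real Duke data)
population <- rnorm(10000, mean = 70, sd = 10)

n <- 50    # size of each sample
K <- 1000  # number of repeated samples

# Draw K random samples of size n and record each sample mean X-bar_k
xbar <- replicate(K, mean(sample(population, size = n)))

head(xbar)  # the first few of the K sample averages
```

The resulting vector `xbar` is exactly the dataset X̄1, X̄2, ⋯, X̄K described above, and a histogram of it approximates the sampling distribution of the mean.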
For a population with a well-defined mean μ and standard deviation σ, these three properties hold for the distribution of sample average ¯X, assuming certain conditions hold:
The mean of the sampling distribution is identical to the population mean μ,
The standard deviation of the distribution of the sample averages is σ/√n, or the standard error (SE) of the mean, and
For n large enough (in the limit, as n→∞), the shape of the sampling distribution of means is approximately normal (Gaussian).
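The first two properties can be checked by simulation. This sketch assumes a normal population with μ = 70 and σ = 10 (illustrative values, not from the slides):

```r
set.seed(36)
mu <- 70; sigma <- 10; n <- 50

# Build the sampling distribution of the mean by simulation
xbar <- replicate(5000, mean(rnorm(n, mean = mu, sd = sigma)))

mean(xbar)        # close to the population mean mu
sd(xbar)          # close to the theoretical standard error
sigma / sqrt(n)   # SE = sigma / sqrt(n)
```

The simulated mean of `xbar` lands near μ, and its standard deviation lands near σ/√n, matching the two formulas above.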
The normal distribution is unimodal and symmetric and is described by its density function:
If a random variable X follows the normal distribution, then f(x) = (1/√(2πσ²)) exp{−(x−μ)²/(2σ²)}, where μ is the mean and σ² is the variance.
We often write N(μ,σ2) to describe this distribution.
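As a quick sanity check, the density formula above can be written out by hand and compared against R's built-in `dnorm()`:

```r
# The N(mu, sigma^2) density, written out from the formula above
f <- function(x, mu, sigma2) {
  1 / sqrt(2 * pi * sigma2) * exp(-(x - mu)^2 / (2 * sigma2))
}

f(1, mu = 0, sigma2 = 1)    # density of the standard normal at x = 1
dnorm(1, mean = 0, sd = 1)  # agrees with R's built-in density
```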
The central limit theorem tells us that sample averages are normally distributed, if we have enough data. This is true even if our original variables are not normally distributed.
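To see this in action, here is a sketch with a strongly right-skewed population (an exponential distribution — an assumed example, not data from the slides). A crude skewness measure shows the sample means becoming symmetric as n grows:

```r
set.seed(12)
# A strongly right-skewed population: exponential draws
pop_draws <- function(n) rexp(n, rate = 1/16)

# Sampling distributions of the mean for small and large n
xbar_small <- replicate(2000, mean(pop_draws(5)))
xbar_large <- replicate(2000, mean(pop_draws(100)))

# Crude skewness: 0 for a symmetric distribution
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(pop_draws(10000))  # strongly skewed population
skew(xbar_small)        # n = 5: sample means still noticeably skewed
skew(xbar_large)        # n = 100: sample means approximately symmetric
```

Even though every individual draw comes from a skewed distribution, the distribution of X̄ flattens out toward a normal shape as n increases.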
What are the conditions we need for the CLT to hold?
Independence: the sampled observations must be independent of each other (e.g., from a random sample; if sampling without replacement, n should be less than 10% of the population).
Sample size/skew: either the population is normally distributed, or n is large enough — the more skewed the population, the larger n must be for the approximation to hold.
The true population parameters
rs_pop %>% summarize(mu = mean(x), sigma = sd(x))
#> # A tibble: 1 x 2
#>      mu sigma
#>   <dbl> <dbl>
#> 1  16.7  14.1
If certain assumptions are satisfied, regardless of the shape of the population distribution, the sampling distribution of the mean follows an approximately normal distribution.
The center of the sampling distribution is at the center of the population distribution.
The sampling distribution is less variable than the population distribution (and we can quantify by how much).
What is the standard error, and how are the standard error and sample size related? What does that say about how the spread of the sampling distribution changes as n increases?
How can we use these new results to construct confidence intervals?
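On the standard error question above: since SE = σ/√n, quadrupling the sample size halves the spread of the sampling distribution. Using the population sigma of 14.1 reported for `rs_pop` earlier:

```r
sigma <- 14.1       # population sd reported for rs_pop above

sigma / sqrt(25)    # SE of the mean with n = 25
sigma / sqrt(100)   # quadrupling n to 100 halves the SE
```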