Foundations of inference

# Foundations of inference

---

# Recall

---

## The statistical process

Statistics is a process that converts data into useful information, whereby
practitioners

1. form a question of interest,

2. collect and summarize data,

3. and interpret the results.

---

## The population of interest

The .vocab[population] is the group we'd like to learn something about. For 
example:

- What is the prevalence of diabetes among **U.S. adults**, and has it changed
  over time? 
  
- Does the average amount of caffeine vary by vendor in **12 oz. cups of**
  **coffee at Duke coffee shops**?
  
- Is there a relationship between tumor type and five-year mortality among
  **breast cancer patients**?

The .vocab[research question of interest] is what we want to answer, often 
relating one or more numerical quantities or summary statistics.

If we had data from every unit in the population, we could just calculate what
we wanted and be done!

---

## Sampling from the population

Unfortunately, we (usually) have to settle for a .vocab[sample] from the
population.

Ideally, the sample is .vocab[representative], allowing us to draw conclusions 
that are .vocab[generalizable] to the broader population of interest.

In order to make a formal statistical statement about the broader population of
interest when all we have is a sample, we need to use the tools of probability
and statistical inference.

---

## Big picture

Let's discuss a few population characteristics we might be interested in.

---

# Terminology

---

## Explanatory and response variables

When we suspect one variable might causally affect another, we label the first variable the .vocab[explanatory variable] and the second the 
.vocab[response variable]. *Whether or not we can actually make this causal 
connection will depend on the type of statistical study (more on this shortly).*

<br/>

`$$\mbox{Explanatory Variable} \longrightarrow \mbox{Response Variable}$$`

<br/>

Do larger homes in good locations lead to higher home selling prices? What
are the explanatory and response variables?

---

## Population, parameter; sample, statistic

.vocab[Parameter]: a numerical quantity derived from a population
  - Parameters could be the mean, median, correlation, maximum, etc.

If we had data from every unit in the population, we could just calculate 
population parameters and be done! **Unfortunately, we usually cannot do this.**

---

## Take a sample

.vocab[Statistic]: a numerical quantity derived from a sample
  - Statistics could be the mean, median, correlation, maximum, etc.

Naturally, it makes sense to use the sample mean (and other quantities 
derived from the sample) to make generalizations about the population mean.
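
The distinction between a parameter and a statistic can be sketched with a small simulation. This is only an illustration: the population of caffeine measurements, its mean and spread, and the sample size of 50 are all hypothetical values chosen for the example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: caffeine content (mg) of 10,000 cups of coffee.
population = [random.gauss(150, 20) for _ in range(10_000)]

# Parameter: computed from every unit in the population (usually unknown).
population_mean = statistics.mean(population)

# Statistic: computed from a sample of 50 cups drawn without replacement.
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.1f}")
print(f"statistic (sample mean):     {sample_mean:.1f}")
```

The sample mean will typically be close to, but not exactly equal to, the population mean, and it changes from sample to sample.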

---

## Statistical inference

.vocab[Statistical inference] is the process of using sample data to make 
  conclusions about the underlying population the sample came from.

- .vocab[Estimation]: estimating an unknown parameter based on values from the
  sample at hand

- .vocab[Testing]: evaluating whether our observed sample provides evidence 
  for or against some claim about the population
  
In the coming lectures we'll discuss each of these inference approaches.

<br/>

Before we get into this, let's discuss ways samples can be obtained and
what types of conclusions we'll be able to make and **not** make as a result
of our statistical process.

---

# Sampling

---

## Sampling strategies

- In our discussions on probability, we considered randomly selecting
  individuals from studies, where each individual was equally likely to be
  selected. This form of random sampling is known as 
  .vocab[simple random sampling].

--
  
- .vocab[Stratified sampling] divides the population into .vocab[strata] such
  that each stratum is homogeneous. Then a simple random sample is taken within
  each stratum.

--
  
- .vocab[Cluster sampling] first partitions the population into
  .vocab[clusters], where each cluster is representative of the population. A
  fixed number of clusters is then randomly selected, and all observations
  within the selected clusters are included in the sample.
  
--
  
- .vocab[Multistage sampling] is similar to cluster sampling, but rather than
  keep all observations in each cluster, only a random sample of observations
  is kept.
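
The four strategies above can be sketched as follows. This is a minimal illustration, not a recommended design: the population of 30 groups with 100 people each is hypothetical, and for simplicity the same groups serve as strata in the stratified sample and as clusters in the cluster and multistage samples.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical population: 30 groups (e.g. villages) of 100 people each.
population = [(group, person) for group in range(30) for person in range(100)]
groups = {g: [u for u in population if u[0] == g] for g in range(30)}

# Simple random sampling: every individual is equally likely to be chosen.
srs = random.sample(population, 150)

# Stratified sampling: a simple random sample of 5 within every stratum.
stratified = [u for members in groups.values() for u in random.sample(members, 5)]

# Cluster sampling: randomly pick 2 clusters and keep everyone in them
# (2 x 100 = 200 people, since whole clusters are kept).
chosen = random.sample(sorted(groups), 2)
cluster = [u for g in chosen for u in groups[g]]

# Multistage sampling: randomly pick 10 clusters, then sample 15 people in each.
chosen = random.sample(sorted(groups), 10)
multistage = [u for g in chosen for u in random.sample(groups[g], 15)]

for name, s in [("simple random", srs), ("stratified", stratified),
                ("cluster", cluster), ("multistage", multistage)]:
    visited = len({g for g, _ in s})
    print(f"{name:>13}: n = {len(s):3d}, groups visited = {visited}")
```

Comparing the number of groups visited gives a rough sense of the travel cost each strategy would imply.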

---

## Example

Suppose we are interested in estimating the malaria rate in a densely forested, 
tropical portion of rural Indonesia. We learn that there are 30 villages in that 
part of the Indonesian jungle, each more or less similar to the next. Our goal 
is to test 150 individuals for malaria. What are the costs and benefits of using
the four aforementioned sampling techniques?

???

- Simple random sample: expensive, may not get good representation from all
  30 villages
  
- Stratified sample: not clear how to build strata on an individual basis. 
  If strata are the villages, then we must sample from all 30 villages, which
  is expensive.
  
- Cluster sample / multistage: these are the best options here.

---

## Sample bias

- The four sampling strategies help reduce .vocab[bias] in our sample. A biased
  sample can lead to erroneous conclusions.
  
- Bias can still appear if the non-response rate is very high. 
    - Is our sample representative of the population or is it representative of 
      the population that "responded" to the survey?
  
<img src="12-found-inf_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
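
Non-response bias can be illustrated with a short simulation. Everything here is an assumption made for the sketch: the population of satisfaction scores is hypothetical, and so is the response mechanism in which people with lower scores are less likely to reply.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 satisfaction scores on a 1-10 scale.
population = [random.randint(1, 10) for _ in range(10_000)]

def responds(score):
    """Assumed response mechanism: probability of responding is score / 10."""
    return random.random() < score / 10

sampled = random.sample(population, 1_000)          # simple random sample
respondents = [s for s in sampled if responds(s)]   # those who actually answered

print(f"population mean:  {statistics.mean(population):.2f}")
print(f"all sampled:      {statistics.mean(sampled):.2f}")
print(f"respondents only: {statistics.mean(respondents):.2f}")
```

Even though the initial sample is a simple random sample, the respondents alone overstate the average score because the less satisfied individuals drop out.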

---

# Statistical studies and conclusions

---

## Observational studies and experiments

- Observational
    - Collect data in a way that does not interfere with how the data arise 
      ("observe")
    - Only establish an **association**
    - Data often cheaper and easier to collect
    
--

- Experimental
    - Randomly assign subjects to treatments
    - Establish **causal connections**
    - Often more expensive
    - Sometimes it is impossible or unethical to design an experiment
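
Random assignment in an experiment can be sketched as follows. The 20 numbered subjects and the even split into two arms are hypothetical choices for illustration; real trials often use more elaborate schemes.

```python
import random

random.seed(4)  # fixed seed so the sketch is reproducible

# Hypothetical study: 20 volunteer subjects, identified by number.
subjects = list(range(1, 21))

# Random assignment: shuffle the subjects, then split them in half,
# so chance alone decides who receives the treatment.
shuffled = random.sample(subjects, len(subjects))
treatment, control = sorted(shuffled[:10]), sorted(shuffled[10:])

print("treatment:", treatment)
print("control:  ", control)
```

Note that this assigns subjects already in the study to groups; it says nothing about how those subjects were sampled from the population.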

---

## Random sampling vs. random assignment

What do you think Pfizer did in their trials for the COVID-19 vaccine 
development?

---

# Pitfalls

---

## "Lucky coincidences"

![](img/12/correlation1.png)

*Source*: [Tyler Vigen's site of spurious correlations](https://www.tylervigen.com/spurious-correlations)

---

## "Lucky coincidences"

![](img/12/correlation2.png)

*Source*: [Tyler Vigen's site of spurious correlations](https://www.tylervigen.com/spurious-correlations)

---

## "Lucky coincidences"

![](img/12/correlation3.png)

*Source*: [Tyler Vigen's site of spurious correlations](https://www.tylervigen.com/spurious-correlations)

---

## Confounding variables

A .vocab[confounding] variable is an extraneous variable that affects both 
the explanatory and the response variable, and can make it seem like there is 
a direct relationship between them.

Identify the confounding variable in each of the following statements:

1. As ice cream sales increase, the number of shark attacks also 
   increases.

2. The higher the number of firefighters at a fire is, the greater the amount 
   of damage caused by that fire.

3. Taller children are better at both reading and math compared to shorter 
   children.
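
The general mechanism of confounding can be sketched with generic variables. This is a hypothetical simulation: `z` plays the role of the confounder, `x` and `y` are each driven by `z` plus independent noise, and `x` has no effect on `y` by construction.

```python
import random
import statistics

random.seed(5)  # fixed seed so the sketch is reproducible

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

n = 2_000
z = [random.gauss(0, 1) for _ in range(n)]   # hypothetical confounder
x = [zi + random.gauss(0, 1) for zi in z]    # "explanatory": driven by z only
y = [zi + random.gauss(0, 1) for zi in z]    # "response": driven by z, not by x

# x and y are correlated even though neither affects the other.
print(f"corr(x, y)              = {pearson(x, y):.2f}")

# Removing z's contribution (known here by construction) removes the association.
x_adj = [xi - zi for xi, zi in zip(x, z)]
y_adj = [yi - zi for yi, zi in zip(y, z)]
print(f"corr(x, y), z removed   = {pearson(x_adj, y_adj):.2f}")
```

In real data the confounder's contribution is not known exactly, so it has to be measured and adjusted for, or neutralized through random assignment.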
   
---

Click the link below to create the repository for lecture notes #12.
  - [https://classroom.github.com/a/Rc4bTdY7](https://classroom.github.com/a/Rc4bTdY7)