Introduction to Probability

Song of the Day

Locomotive Breath -Jethro Tull

Main ideas

Use formulas to compute probabilities from tabular data
Compute empirical probabilities in R via simulation

Coming Up

Lab #05 this Friday 2-26
Homework #03 goes live on Thursday 2-26

Lecture Notes and Exercises

We will use the tidyverse as usual and vcd to work with the Arthritis data. Note you may need to install vcd using install.packages().

library(tidyverse)
library(vcd)

Computing probabilities

data(Arthritis)
glimpse(Arthritis)

## Rows: 84
## Columns: 5
## $ ID        <int> 57, 46, 77, 17, 36, 23, 75, 39, 33, 55, 30, 5, 63, 83, 66, 4…
## $ Treatment <fct> Treated, Treated, Treated, Treated, Treated, Treated, Treate…
## $ Sex       <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, …
## $ Age       <int> 27, 29, 30, 32, 46, 58, 59, 59, 63, 63, 64, 64, 69, 70, 23, …
## $ Improved  <ord> Some, None, None, Marked, Marked, Marked, None, Marked, None…

Take a look at the help for Arthritis to understand where this data comes from and the variable meanings.

Let’s look at the data in a tabular view. Don’t worry about understanding these functions, we’re only using it to better visualize our data via a table.

xtabs(~ Treatment + Improved, data = Arthritis) %>% 
  addmargins()

##          Improved
## Treatment None Some Marked Sum
##   Placebo   29    7      7  43
##   Treated   13    7     21  41
##   Sum       42   14     28  84

How many patients were enrolled in the clinical trial?

There were 84 patients enrolled in the clinical trial.

What is the probability a randomly selected patient received the placebo?

The probability a randomly selected patient received the placebo is 43/84.

What is the probability a randomly selected patient received the placebo and had a marked improvement?

The probability a randomly selected patient received the placebo and had a marked improvement is 7/84.

What is the probability a randomly selected patient received the placebo and the treatment?

The probability a randomly selected patient received the placebo and the treatment is 0. These are disjoint or mutually exclusive events.

What is the probability a randomly selected patient had some improvement or was on the treatment?

The probability a randomly selected patient had some improvement or was on the treatment is (14 + 41 - 7)/84 = 48/84.

Using computer simulations to calculate probabilities

Example

A vector is the basic building block in R. Let’s create a vector representing colored marbles in a box. Call it marbles.

marbles <- c("red", "red", "white", "red", "blue", "blue", "red", "blue")

Suppose we draw a single marble from our imaginary box, where all the marbles are equally likely to be selected. What is the probability the marble is blue? How about white?

The probability the marble is 0.50, the probability the marble is blue is 0.375 and the probability the marble is white is 0.125.

We can simulate this “drawing” with the sample() function.

sample(marbles, size = 1)

## [1] "red"

We produced one random outcome from this experiment. To estimate the probability of getting a white marble, we need to repeat this experiment many many times.

In the sample() function we can change the size argument and set replace = TRUE. Setting replace = TRUE allows us to draw from our population of eight marbles each time. This way we can easily simulate our marble-drawing experiment.

draw_results <- sample(marbles, size = 10000, replace = TRUE)

counts <- table(draw_results)
prop.table(counts)

## draw_results
##   blue    red  white 
## 0.3692 0.5045 0.1263

This is very close to our “true” probability!

To summarize our process:

We defined the sample space for our experiment: marbles
We simulated this experiment many many times and recorded the outcomes from each of the simulations.
We computed the relative frequency of the observed outcomes from our many simulations.

Another example

What if we want to compute the probability of getting two marbles of the same color if we make two draws with replacement? We haven’t discussed how to compute this theoretically yet, but this is what computers are good at.

Before we do this, what is your guess as to what the probability will be?

We’ll still use sample() to run our simulation many times, but we’ll use dplyr functions to compute the relative frequencies.

two_draw_results <- tibble(
  draw_1 = sample(marbles, size = 10000, replace = TRUE),
  draw_2 = sample(marbles, size = 10000, replace = TRUE)
)

two_draw_results

## # A tibble: 10,000 x 2
##    draw_1 draw_2
##    <chr>  <chr> 
##  1 red    red   
##  2 red    red   
##  3 blue   blue  
##  4 red    white 
##  5 white  blue  
##  6 blue   blue  
##  7 white  white 
##  8 red    red   
##  9 red    red   
## 10 blue   red   
## # … with 9,990 more rows

How can we add a variable to two_draw_results to see if draw_1 and draw_2 match?

two_draw_results <- two_draw_results %>% 
  mutate(color_match = draw_1 == draw_2)

two_draw_results

## # A tibble: 10,000 x 3
##    draw_1 draw_2 color_match
##    <chr>  <chr>  <lgl>      
##  1 red    red    TRUE       
##  2 red    red    TRUE       
##  3 blue   blue   TRUE       
##  4 red    white  FALSE      
##  5 white  blue   FALSE      
##  6 blue   blue   TRUE       
##  7 white  white  TRUE       
##  8 red    red    TRUE       
##  9 red    red    TRUE       
## 10 blue   red    FALSE      
## # … with 9,990 more rows

All that remains is to compute the relative frequency of the observed outcomes from our many simulations.

two_draw_results %>% 
  count(color_match) %>% 
  mutate(proportion = n / sum(n))

## # A tibble: 2 x 3
##   color_match     n proportion
## * <lgl>       <int>      <dbl>
## 1 FALSE        6026      0.603
## 2 TRUE         3974      0.397

Practice

Suppose you roll two fair six-sided dice. Which has a higher probability: the square of dice roll 1 is equal to dice roll 2; or the absolute value of the difference between dice roll 1 and dice roll 2 is equal to 4.

Perform a simulation to compute this empirical probability.

Write down your guess to the answer before you calculate it.

roll_results <- tibble(
  die_1 = sample(1:6, size = 10000, replace = TRUE),
  die_2 = sample(1:6, size = 10000, replace = TRUE)
)

roll_results

## # A tibble: 10,000 x 2
##    die_1 die_2
##    <int> <int>
##  1     5     6
##  2     4     3
##  3     5     6
##  4     2     1
##  5     1     6
##  6     4     2
##  7     6     3
##  8     5     1
##  9     2     4
## 10     3     5
## # … with 9,990 more rows

roll_results <- roll_results %>% 
  mutate(event_1 = die_1 ^ 2 == die_2,
         event_2 = abs(die_1 - die_2) == 4)

roll_results

## # A tibble: 10,000 x 4
##    die_1 die_2 event_1 event_2
##    <int> <int> <lgl>   <lgl>  
##  1     5     6 FALSE   FALSE  
##  2     4     3 FALSE   FALSE  
##  3     5     6 FALSE   FALSE  
##  4     2     1 FALSE   FALSE  
##  5     1     6 FALSE   FALSE  
##  6     4     2 FALSE   FALSE  
##  7     6     3 FALSE   FALSE  
##  8     5     1 FALSE   TRUE   
##  9     2     4 TRUE    FALSE  
## 10     3     5 FALSE   FALSE  
## # … with 9,990 more rows

roll_results %>% 
  summarise(event_1_prob = mean(event_1),
            event_2_prob = mean(event_2))

## # A tibble: 1 x 2
##   event_1_prob event_2_prob
##          <dbl>        <dbl>
## 1       0.0556        0.112