Main ideas
- Use formulas to compute probabilities from tabular data
- Compute empirical probabilities in
R
via simulation
R
via simulationWe will use the tidyverse
as usual and vcd
to work with the Arthritis
data. Note you may need to install vcd
using install.packages()
.
library(tidyverse)
library(vcd)
data(Arthritis)
glimpse(Arthritis)
## Rows: 84
## Columns: 5
## $ ID <int> 57, 46, 77, 17, 36, 23, 75, 39, 33, 55, 30, 5, 63, 83, 66, 4…
## $ Treatment <fct> Treated, Treated, Treated, Treated, Treated, Treated, Treate…
## $ Sex <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, …
## $ Age <int> 27, 29, 30, 32, 46, 58, 59, 59, 63, 63, 64, 64, 69, 70, 23, …
## $ Improved <ord> Some, None, None, Marked, Marked, Marked, None, Marked, None…
Take a look at the help for Arthritis
to understand where this data comes from and the variable meanings.
Let’s look at the data in a tabular view. Don’t worry about understanding these functions, we’re only using it to better visualize our data via a table.
xtabs(~ Treatment + Improved, data = Arthritis) %>%
addmargins()
## Improved
## Treatment None Some Marked Sum
## Placebo 29 7 7 43
## Treated 13 7 21 41
## Sum 42 14 28 84
There were 84 patients enrolled in the clinical trial.
The probability a randomly selected patient received the placebo is 43/84.
The probability a randomly selected patient received the placebo and had a marked improvement is 7/84.
The probability a randomly selected patient received the placebo and the treatment is 0. These are disjoint or mutually exclusive events.
The probability a randomly selected patient had some improvement or was on the treatment is (14 + 41 - 7)/84 = 48/84.
A vector is the basic building block in R
. Let’s create a vector representing colored marbles in a box. Call it marbles
.
marbles <- c("red", "red", "white", "red", "blue", "blue", "red", "blue")
Suppose we draw a single marble from our imaginary box, where all the marbles are equally likely to be selected. What is the probability the marble is blue? How about white?
The probability the marble is 0.50, the probability the marble is blue is 0.375 and the probability the marble is white is 0.125.
We can simulate this “drawing” with the sample()
function.
sample(marbles, size = 1)
## [1] "red"
We produced one random outcome from this experiment. To estimate the probability of getting a white marble, we need to repeat this experiment many many times.
In the sample()
function we can change the size
argument and set replace = TRUE
. Setting replace = TRUE
allows us to draw from our population of eight marbles each time. This way we can easily simulate our marble-drawing experiment.
draw_results <- sample(marbles, size = 10000, replace = TRUE)
counts <- table(draw_results)
prop.table(counts)
## draw_results
## blue red white
## 0.3692 0.5045 0.1263
This is very close to our “true” probability!
To summarize our process:
marbles
What if we want to compute the probability of getting two marbles of the same color if we make two draws with replacement? We haven’t discussed how to compute this theoretically yet, but this is what computers are good at.
Before we do this, what is your guess as to what the probability will be?
We’ll still use sample()
to run our simulation many times, but we’ll use dplyr
functions to compute the relative frequencies.
two_draw_results <- tibble(
draw_1 = sample(marbles, size = 10000, replace = TRUE),
draw_2 = sample(marbles, size = 10000, replace = TRUE)
)
two_draw_results
## # A tibble: 10,000 x 2
## draw_1 draw_2
## <chr> <chr>
## 1 red red
## 2 red red
## 3 blue blue
## 4 red white
## 5 white blue
## 6 blue blue
## 7 white white
## 8 red red
## 9 red red
## 10 blue red
## # … with 9,990 more rows
How can we add a variable to two_draw_results
to see if draw_1
and draw_2
match?
two_draw_results <- two_draw_results %>%
mutate(color_match = draw_1 == draw_2)
two_draw_results
## # A tibble: 10,000 x 3
## draw_1 draw_2 color_match
## <chr> <chr> <lgl>
## 1 red red TRUE
## 2 red red TRUE
## 3 blue blue TRUE
## 4 red white FALSE
## 5 white blue FALSE
## 6 blue blue TRUE
## 7 white white TRUE
## 8 red red TRUE
## 9 red red TRUE
## 10 blue red FALSE
## # … with 9,990 more rows
All that remains is to compute the relative frequency of the observed outcomes from our many simulations.
two_draw_results %>%
count(color_match) %>%
mutate(proportion = n / sum(n))
## # A tibble: 2 x 3
## color_match n proportion
## * <lgl> <int> <dbl>
## 1 FALSE 6026 0.603
## 2 TRUE 3974 0.397
Suppose you roll two fair six-sided dice. Which has a higher probability: the square of dice roll 1 is equal to dice roll 2; or the absolute value of the difference between dice roll 1 and dice roll 2 is equal to 4.
Perform a simulation to compute this empirical probability.
Write down your guess to the answer before you calculate it.
roll_results <- tibble(
die_1 = sample(1:6, size = 10000, replace = TRUE),
die_2 = sample(1:6, size = 10000, replace = TRUE)
)
roll_results
## # A tibble: 10,000 x 2
## die_1 die_2
## <int> <int>
## 1 5 6
## 2 4 3
## 3 5 6
## 4 2 1
## 5 1 6
## 6 4 2
## 7 6 3
## 8 5 1
## 9 2 4
## 10 3 5
## # … with 9,990 more rows
roll_results <- roll_results %>%
mutate(event_1 = die_1 ^ 2 == die_2,
event_2 = abs(die_1 - die_2) == 4)
roll_results
## # A tibble: 10,000 x 4
## die_1 die_2 event_1 event_2
## <int> <int> <lgl> <lgl>
## 1 5 6 FALSE FALSE
## 2 4 3 FALSE FALSE
## 3 5 6 FALSE FALSE
## 4 2 1 FALSE FALSE
## 5 1 6 FALSE FALSE
## 6 4 2 FALSE FALSE
## 7 6 3 FALSE FALSE
## 8 5 1 FALSE TRUE
## 9 2 4 TRUE FALSE
## 10 3 5 FALSE FALSE
## # … with 9,990 more rows
roll_results %>%
summarise(event_1_prob = mean(event_1),
event_2_prob = mean(event_2))
## # A tibble: 1 x 2
## event_1_prob event_2_prob
## <dbl> <dbl>
## 1 0.0556 0.112