Main Ideas
- There are different types of variables.
- Visualizations and summaries of variables must be consistent with the variable type.
Load the tidyverse package.
library(tidyverse)
There are two types of variables: numeric and categorical.
Numeric variables can be classified as either continuous or discrete. Continuous numeric variables have an infinite number of values between any two values. Discrete numeric variables have a countable number of values.
Categorical variables can be classified as either nominal or ordinal. Ordinal variables have a natural ordering; nominal variables do not.
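As a rough illustration of how these four types might be stored in R, the sketch below builds a small made-up tibble. The object name toy, its column names, and its values are all hypothetical and are not taken from the course data. Storing an ordinal variable as an ordered factor is one common convention, not the only option.

```r
library(tidyverse)

# Hypothetical toy data, for illustration only.
toy <- tibble(
  height_cm  = c(170.2, 165.5, 180.1),               # numeric, continuous
  n_siblings = c(0L, 2L, 1L),                        # numeric, discrete
  eye_color  = factor(c("brown", "blue", "brown")),  # categorical, nominal
  shirt_size = factor(c("M", "S", "L"),
                      levels = c("S", "M", "L"),
                      ordered = TRUE)                # categorical, ordinal
)

glimpse(toy)

# Only the ordered factor supports comparisons such as "<".
toy$shirt_size < "L"
```

Note that the storage type alone does not settle the question: a column stored as a double, such as beds in a housing dataset, may still be best described as discrete.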
To describe the distribution of a numeric variable, we will use the properties below.
- center: mean (mean()), median (median())
- spread: range (range()), standard deviation (sd()), interquartile range (IQR())
We will continue our investigation of home prices in Minneapolis, Minnesota.
mn_homes <- read_csv("data/mn_homes.csv")
Add glimpse() to the code chunk below and identify each of the following variables as numeric continuous, numeric discrete, categorical ordinal, or categorical nominal.
glimpse(mn_homes)
## Rows: 495
## Columns: 13
## $ saleyear <dbl> 2012, 2014, 2005, 2010, 2010, 2013, 2011, 2007, 2013, 2…
## $ salemonth <dbl> 6, 7, 7, 6, 2, 9, 1, 9, 10, 6, 7, 8, 5, 2, 7, 6, 10, 6,…
## $ salesprice <dbl> 690467.0, 235571.7, 272507.7, 277767.5, 148324.1, 24287…
## $ area <dbl> 3937, 1440, 1835, 2016, 2004, 2822, 2882, 1979, 3140, 3…
## $ beds <dbl> 5, 2, 2, 3, 3, 3, 4, 3, 4, 3, 3, 3, 2, 3, 3, 6, 2, 3, 2…
## $ baths <dbl> 4, 1, 1, 2, 1, 3, 3, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1…
## $ stories <dbl> 2.5, 1.7, 1.7, 2.5, 1.0, 2.0, 1.7, 1.5, 1.5, 2.5, 1.0, …
## $ yearbuilt <dbl> 1907, 1919, 1913, 1910, 1956, 1934, 1951, 1929, 1940, 1…
## $ neighborhood <chr> "Lowry Hill", "Cooper", "Hiawatha", "King Field", "Shin…
## $ community <chr> "Calhoun-Isles", "Longfellow", "Longfellow", "Southwest…
## $ lotsize <dbl> 6192, 5160, 5040, 4875, 5060, 6307, 6500, 5600, 6350, 7…
## $ numfireplaces <dbl> 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0…
## $ fireplace <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, T…
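As a quick, hedged illustration of the summary functions listed earlier, the sketch below applies them to the first five sale prices visible in the glimpse output above. Five values are far too few to describe the full distribution, and the vector name first_prices is introduced here only for illustration.

```r
# First five values of salesprice as printed in the glimpse() output above.
first_prices <- c(690467.0, 235571.7, 272507.7, 277767.5, 148324.1)

mean(first_prices)    # center: arithmetic mean
median(first_prices)  # center: middle value once sorted
range(first_prices)   # spread: smallest and largest value
sd(first_prices)      # spread: standard deviation
IQR(first_prices)     # spread: third quartile minus first quartile
```

For these five values the mean is larger than the median because one price is much higher than the rest. With the full data, the same functions can be applied to the column directly, for example median(mn_homes$salesprice).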
We can use a histogram to summarize a numeric variable.
ggplot(data = mn_homes,
mapping = aes(x = salesprice)) +
geom_histogram(bins = 25) +
labs(title = "Histogram of home sale prices in Minneapolis, Minnesota",
subtitle = "2005 - 2015",
x = "Sales Price")
Change the bins argument to adjust the number of bars of equal width on the x axis.
A density plot is another option. It can be thought of as a smoothed histogram: roughly, we connect the tops of the bars with a smooth curve.
ggplot(data = mn_homes,
mapping = aes(x = salesprice)) +
geom_density() +
labs(title = "Density plot of home sale prices in Minneapolis, Minnesota",
subtitle = "2005 - 2015",
x = "Sales Price")
Side-by-side boxplots are helpful to visualize the distribution of a numeric variable across the levels of a categorical variable.
ggplot(data = mn_homes,
mapping = aes(x = community, y = salesprice)) +
geom_boxplot() +
coord_flip() +
labs(title = "Distribution of home sale prices by community",
subtitle = "2005 - 2015",
x = "Community", y = "Sales Price")
Question: What is coord_flip() doing in the code chunk above? Try removing it to see.
The coord_flip() function flips the coordinates, so the horizontal axis becomes vertical and the vertical axis becomes horizontal.
Bar plots allow us to visualize categorical variables.
ggplot(data = mn_homes,
mapping = aes(x = community)) +
geom_bar() +
coord_flip() +
labs(title = "Bar plot of homes sold in each community",
x = "Community")
Segmented bar plots can be used to visualize two categorical variables.
ggplot(data = mn_homes,
mapping = aes(x = community, fill = fireplace)) +
geom_bar() +
coord_flip() +
labs(title = "Segmented bar plot of homes sold in each community",
x = "Community", fill = "Fireplace")
ggplot(data = mn_homes,
mapping = aes(x = community, fill = fireplace)) +
geom_bar(position = "fill") +
coord_flip() +
labs(title = "Segmented bar plot of homes sold in each community",
subtitle = "Minneapolis, MN 2005 - 2015",
x = "Community", fill = "Fireplace")
Question: Which of the above two visualizations should be preferred?
In the first visualization, it is hard to compare the proportions of homes with a fireplace across communities because the bar lengths vary with the number of homes sold in each community. The second visualization scales every bar to the same length, which makes it much easier to see which communities have a higher or lower proportion of homes sold with a fireplace. We can improve this visualization further by ordering the communities in descending order of fireplace proportion rather than alphabetically.
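One possible way to do that reordering is sketched below with fct_reorder() from the forcats package, which is loaded with the tidyverse. The tibble toy_homes is hypothetical: it reuses three community names from the data, but its fireplace values are made up for illustration and are not taken from mn_homes.

```r
library(tidyverse)

# Made-up rows for illustration only; these are NOT values from mn_homes.
toy_homes <- tibble(
  community = rep(c("Longfellow", "Southwest", "Calhoun-Isles"), each = 4),
  fireplace = c(TRUE, FALSE, FALSE, FALSE,   # Longfellow: 1 of 4
                TRUE, TRUE, FALSE, FALSE,    # Southwest: 2 of 4
                TRUE, TRUE, TRUE, FALSE)     # Calhoun-Isles: 3 of 4
)

# The mean of a logical vector is the proportion of TRUE values,
# so this orders the community levels by fireplace proportion (ascending).
toy_homes <- toy_homes %>%
  mutate(community = fct_reorder(community, fireplace, .fun = mean))

levels(toy_homes$community)

# coord_flip() draws the first level at the bottom, so the community
# with the highest proportion ends up at the top of the plot.
p_ordered <- ggplot(data = toy_homes,
                    mapping = aes(x = community, fill = fireplace)) +
  geom_bar(position = "fill") +
  coord_flip() +
  labs(x = "Community", y = "Proportion", fill = "Fireplace")
p_ordered
```

The same mutate() step could be tried on mn_homes before plotting, assuming fireplace is stored as a logical column as the glimpse output suggests.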
There is something wrong with each of the plots below. Run the code for each plot, read the error, then identify and fix the problem.
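In the code below, shape = 21 and size = 0.85 should not be inside mapping = aes(). They are fixed values rather than variables in mn_homes, so they belong outside aes() as arguments to geom_point().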
# incorrect
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = lotsize, y = salesprice,
shape = 21, size = 0.85))
# correct
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = lotsize, y = salesprice),
shape = 21, size = 0.85)
Here, mapping = aes(...) is missing. Without it, R looks for lotsize and area as standalone objects rather than as columns of mn_homes.
# incorrect
ggplot(data = mn_homes) +
geom_point(x = lotsize, y = area, shape = 21, size = .85)
# correct
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = lotsize, y = area),
shape = 21, size = .85)
color = community belongs inside mapping = aes() because community is a variable in mn_homes.
# incorrect
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = lotsize, y = area),
color = community, size = .85)
# correct
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = lotsize, y = area, color = community),
size = .85)
There is a small typo: 1otsize, which begins with the digit 1, should be lotsize.
# incorrect
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = 1otsize, y = area))
# correct
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = lotsize, y = area))
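Several of the fixes above come down to the difference between mapping an aesthetic to a variable inside aes() and setting it to a fixed value outside aes(). The sketch below illustrates that difference with color. The tibble first_homes is introduced only for illustration and holds the first three lotsize and area values from the glimpse output of mn_homes.

```r
library(tidyverse)

# First three rows of lotsize and area, copied from the glimpse() output of mn_homes.
first_homes <- tibble(
  lotsize = c(6192, 5160, 5040),
  area    = c(3937, 1440, 1835)
)

# Inside aes(): "blue" is treated as a one-level categorical variable,
# so ggplot2 picks a color from its default palette and adds a legend.
p_mapped <- ggplot(data = first_homes) +
  geom_point(mapping = aes(x = lotsize, y = area, color = "blue"))

# Outside aes(): every point is drawn in the fixed color blue.
p_set <- ggplot(data = first_homes) +
  geom_point(mapping = aes(x = lotsize, y = area), color = "blue")

# Colors actually used by each layer.
layer_data(p_mapped)$colour
layer_data(p_set)$colour
```

Unlike the exercises above, the mapped version here does not raise an error; it silently draws the points in a color other than the one requested, which can be harder to spot.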
General principles for effective data visualization
Why is data visualization important? We will illustrate using the datasaurus_dozen data from the datasauRus package.
datasaurus_dozen <- read_csv("data/datasaurus_dozen.csv")
glimpse(datasaurus_dozen)
## Rows: 1,846
## Columns: 3
## $ dataset <chr> "dino", "dino", "dino", "dino", "dino", "dino", "dino", "dino…
## $ x <dbl> 55.3846, 51.5385, 46.1538, 42.8205, 40.7692, 38.7179, 35.6410…
## $ y <dbl> 97.1795, 96.0256, 94.4872, 91.4103, 88.3333, 84.8718, 79.8718…
The code below calculates the correlation, mean of y, mean of x, standard deviation of x, and standard deviation of y for each of the 13 datasets.
Question: What do you notice?
For all thirteen datasets, the correlation coefficient, mean of y, mean of x, standard deviation of x, and standard deviation of y are nearly identical.
datasaurus_dozen %>%
group_by(dataset) %>%
summarize(r = cor(x, y),
mean_y = mean(y),
mean_x = mean(x),
sd_x = sd(x),
sd_y = sd(y))
## # A tibble: 13 x 6
## dataset r mean_y mean_x sd_x sd_y
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 away -0.0641 47.8 54.3 16.8 26.9
## 2 bullseye -0.0686 47.8 54.3 16.8 26.9
## 3 circle -0.0683 47.8 54.3 16.8 26.9
## 4 dino -0.0645 47.8 54.3 16.8 26.9
## 5 dots -0.0603 47.8 54.3 16.8 26.9
## 6 h_lines -0.0617 47.8 54.3 16.8 26.9
## 7 high_lines -0.0685 47.8 54.3 16.8 26.9
## 8 slant_down -0.0690 47.8 54.3 16.8 26.9
## 9 slant_up -0.0686 47.8 54.3 16.8 26.9
## 10 star -0.0630 47.8 54.3 16.8 26.9
## 11 v_lines -0.0694 47.8 54.3 16.8 26.9
## 12 wide_lines -0.0666 47.8 54.3 16.8 26.9
## 13 x_shape -0.0656 47.8 54.3 16.8 26.9
Let’s visualize the relationships.
ggplot(data = datasaurus_dozen,
mapping = aes(x = x, y = y)) +
geom_point(size = .5) +
facet_wrap( ~ dataset)
aesthetic | discrete | continuous |
---|---|---|
color | rainbow of colors | gradient |
size | discrete steps | linear mapping between area and value |
shape | different shape for each | does not work |
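The color and shape rows of the table can be checked with a short sketch. It uses a hypothetical tibble, first_homes, holding the first three rows of four columns copied from the glimpse output of mn_homes; three points are enough to show the behavior, not to draw any conclusion about the data.

```r
library(tidyverse)

# First three rows of selected columns, copied from the glimpse() output of mn_homes.
first_homes <- tibble(
  lotsize    = c(6192, 5160, 5040),
  area       = c(3937, 1440, 1835),
  salesprice = c(690467.0, 235571.7, 272507.7),
  community  = c("Calhoun-Isles", "Longfellow", "Longfellow")
)

base_plot <- ggplot(data = first_homes,
                    mapping = aes(x = lotsize, y = area))

# Discrete variable mapped to color: one distinct color per community.
p_discrete <- base_plot + geom_point(mapping = aes(color = community))

# Continuous variable mapped to color: colors taken from a gradient.
p_continuous <- base_plot + geom_point(mapping = aes(color = salesprice))

# Continuous variable mapped to shape: building the plot raises an error,
# which is captured here instead of stopping the script.
p_shape <- base_plot + geom_point(mapping = aes(shape = salesprice))
shape_result <- tryCatch(ggplot_build(p_shape), error = function(e) e)
inherits(shape_result, "error")
```

Printing p_discrete and p_continuous shows the two kinds of legend: separate keys for the discrete case and a color bar for the continuous case.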
When you are finished, remove eval = FALSE and knit the file to see the changes.
ggplot(data = mn_homes, mapping = aes(x = yearbuilt)) +
geom_histogram(bins = 35) +
facet_wrap(~ community) +
labs(x = "Year built",
title = "Distribution of year built for homes sold in Minneapolis, MN",
subtitle = "faceted by community")