Data Visualization II

Song of the Day

Moldau - Smetana

Main Ideas

There are different types of variables.
Visualizations and summaries of variables must be consistent with the variable type.

Coming Up

Homework #01 goes live tonight
Lab #02 tomorrow
Accept invitation to the GitHub organization

Lecture Notes and Exercises

Load the tidyverse package.

library(tidyverse)

There are two types of variables: numeric and categorical.

Types of variables

Numeric variables can be classified as either continuous or discrete. A continuous numeric variable can take any value in an interval, so there are infinitely many possible values between any two values. A discrete numeric variable takes values from a countable set, typically counts.

height: continuous numeric
number of siblings: discrete numeric

Categorical variables can be classified as either nominal or ordinal. Ordinal variables have a natural ordering; nominal variables do not.

hair color: nominal categorical
education: ordinal categorical
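
As a rough sketch of how these four types might be stored in R: continuous values as doubles, counts as integers, nominal categories as factors, and ordinal categories as ordered factors. The example values below are invented for illustration and are not part of the course data.

```r
# Hypothetical example values, invented for illustration
height    <- c(170.2, 165.5, 180.1)                 # continuous numeric (double)
siblings  <- c(0L, 2L, 1L)                          # discrete numeric (integer)
hair      <- factor(c("brown", "black", "brown"))   # nominal categorical
education <- factor(c("high school", "college", "graduate"),
                    levels = c("high school", "college", "graduate"),
                    ordered = TRUE)                 # ordinal categorical

is.double(height)             # is height stored as a double?
is.integer(siblings)          # is siblings stored as an integer?
is.ordered(hair)              # nominal factors carry no ordering
is.ordered(education)         # ordinal factors do
education[1] < education[3]   # comparisons follow the level order
```

Storing an ordinal variable as an ordered factor means plots and summaries list its levels in their natural order rather than alphabetically.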

Numeric Variables

To describe the distribution of a numeric variable we will use the properties below.

shape
- skewness: right-skewed, left-skewed, symmetric
- modality: unimodal, bimodal, multimodal, uniform
center: mean (mean()), median (median())
spread: range (range()), standard deviation (sd()), interquartile range (IQR())
outliers: observations outside the pattern of the data
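
A minimal sketch of the center and spread functions listed above, applied to a small made-up vector (the values are hypothetical, chosen so that one unusually large observation is present):

```r
# Hypothetical sale prices in thousands of dollars, invented for illustration
prices <- c(150, 180, 200, 210, 230, 250, 900)

mean(prices)          # pulled upward by the one large value
median(prices)        # resistant to the large value
range(prices)         # returns the minimum and maximum, not their difference
diff(range(prices))   # the range expressed as a single number
sd(prices)            # standard deviation
IQR(prices)           # interquartile range
```

In a right-skewed sample like this one the mean exceeds the median, which is one reason the median and IQR are often preferred for skewed data.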

We will continue our investigation of home prices in Minneapolis, Minnesota.

mn_homes <- read_csv("data/mn_homes.csv")

Add a call to glimpse() to the code chunk below, then classify each of the following variables as continuous numeric, discrete numeric, ordinal categorical, or nominal categorical.

area: continuous numeric
beds: discrete numeric
community: nominal categorical

glimpse(mn_homes)

## Rows: 495
## Columns: 13
## $ saleyear      <dbl> 2012, 2014, 2005, 2010, 2010, 2013, 2011, 2007, 2013, 2…
## $ salemonth     <dbl> 6, 7, 7, 6, 2, 9, 1, 9, 10, 6, 7, 8, 5, 2, 7, 6, 10, 6,…
## $ salesprice    <dbl> 690467.0, 235571.7, 272507.7, 277767.5, 148324.1, 24287…
## $ area          <dbl> 3937, 1440, 1835, 2016, 2004, 2822, 2882, 1979, 3140, 3…
## $ beds          <dbl> 5, 2, 2, 3, 3, 3, 4, 3, 4, 3, 3, 3, 2, 3, 3, 6, 2, 3, 2…
## $ baths         <dbl> 4, 1, 1, 2, 1, 3, 3, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1…
## $ stories       <dbl> 2.5, 1.7, 1.7, 2.5, 1.0, 2.0, 1.7, 1.5, 1.5, 2.5, 1.0, …
## $ yearbuilt     <dbl> 1907, 1919, 1913, 1910, 1956, 1934, 1951, 1929, 1940, 1…
## $ neighborhood  <chr> "Lowry Hill", "Cooper", "Hiawatha", "King Field", "Shin…
## $ community     <chr> "Calhoun-Isles", "Longfellow", "Longfellow", "Southwest…
## $ lotsize       <dbl> 6192, 5160, 5040, 4875, 5060, 6307, 6500, 5600, 6350, 7…
## $ numfireplaces <dbl> 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0…
## $ fireplace     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, T…

We can use a histogram to summarize a numeric variable.

ggplot(data = mn_homes,
       mapping = aes(x = salesprice)) +
  geom_histogram(bins = 25) +
  labs(title = "Histogram of home sale prices in Minneapolis, Minnesota",
       subtitle = "2005 - 2015",
       x = "Sales Price")

The bins argument sets the number of equal-width bars on the x axis. Try a few different values: the apparent shape of the distribution can change with the number of bins.
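
To see what a bin count does, here is a base R sketch that splits a small made-up vector into equal-width intervals and counts the observations in each. It only approximates the idea behind geom_histogram(); the exact bin boundaries ggplot2 chooses may differ, and the data are hypothetical.

```r
# Hypothetical values, invented for illustration
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)

# cut() splits the range of x into equal-width intervals;
# table() counts how many observations fall in each interval
counts_3 <- table(cut(x, breaks = 3))
counts_8 <- table(cut(x, breaks = 8))

counts_3   # few wide bins: a coarse picture
counts_8   # more narrow bins: more detail, including empty bins
```

With few bins the isolated value at 9 is lumped into a wide interval; with more bins the gap between it and the rest of the data becomes visible.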

A density plot is another option. Roughly speaking, it replaces the bars of a histogram with a smooth curve (a kernel density estimate), scaled so that the area under the curve is 1.

ggplot(data = mn_homes,
       mapping = aes(x = salesprice)) +
  geom_density() +
  labs(title = "Density plot of home sale prices in Minneapolis, Minnesota",
       subtitle = "2005 - 2015",
       x = "Sales Price")

Side-by-side boxplots are helpful to visualize the distribution of a numeric variable across the levels of a categorical variable.

ggplot(data = mn_homes,
       mapping = aes(x = community, y = salesprice)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Distribution of home sale prices by community",
       subtitle = "2005 - 2015",
       x = "Community", y = "Sales Price")

Question: What is coord_flip() doing in the code chunk above? Try removing it to see.

The coord_flip() function swaps the x and y axes, so the communities appear on the vertical axis and the sale prices on the horizontal axis. This keeps the long community names from overlapping.

Categorical Variables

Bar plots allow us to visualize categorical variables.

ggplot(data = mn_homes,
       mapping = aes(x = community)) + 
  geom_bar() +
  coord_flip() +
  labs(title = "Bar plot of homes sold in each community",
       x = "Community")

Segmented bar plots can be used to visualize two categorical variables.

ggplot(data = mn_homes,
       mapping = aes(x = community, fill = fireplace)) + 
  geom_bar() +
  coord_flip() +
  labs(title = "Segmented bar plot of homes sold in each community",
       x = "Community", fill = "Fireplace")

ggplot(data = mn_homes, 
       mapping = aes(x = community, fill = fireplace)) + 
  geom_bar(position = "fill") +
  coord_flip() +
  labs(title = "Segmented bar plot of homes sold in each community",
       subtitle = "Minneapolis, MN 2005 - 2015",
       x = "Community", fill = "Fireplace")

Question: Which of the above two visualizations should be preferred?

In the first visualization it is hard to compare the proportion of homes in each community that have a fireplace, because the length of each bar varies with the number of homes sold in that community. The second visualization scales every bar to the same length, which makes it much easier to see which communities have a higher or lower proportion of homes sold with a fireplace. We can improve this visualization further by ordering the communities in descending order of fireplace proportion rather than alphabetically.
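
One way the ordering by fireplace proportion might be prepared, sketched in base R on a small made-up data frame. The toy_homes data and its values are hypothetical stand-ins for mn_homes.

```r
# Hypothetical toy data, invented for illustration (not the real mn_homes)
toy_homes <- data.frame(
  community = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  fireplace = c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Proportion of homes with a fireplace in each community
# (the mean of a logical vector is the proportion of TRUE values)
prop_fireplace <- tapply(toy_homes$fireplace, toy_homes$community, mean)
sort(prop_fireplace, decreasing = TRUE)

# Reorder the factor levels by descending fireplace proportion
toy_homes$community <- reorder(toy_homes$community, toy_homes$fireplace,
                               FUN = function(v) -mean(v))
levels(toy_homes$community)
```

A factor reordered this way can be mapped to x in aes() in place of the original column. With coord_flip() the first level is drawn at the bottom of the plot, so the level order may need to be reversed to put the highest proportion at the top.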

There is something wrong with each of the plots below. Run the code for each plot, read the error, then identify and fix the problem.

Plot #1

In the code below, shape = 21 and size = 0.85 are fixed values rather than variables in the data, so they should be set outside mapping = aes().

# incorrect
ggplot(data = mn_homes) + 
  geom_point(mapping = aes(x = lotsize, y = salesprice,
                           shape = 21, size = 0.85))

# correct
ggplot(data = mn_homes) +
  geom_point(mapping = aes(x = lotsize, y = salesprice),
             shape = 21, size = 0.85)

Plot #2

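Here, we are missing the mapping = aes(...). Without it, R looks for objects named lotsize and area in the workspace rather than for columns of mn_homes, and the code fails with an "object not found" error.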

# incorrect
ggplot(data = mn_homes) + 
  geom_point(x = lotsize, y = area, shape = 21, size = .85)

# correct
ggplot(data = mn_homes) + 
  geom_point(mapping = aes(x = lotsize, y = area),
             shape = 21, size = .85)

Plot #3

color = community maps a variable in the data to an aesthetic, so it belongs inside mapping = aes().

# incorrect
ggplot(data = mn_homes) +
  geom_point(mapping = aes(x = lotsize, y = area),
             color = community, size = .85)

# correct
ggplot(data = mn_homes) +
  geom_point(mapping = aes(x = lotsize, y = area, color = community),
             size = .85)

Plot #4

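We have a small typo: 1otsize (starting with the digit one) should be lotsize (starting with a lowercase L). A variable name cannot begin with a digit, so R reports a syntax error.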

# incorrect
ggplot(data = mn_homes) +
  geom_point(mapping = aes(x = 1otsize, y = area))

# correct
ggplot(data = mn_homes) +
  geom_point(mapping = aes(x = lotsize, y = area))
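
The error in Plot #4 comes from R's parser rather than from ggplot2. A small base R sketch shows this: the text 1otsize cannot be parsed as a single name. The result variable and the "parse error" string are just illustrative choices.

```r
# Try to parse the mistyped name; parse() fails because "1" is read as a
# number and "otsize" as an unexpected symbol right after it
result <- tryCatch(parse(text = "1otsize"),
                   error = function(e) "parse error")
result

# make.names() shows how R would turn the text into a syntactically valid name
make.names("1otsize")

# The correctly spelled name parses without complaint
parse(text = "lotsize")
```

Because the failure happens at parse time, the error message mentions an unexpected symbol and never refers to mn_homes or its columns.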

General principles for effective data visualization

keep it simple
use color effectively
tell a story

Why is data visualization important? We will illustrate using the datasaurus_dozen data from the datasauRus package.

datasaurus_dozen <- read_csv("data/datasaurus_dozen.csv")

glimpse(datasaurus_dozen)

## Rows: 1,846
## Columns: 3
## $ dataset <chr> "dino", "dino", "dino", "dino", "dino", "dino", "dino", "dino…
## $ x       <dbl> 55.3846, 51.5385, 46.1538, 42.8205, 40.7692, 38.7179, 35.6410…
## $ y       <dbl> 97.1795, 96.0256, 94.4872, 91.4103, 88.3333, 84.8718, 79.8718…

The code below calculates the correlation, mean of y, mean of x, standard deviation of y, and standard deviation of x for each of the 13 datasets.

Question: What do you notice?

For all thirteen datasets the correlation coefficient, mean of y, mean of x, standard deviation of x, and standard deviation of y are nearly identical. The summary statistics alone give no hint of how different the datasets are.

datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(r = cor(x, y),
            mean_y = mean(y),
            mean_x = mean(x),
            sd_x = sd(x),
            sd_y = sd(y))

## # A tibble: 13 x 6
##    dataset          r mean_y mean_x  sd_x  sd_y
##    <chr>        <dbl>  <dbl>  <dbl> <dbl> <dbl>
##  1 away       -0.0641   47.8   54.3  16.8  26.9
##  2 bullseye   -0.0686   47.8   54.3  16.8  26.9
##  3 circle     -0.0683   47.8   54.3  16.8  26.9
##  4 dino       -0.0645   47.8   54.3  16.8  26.9
##  5 dots       -0.0603   47.8   54.3  16.8  26.9
##  6 h_lines    -0.0617   47.8   54.3  16.8  26.9
##  7 high_lines -0.0685   47.8   54.3  16.8  26.9
##  8 slant_down -0.0690   47.8   54.3  16.8  26.9
##  9 slant_up   -0.0686   47.8   54.3  16.8  26.9
## 10 star       -0.0630   47.8   54.3  16.8  26.9
## 11 v_lines    -0.0694   47.8   54.3  16.8  26.9
## 12 wide_lines -0.0666   47.8   54.3  16.8  26.9
## 13 x_shape    -0.0656   47.8   54.3  16.8  26.9

Let’s visualize the relationships. Each panel shows one dataset.

ggplot(data = datasaurus_dozen,
       mapping = aes(x = x, y = y)) +
  geom_point(size = .5) +
  facet_wrap(~ dataset)
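
The same point can be checked without the datasaurus_dozen file. The sketch below uses Anscombe's quartet, a separate classic dataset that ships with base R as anscombe; it is not part of the original notes and is used here only as a self-contained stand-in. The summary_stats name is an illustrative choice.

```r
# anscombe has columns x1..x4 and y1..y4: four small (x, y) datasets
summary_stats <- data.frame(
  dataset = 1:4,
  r       = sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                                        anscombe[[paste0("y", i)]])),
  mean_x  = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y  = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]]))
)

# The four rows are nearly identical after rounding
round(summary_stats, 2)
```

Plotting the four (x, y) pairs, for example with plot(anscombe$x2, anscombe$y2), reveals very different patterns behind these matching summaries.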

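How each aesthetic behaves when it is mapped to a discrete or a continuous variable:

aesthetic   discrete                    continuous
color       rainbow of colors           gradient
size        discrete steps              point area increases with value (ggplot2 default)
shape       different shape for each    does not work (ggplot2 raises an error)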

Practice

Modify the code outline to create a faceted histogram examining the distribution of year built within each community.

When you are finished, remove eval = FALSE and knit the file to see the changes.

ggplot(data = mn_homes, mapping = aes(x = yearbuilt)) +
  geom_histogram(bins = 35) +
  facet_wrap(~ community) +
  labs(x = "Year built",
       title = "Distribution of year built for homes sold in Minneapolis, MN",
       subtitle = "faceted by community")