Data Visualization I

Song of the Day

White Squall - Stan Rogers

Main Ideas

Data visualization is an extremely effective way to express information and extract meaning from data.
We can build up an effective visualization systematically layer by layer using a grammar of graphics (ggplot2).

Coming Up

Remember to push lecture notes one week after the lecture date
Accept the invitation to the GitHub organization

“The simple graph has brought more information to the data analyst’s mind than any other device” - John Tukey

Lecture Notes and Exercises

Before we start the exercise, we need to configure git so that RStudio can communicate with GitHub. This requires two pieces of information: your email address and your GitHub username.
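One common way to set these two values from the R console is with the usethis package. The sketch below assumes usethis is installed, and both strings are placeholders to replace with your own details.

```r
library(usethis)

# Placeholders (assumptions): substitute your own GitHub username and the
# email address associated with your GitHub account.
use_git_config(user.name = "your-github-username",
               user.email = "your-email@example.com")
```

This only needs to be run once per computer; it does not need to be part of the lecture notes file.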

Load the tidyverse package. Recall, a package is just a bundle of shareable code.

library(tidyverse)

Exploratory data analysis (EDA) is an approach to analyzing datasets in order to summarize the main characteristics, often with visual representations of the data (today). We can also calculate summary statistics and perform data wrangling, manipulation, and transformation (next week).

We will use ggplot2 to construct visualizations. The gg in ggplot2 stands for “grammar of graphics”, a system or framework that allows us to describe the components of a graphic, building up an effective visualization layer by layer.

Minneapolis Housing Data

We will introduce visualization using data on single-family homes sold in Minneapolis, Minnesota between 2005 and 2015.

Question: What happens when you click the green arrow in the code chunk below? What changes in the “Environment” pane?

mn_homes <- read_csv("data/mn_homes.csv")

glimpse(mn_homes)

## Rows: 495
## Columns: 13
## $ saleyear      <dbl> 2012, 2014, 2005, 2010, 2010, 2013, 2011, 2007, 2013, 2…
## $ salemonth     <dbl> 6, 7, 7, 6, 2, 9, 1, 9, 10, 6, 7, 8, 5, 2, 7, 6, 10, 6,…
## $ salesprice    <dbl> 690467.0, 235571.7, 272507.7, 277767.5, 148324.1, 24287…
## $ area          <dbl> 3937, 1440, 1835, 2016, 2004, 2822, 2882, 1979, 3140, 3…
## $ beds          <dbl> 5, 2, 2, 3, 3, 3, 4, 3, 4, 3, 3, 3, 2, 3, 3, 6, 2, 3, 2…
## $ baths         <dbl> 4, 1, 1, 2, 1, 3, 3, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1…
## $ stories       <dbl> 2.5, 1.7, 1.7, 2.5, 1.0, 2.0, 1.7, 1.5, 1.5, 2.5, 1.0, …
## $ yearbuilt     <dbl> 1907, 1919, 1913, 1910, 1956, 1934, 1951, 1929, 1940, 1…
## $ neighborhood  <chr> "Lowry Hill", "Cooper", "Hiawatha", "King Field", "Shin…
## $ community     <chr> "Calhoun-Isles", "Longfellow", "Longfellow", "Southwest…
## $ lotsize       <dbl> 6192, 5160, 5040, 4875, 5060, 6307, 6500, 5600, 6350, 7…
## $ numfireplaces <dbl> 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0…
## $ fireplace     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, T…

Question: What does each row represent? Each column?

Each row represents one house sold in Minneapolis between 2005 and 2015, and each column represents an attribute of that house (number of beds, sale month, area, etc.).

First Visualization

ggplot() creates the initial base coordinate system that we will add layers to. We first specify the dataset we will use with data = mn_homes. The mapping argument takes an aesthetic mapping built with aes(), which specifies how the variables in our dataset are mapped to the visual properties of the graph.

Question: What does the code chunk below do?

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice))

ggplot() initializes a ggplot object. We specify the input data frame and the plot aesthetics that will be used in all layers. Running the code chunk above produces an empty plot with salesprice on the y-axis and area on the x-axis; no data are drawn because we have not yet added a geom layer.
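To see the "layers" idea concretely, the sketch below counts the layers in a plot before and after adding a geom. It uses a small made-up data frame, toy_homes, as a stand-in for mn_homes so that it runs on its own; the object names are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data standing in for mn_homes (values are made up).
toy_homes <- data.frame(
  area       = c(900, 1400, 1800, 2300, 3100),
  salesprice = c(150000, 210000, 260000, 340000, 520000)
)

# Data and aesthetics only: the axes are set up, but nothing is drawn yet.
base_plot <- ggplot(data = toy_homes,
                    mapping = aes(x = area, y = salesprice))
length(base_plot$layers)

# Adding a geom adds a layer; the points inherit the x and y mappings.
point_plot <- base_plot + geom_point()
length(point_plot$layers)
```

The first call to length() should return 0 and the second 1: the base object holds the data and mappings, and each geom_xxx() call adds one layer on top.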

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point()

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   geom_smooth()

Run ?geom_smooth in the console. What does this function do?

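geom_smooth() adds a smoothed conditional mean to the plot: a curve estimating the average salesprice at each value of area, drawn with a shaded confidence band by default. It helps reveal a trend that is hard to see when many points overlap.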
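With fewer than 1,000 observations, geom_smooth() uses a loess smoother by default. The sketch below shows how the method and se arguments change that behavior. It uses made-up toy data rather than mn_homes, and choosing a straight-line fit here is only for illustration.

```r
library(ggplot2)

# Hypothetical toy data (made-up values), not the mn_homes data.
toy_homes <- data.frame(
  area       = c(900, 1100, 1400, 1600, 1800, 2100, 2300, 2700, 3100, 3600),
  salesprice = c(150000, 170000, 210000, 205000, 260000,
                 300000, 340000, 410000, 520000, 610000)
)

# method = "lm" fits a straight line instead of the default smoother;
# se = FALSE hides the shaded confidence band.
line_plot <- ggplot(data = toy_homes,
                    mapping = aes(x = area, y = salesprice)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

line_plot
```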

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   geom_smooth() +
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)")

The procedure used to construct plots can be summarized using the code below.

ggplot(data = [dataset], 
       mapping = aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   geom_xxx() + 
   [other options]

Question: What do you think eval = FALSE is doing in the code chunk above?

The code chunk argument eval controls whether or not the code chunk should be evaluated. Since the code chunk above contains pseudocode, we want to display it, but not evaluate it, so we include eval = FALSE.

Aesthetics

An aesthetic is a visual property of one of the objects in your plot.

shape
color
size
alpha (transparency)

We can map a variable in our dataset to a color, a size, a transparency, and so on.

Question: What will the visualization look like below? Write your answer down before running the code.

It will display a scatterplot of salesprice versus area with points colored according to whether or not the house has a fireplace.

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice, color = fireplace)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)")

Question: What about this one?

It will display a scatterplot of salesprice versus area with points shaped differently based on whether or not the house has a fireplace.

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice, shape = fireplace)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)")

Question: This one?

It will display a scatterplot of salesprice versus area with points colored according to whether or not the house has a fireplace and sized according to the lot size.

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice, color = fireplace, 
                     size = lotsize)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)")

Question: Are the above visualizations effective? Why or why not? How might you improve them?

Currently, none of these visualizations is particularly effective. There is a high density of houses with areas under 3,000 square feet and sale prices under $500,000, so it is hard to see the relationship. Using triangles and circles to denote the presence or absence of a fireplace is a poor aesthetic choice, and sizing the points by lot size makes the points run into each other. To improve the first plot, consider adjusting the transparency (alpha), adding a geom_smooth() layer, writing a more informative title, and fixing the axis labels ($1,000,000, not 1000000).
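Two of these improvements, transparency and dollar-formatted axis labels, might look like the sketch below. It uses made-up toy data rather than mn_homes, and relies on label_dollar() from the scales package, which is installed along with ggplot2. The alpha value of 0.5 is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical toy data (made-up values), not the mn_homes data.
toy_homes <- data.frame(
  area       = c(900, 1400, 1800, 2300, 3100, 4200),
  salesprice = c(150000, 210000, 260000, 340000, 520000, 1000000),
  fireplace  = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)

improved_plot <- ggplot(data = toy_homes,
                        mapping = aes(x = area, y = salesprice,
                                      color = fireplace)) +
  # alpha is set outside aes(), so every point is equally transparent.
  geom_point(alpha = 0.5) +
  # Format y-axis labels as currency, e.g. $1,000,000 instead of 1000000.
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(title = "Sales price vs. area, by presence of a fireplace",
       x = "Area (square feet)", y = "Sales price", color = "Fireplace")

improved_plot
```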

Question: What is the difference between the two plots below?

Use aes to describe how variables in your dataset are mapped to visual properties of the graph. If you want to do customization not based on a variable, use arguments in geom_xxx.

In the first plot below, color = "blue" is included in aes, so it is treated as a mapping between a variable and a visual property of the graph. “blue” is treated as a categorical variable that takes a single value, “blue”, so every point is given the same default color, which is not necessarily blue. Change “blue” to something like “not-a-color” in the first plot. What do you notice?

To fix this, remove color = "blue" from aes and pass it as an argument to geom_point instead, as in the second plot.

ggplot(data = mn_homes) + 
  geom_point(mapping = aes(x = area, y = salesprice, color = "blue"))

ggplot(data = mn_homes) + 
  geom_point(mapping = aes(x = area, y = salesprice), color = "blue")

Mappings specified in the ggplot function are global, meaning they apply to every layer we add. Mappings specified in a particular geom_xxx function are local, meaning they apply only to that layer.
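The difference between global and local mappings matters once a plot has more than one layer. The sketch below builds the same two-layer plot both ways, using made-up toy data rather than mn_homes; the object names are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data (made-up values), not the mn_homes data.
toy_homes <- data.frame(
  area       = c(900, 1100, 1400, 1800, 2100, 2300, 2700, 3100),
  salesprice = c(150000, 170000, 210000, 260000,
                 300000, 340000, 410000, 520000),
  fireplace  = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# Global: color is mapped in ggplot(), so both layers use it and
# geom_smooth() fits one line per fireplace group.
global_plot <- ggplot(data = toy_homes,
                      mapping = aes(x = area, y = salesprice,
                                    color = fireplace)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Local: color is mapped only in geom_point(), so the points are colored
# but geom_smooth() fits a single line to all of the data.
local_plot <- ggplot(data = toy_homes,
                     mapping = aes(x = area, y = salesprice)) +
  geom_point(mapping = aes(color = fireplace)) +
  geom_smooth(method = "lm", se = FALSE)
```

Printing global_plot and local_plot side by side shows two fitted lines in the first and one in the second, even though the points look the same.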

Question: Create a scatterplot using variables of your choosing using the mn_homes data.

ggplot(data = mn_homes,
       mapping = aes(x = lotsize, y = area)) + 
   geom_point()

Question: Modify your scatterplot above by coloring the points for each community.

ggplot(data = mn_homes,
       mapping = aes(x = lotsize, y = area, color = community)) + 
   geom_point()

Faceting

We can use smaller plots to display different subsets of the data using faceting. This is helpful to examine conditional relationships.

Let’s try a few simple examples of faceting. Note that these plots should be improved by careful consideration of labels, aesthetics, etc.

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)") + 
   facet_grid(. ~ beds)

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)") + 
   facet_grid(beds ~ .)

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)") + 
   facet_grid(beds ~ baths)

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)") + 
   facet_wrap(~ community)

facet_grid() creates a two-dimensional grid of plots based on all possible combinations of the specified rows and columns (rows ~ columns). It displays all plots even if some are empty. This function is helpful when we want to investigate a relationship for all possible combinations of two categorical variables. Use a . in place of a variable to skip faceting on the rows or on the columns.

2d grid
rows ~ cols
use . for no faceting

facet_wrap() creates a one-dimensional ribbon of plots (not a grid) and wraps them into two dimensions. This function is helpful when we have a single variable and want to investigate a relationship for all possible values of this variable.

1d ribbon wrapped into 2d
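The sketch below illustrates the "empty panels" difference between the two functions by counting the panels each one produces. It uses made-up toy data in which no home has 2 beds and 2 baths. Note that facet_wrap() is given two variables here only to make the comparison direct, and that the panel count is read from the built plot's layout table, an internal detail that may change between ggplot2 versions.

```r
library(ggplot2)

# Hypothetical toy data (made-up values): no home has 2 beds and 2 baths.
toy_homes <- data.frame(
  area       = c(900, 1400, 1800, 2300, 3100),
  salesprice = c(150000, 210000, 260000, 340000, 520000),
  beds       = c(2, 2, 3, 3, 3),
  baths      = c(1, 1, 1, 2, 2)
)

base_plot <- ggplot(data = toy_homes,
                    mapping = aes(x = area, y = salesprice)) +
  geom_point()

# Grid: one panel for every beds-by-baths combination, including empty ones.
grid_plot <- base_plot + facet_grid(beds ~ baths)

# Wrap: one panel for each combination that actually occurs in the data.
wrap_plot <- base_plot + facet_wrap(~ beds + baths)

# Count the panels in each plot.
n_grid_panels <- nrow(ggplot_build(grid_plot)$layout$layout)
n_wrap_panels <- nrow(ggplot_build(wrap_plot)$layout$layout)
c(grid = n_grid_panels, wrap = n_wrap_panels)
```

The grid has 4 panels (2 values of beds times 2 values of baths, one of them empty), while the wrapped version has 3.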

Practice

All plots you develop in STA 199 should include an informative title, labeled axes, and any relevant annotations. You should also give careful consideration to aesthetic choices.

Lines of code and narrative should not exceed 80 characters.

Modify the code outline to make the changes described below.

Change the color of the points to green.
Add an alpha aesthetic to make the points more transparent.
Add labels for the x axis and y axis.
Add an informative title.

When you are finished, remove eval = FALSE and knit the file to see the changes.

ggplot(data = mn_homes, 
       mapping = aes(x = lotsize, y = salesprice)) + 
   geom_point(color = "green", alpha = 0.5) + 
   labs(title = "Sales price versus lot size for Minneapolis Homes",
        x = "Lot Size (square feet)", y = "Sales Price (USD)")

Modify the code outline to make the changes described below.

Create a histogram of lotsize.
Modify the histogram by adding fill = "green" inside the geom_histogram() function.
Modify the histogram by adding color = "red" inside the geom_histogram() function.
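A minimal sketch of what the finished histogram might look like is given below. It uses made-up lot sizes rather than mn_homes so that it runs on its own; with the course data, replace toy_homes with mn_homes. The choice of bins = 10 is an assumption made for this small example, not part of the exercise.

```r
library(ggplot2)

# Hypothetical toy data: made-up lot sizes, not the mn_homes data.
toy_homes <- data.frame(
  lotsize = c(4200, 4800, 5000, 5100, 5400, 5600,
              6000, 6300, 6500, 7200, 8100, 9500)
)

lot_hist <- ggplot(data = toy_homes, mapping = aes(x = lotsize)) +
  # fill sets the interior of the bars; color sets their outline.
  # Both are fixed values, so they go outside aes().
  geom_histogram(bins = 10, fill = "green", color = "red") +
  labs(title = "Distribution of lot sizes (toy data)",
       x = "Lot size (square feet)", y = "Number of homes")

lot_hist
```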