Song of the Day

Main Ideas

Coming Up

“The simple graph has brought more information to the data analyst’s mind than any other device” - John Tukey

Lecture Notes and Exercises

Before we start the exercise, we need to configure git so that RStudio can communicate with GitHub. This requires two pieces of information: your email address and your GitHub username.
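One way to do this from the R console is with the usethis package, sketched below. This assumes usethis is installed. Both values are placeholders to replace with your own, and the call changes your user-level git configuration.

```r
# Assumes the usethis package is installed: install.packages("usethis")
library(usethis)

# Placeholder values: replace with your own GitHub username and email.
use_git_config(
  user.name = "your-github-username",
  user.email = "your-email@example.com"
)
```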

Load the tidyverse package. Recall, a package is just a bundle of shareable code.

library(tidyverse)

Exploratory data analysis (EDA) is an approach to analyzing datasets in order to summarize the main characteristics, often with visual representations of the data (today). We can also calculate summary statistics and perform data wrangling, manipulation, and transformation (next week).

We will use ggplot2 to construct visualizations. The gg in ggplot2 stands for “grammar of graphics”, a framework that lets us describe the components of a graphic and build up an effective visualization layer by layer.

Minneapolis Housing Data

We will introduce visualization using data on single-family homes sold in Minneapolis, Minnesota between 2005 and 2015.

Question: What happens when you click the green arrow in the code chunk below? What changes in the “Environment” pane?

mn_homes <- read_csv("data/mn_homes.csv")
glimpse(mn_homes)
## Rows: 495
## Columns: 13
## $ saleyear      <dbl> 2012, 2014, 2005, 2010, 2010, 2013, 2011, 2007, 2013, 2…
## $ salemonth     <dbl> 6, 7, 7, 6, 2, 9, 1, 9, 10, 6, 7, 8, 5, 2, 7, 6, 10, 6,…
## $ salesprice    <dbl> 690467.0, 235571.7, 272507.7, 277767.5, 148324.1, 24287…
## $ area          <dbl> 3937, 1440, 1835, 2016, 2004, 2822, 2882, 1979, 3140, 3…
## $ beds          <dbl> 5, 2, 2, 3, 3, 3, 4, 3, 4, 3, 3, 3, 2, 3, 3, 6, 2, 3, 2…
## $ baths         <dbl> 4, 1, 1, 2, 1, 3, 3, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1…
## $ stories       <dbl> 2.5, 1.7, 1.7, 2.5, 1.0, 2.0, 1.7, 1.5, 1.5, 2.5, 1.0, …
## $ yearbuilt     <dbl> 1907, 1919, 1913, 1910, 1956, 1934, 1951, 1929, 1940, 1…
## $ neighborhood  <chr> "Lowry Hill", "Cooper", "Hiawatha", "King Field", "Shin…
## $ community     <chr> "Calhoun-Isles", "Longfellow", "Longfellow", "Southwest…
## $ lotsize       <dbl> 6192, 5160, 5040, 4875, 5060, 6307, 6500, 5600, 6350, 7…
## $ numfireplaces <dbl> 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0…
## $ fireplace     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, T…

Question: What does each row represent? Each column?

Each row represents a house sold in Minneapolis between 2005 and 2015, and each column represents an attribute of the house or its sale (number of beds, sale month, area, etc.).

First Visualization

ggplot() creates the initial base coordinate system that we will add layers to. We first specify the dataset we will use with data = mn_homes. The mapping argument is paired with an aesthetic (aes), which describes how the variables in our dataset should be mapped to the visual properties of the graph.

Question: What does the code chunk below do?

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice))

ggplot() initializes a ggplot object. We specify the input data frame and the plot aesthetics that will be used in all layers. Running the code chunk above produces an empty plot with salesprice on the y-axis and area on the x-axis, because we have not yet added any geom layers.

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point()

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   geom_smooth()

Run ?geom_smooth in the console. What does this function do?

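geom_smooth() adds a smoothed curve to the plot (the smoothed conditional mean of y given x), by default with a shaded confidence band around it. This helps us see patterns when many points overlap.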
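As a small sketch of how the smoother can be adjusted, the code below compares the default curve with a straight-line fit. It uses the mpg data that ships with ggplot2 (not the course data) so that it runs on its own. With fewer than 1,000 observations, the default method is a loess smoother.

```r
library(ggplot2)

# mpg ships with ggplot2: displ = engine size (litres), hwy = highway mpg
p_default <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(formula = y ~ x)  # default smoother with a confidence band

# method = "lm" fits a straight line; se = FALSE hides the confidence band
p_lm <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_default
p_lm
```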

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   geom_smooth() +
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)")

The procedure used to construct plots can be summarized using the code below.

ggplot(data = [dataset], 
       mapping = aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   geom_xxx() + 
  other options

Question: What do you think eval = FALSE is doing in the code chunk above?

The code chunk argument eval controls whether or not the code chunk should be evaluated. Since the code chunk above contains pseudocode, we want to display it, but not evaluate it, so we include eval = FALSE.

Aesthetics

An aesthetic is a visual property of one of the objects in your plot.

We can map a variable in our dataset to a color, a size, a transparency, and so on.

Question: What will the visualization look like below? Write your answer down before running the code.

It will display a scatterplot of salesprice versus area with points colored according to whether or not the house has a fireplace.

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice, color = fireplace)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)")

Question: What about this one?

It will display a scatterplot of salesprice versus area with points shaped differently based on whether or not the house has a fireplace.

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice, shape = fireplace)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)")

Question: This one?

It will display a scatterplot of salesprice versus area with points colored according to whether or not the house has a fireplace and sized according to the lot size.

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice, color = fireplace, 
                     size = lotsize)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)")

Question: Are the above visualizations effective? Why or why not? How might you improve them?

Currently, none of these visualizations are particularly effective. There is a high density of houses with areas less than 3,000 square feet and sale prices under $500,000, so it is hard to see the relationship. Using triangles and circles to denote the presence or absence of a fireplace is a poor aesthetic choice, and sizing the points by lot size means the points run into each other. To improve the first plot, consider adjusting the transparency (alpha), adding a geom_smooth(), adding a more informative title, and fixing the axis labels ($1,000,000 not 1000000).
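A minimal sketch of two of these improvements, transparency and dollar-formatted axis labels, is shown below. It uses the diamonds data that ships with ggplot2 rather than mn_homes so that it runs on its own. The labels come from label_dollar() in the scales package, which is installed along with ggplot2.

```r
library(ggplot2)

# diamonds ships with ggplot2: carat = weight, price = price in US dollars
p_price <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +  # transparent points reveal dense regions
  scale_y_continuous(labels = scales::label_dollar()) +  # $5,000 not 5000
  labs(title = "Price vs. carat of diamonds",
       x = "Weight (carats)", y = "Price (US dollars)")

p_price
```

The same two lines, alpha inside geom_point() and scale_y_continuous(), can be added to the sales price plots above.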

Question: What is the difference between the two plots below?

Use aes to describe how variables in your dataset are mapped to visual properties of the graph. If you want to do customization not based on a variable, use arguments in geom_xxx.

In the first plot below, color = "blue" is included in aes, so it is treated as a mapping between a variable and a visual property of the graph. “blue” is treated as a categorical variable that takes the single value “blue”, and all of the points are given the same color. Change “blue” to something like “not-a-color” in the first plot. What do you notice?

To fix this, move color = "blue" outside of aes, as in the second plot.

ggplot(data = mn_homes) + 
  geom_point(mapping = aes(x = area, y = salesprice, color = "blue"))

ggplot(data = mn_homes) + 
  geom_point(mapping = aes(x = area, y = salesprice), color = "blue")
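One way to check what is happening is to inspect the colors ggplot2 actually uses to draw each layer. The sketch below repeats the two plots with the mpg data that ships with ggplot2 (so it runs without the course data) and uses layer_data() to pull out the drawn values.

```r
library(ggplot2)

# color = "blue" inside aes(): treated as a variable with one value
p_inside <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# color = "blue" outside aes(): a fixed setting for every point
p_outside <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# layer_data() returns the values ggplot2 uses to draw a layer
unique(layer_data(p_inside, 1)$colour)   # a default palette color
unique(layer_data(p_outside, 1)$colour)  # the color we asked for
```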

Mappings in the ggplot function are global, meaning they apply to every layer we add. Mappings in a particular geom_xxx function are local, meaning they apply only to that layer.
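The difference matters as soon as a plot has more than one layer. The sketch below, again using the mpg data that ships with ggplot2, maps color to drive type (drv) first globally and then locally. A straight-line smoother is used only to keep the example simple.

```r
library(ggplot2)

# Global mapping: color = drv applies to both layers,
# so geom_smooth() fits one line per drive type.
p_global <- ggplot(data = mpg,
                   mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Local mapping: color = drv applies only to the points,
# so geom_smooth() fits a single line to all of the data.
p_local <- ggplot(data = mpg,
                  mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_global
p_local
```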

Question: Create a scatterplot using variables of your choosing using the mn_homes data.

ggplot(data = mn_homes,
       mapping = aes(x = lotsize, y = area)) + 
   geom_point()

Question: Modify your scatterplot above by coloring the points for each community.

ggplot(data = mn_homes,
       mapping = aes(x = lotsize, y = area, color = community)) + 
   geom_point()

Faceting

We can use smaller plots to display different subsets of the data using faceting. This is helpful to examine conditional relationships.

Let’s try a few simple examples of faceting. Note that these plots should be improved by careful consideration of labels, aesthetics, etc.

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)") + 
   facet_grid(. ~ beds)

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)") + 
   facet_grid(beds ~ .)

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)") + 
   facet_grid(beds ~ baths)

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)") + 
   facet_wrap(~ community)

facet_grid() creates a two-dimensional grid of plots based on all possible combinations of the specified rows and columns (rows ~ columns). It displays all plots even if some are empty. This function is helpful when we want to investigate a relationship for all possible combinations of two categorical variables. You can include a . instead of a variable to not facet either the rows or columns.

facet_wrap() creates a one-dimensional ribbon of plots (not a grid) and wraps it into two dimensions. This function is helpful when we have a single variable and want to investigate a relationship for each value of this variable.
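To see the difference in how the two functions handle empty panels, the sketch below facets the mpg data that ships with ggplot2 by drive type (drv) and number of cylinders (cyl). Not every combination of the two occurs in the data.

```r
library(ggplot2)

base_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Grid: one panel for every drv x cyl combination, including empty ones
base_plot + facet_grid(drv ~ cyl)

# Wrap: one panel only for combinations that occur in the data
base_plot + facet_wrap(~ drv + cyl)

# Compare the number of possible and observed combinations
n_possible <- length(unique(mpg$drv)) * length(unique(mpg$cyl))
n_observed <- nrow(unique(mpg[, c("drv", "cyl")]))
c(possible = n_possible, observed = n_observed)
```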

Practice

All plots you develop in STA 199 should include an informative title, labeled axes, and any relevant annotations. You should also give careful consideration to aesthetic choices.

Lines of code and narrative should not exceed 80 characters.

  1. Modify the code outline to make the changes described below.

When you are finished, remove eval = FALSE and knit the file to see the changes.

ggplot(data = mn_homes, 
       mapping = aes(x = lotsize, y = salesprice)) + 
   geom_point(color = "green", alpha = 0.5) + 
   labs(title = "Sales price versus lot size for Minneapolis Homes",
        x = "Lot Size (square feet)", y = "Sales Price (USD)")

  2. Modify the code outline to make the changes described below.

When you are finished, remove eval = FALSE and knit the file to see the changes.

ggplot(data = mn_homes, 
       mapping = aes(x = lotsize)) +
  geom_histogram(fill = "green", color = "red") +
  labs(title = "Histogram of lot size for Minneapolis homes",
       x = "Lot size (square feet)", y = "Count")

Question: What is the difference between the color and fill arguments?
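As a hint, for bars and other filled shapes, fill sets the interior color and color sets the outline. The sketch below shows this with the mpg data that ships with ggplot2. The specific colors and the number of bins are arbitrary choices for illustration.

```r
library(ggplot2)

p_hist <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  # fill = inside of each bar, color = outline of each bar
  geom_histogram(bins = 20, fill = "steelblue", color = "white") +
  labs(title = "Histogram of highway fuel economy",
       x = "Highway miles per gallon", y = "Count")

p_hist
```

For points and lines drawn with the default shapes there is no interior to fill, so color alone controls their appearance.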

  3. Develop an effective visualization on your own using the code chunk provided below. Use three variables and at least one aesthetic mapping.

ggplot(data = mn_homes, 
       aes(x = yearbuilt, fill = fireplace)) + 
   geom_density(alpha = 0.50) + 
  labs(x = "Year Built", y = "", fill = "Fireplace",
       title = "Fireplaces through time by community") +
   facet_wrap( ~ community)

Additional Resources