Main Ideas
- Data visualization is an extremely effective way to express information and extract meaning from data.
- We can build up an effective visualization systematically, layer by layer, using a grammar of graphics (ggplot2).
“The simple graph has brought more information to the data analyst’s mind than any other device” - John Tukey
Before we start the exercise, we need to configure git so that RStudio can communicate with GitHub. This requires two pieces of information: your email address and your GitHub username.
Load the tidyverse package. Recall that a package is just a bundle of shareable code.
library(tidyverse)
Exploratory data analysis (EDA) is an approach to analyzing datasets in order to summarize the main characteristics, often with visual representations of the data (today). We can also calculate summary statistics and perform data wrangling, manipulation, and transformation (next week).
We will use ggplot2 to construct visualizations. The gg in ggplot2 stands for “grammar of graphics”, a system or framework that allows us to describe the components of a graphic, building up an effective visualization layer by layer.
We will introduce visualization using data on single-family homes sold in Minneapolis, Minnesota between 2005 and 2015.
Question: What happens when you click the green arrow in the code chunk below? What changes in the “Environment” pane?
mn_homes <- read_csv("data/mn_homes.csv")
glimpse(mn_homes)
## Rows: 495
## Columns: 13
## $ saleyear <dbl> 2012, 2014, 2005, 2010, 2010, 2013, 2011, 2007, 2013, 2…
## $ salemonth <dbl> 6, 7, 7, 6, 2, 9, 1, 9, 10, 6, 7, 8, 5, 2, 7, 6, 10, 6,…
## $ salesprice <dbl> 690467.0, 235571.7, 272507.7, 277767.5, 148324.1, 24287…
## $ area <dbl> 3937, 1440, 1835, 2016, 2004, 2822, 2882, 1979, 3140, 3…
## $ beds <dbl> 5, 2, 2, 3, 3, 3, 4, 3, 4, 3, 3, 3, 2, 3, 3, 6, 2, 3, 2…
## $ baths <dbl> 4, 1, 1, 2, 1, 3, 3, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1…
## $ stories <dbl> 2.5, 1.7, 1.7, 2.5, 1.0, 2.0, 1.7, 1.5, 1.5, 2.5, 1.0, …
## $ yearbuilt <dbl> 1907, 1919, 1913, 1910, 1956, 1934, 1951, 1929, 1940, 1…
## $ neighborhood <chr> "Lowry Hill", "Cooper", "Hiawatha", "King Field", "Shin…
## $ community <chr> "Calhoun-Isles", "Longfellow", "Longfellow", "Southwest…
## $ lotsize <dbl> 6192, 5160, 5040, 4875, 5060, 6307, 6500, 5600, 6350, 7…
## $ numfireplaces <dbl> 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0…
## $ fireplace <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, T…
Question: What does each row represent? Each column?
Each row represents a house sold in Minneapolis between 2005 and 2015, and each column represents an attribute of that house (number of beds, sale month, area, etc.).
ggplot() creates the initial base coordinate system that we will add layers to. We first specify the dataset we will use with data = mn_homes. The mapping argument is paired with an aesthetic (aes), which tells us how the variables in our dataset should be mapped to the visual properties of the graph.
Question: What does the code chunk below do?
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice))
ggplot() initializes a ggplot object. We specify the input data frame and plot aesthetics that will be used in all layers. Running the code chunk above reveals an empty plot with salesprice on the y-axis and area on the x-axis.
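As a minimal sketch of this idea, the code below rebuilds the first five rows of mn_homes by hand (values copied from the glimpse() output above) and checks that a bare ggplot() call carries no layers until a geom is added. The name homes_sample is introduced here for illustration only, and inspecting the layers field directly is an implementation detail of ggplot2, not something the exercise requires.

```r
library(ggplot2)

# First five rows of area and salesprice, copied from the glimpse() output.
# homes_sample is a hypothetical name used only for this sketch.
homes_sample <- data.frame(
  area       = c(3937, 1440, 1835, 2016, 2004),
  salesprice = c(690467.0, 235571.7, 272507.7, 277767.5, 148324.1)
)

# Base plot: data and aesthetics only, so no points are drawn yet
base_plot <- ggplot(data = homes_sample,
                    mapping = aes(x = area, y = salesprice))
n_layers_before <- length(base_plot$layers)

# Adding a geom adds the first layer
point_plot <- base_plot + geom_point()
n_layers_after <- length(point_plot$layers)
```

Printing base_plot shows only the axes, while printing point_plot shows the five points.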
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point()
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
geom_smooth()
Run ?geom_smooth in the console. What does this function do?
geom_smooth() adds a smoothed curve to the plot that estimates the conditional mean of y given x.
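When no method is given, geom_smooth() chooses a smoother automatically and draws a confidence band around it. The sketch below instead asks for a straight-line fit with method = "lm" and hides the band with se = FALSE. It uses five rows copied from the glimpse() output under the made-up name homes_sample, and treating a linear fit as appropriate for these data is an assumption made purely for illustration.

```r
library(ggplot2)

# First five rows of area and salesprice, copied from the glimpse() output.
# homes_sample is a hypothetical name used only for this sketch.
homes_sample <- data.frame(
  area       = c(3937, 1440, 1835, 2016, 2004),
  salesprice = c(690467.0, 235571.7, 272507.7, 277767.5, 148324.1)
)

smooth_plot <- ggplot(data = homes_sample,
                      mapping = aes(x = area, y = salesprice)) +
  geom_point() +
  # Straight-line fit, no confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# The smoother is stored as the second layer of the plot
smooth_layer_class <- class(smooth_plot$layers[[2]]$geom)[1]
```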
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
geom_smooth() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)")
The procedure used to construct plots can be summarized using the code below.
ggplot(data = [dataset],
mapping = aes(x = [x-variable], y = [y-variable])) +
geom_xxx() +
geom_xxx() +
other options
Question: What do you think eval = FALSE is doing in the code chunk above?
The code chunk argument eval controls whether or not the code chunk should be evaluated. Since the code chunk above contains pseudocode, we want to display it, but not evaluate it, so we include eval = FALSE.
An aesthetic is a visual property of one of the objects in your plot.
We can map a variable in our dataset to a color, a size, a transparency, and so on.
Question: What will the visualization look like below? Write your answer down before running the code.
It will display a scatterplot of salesprice versus area with points colored according to whether or not the house has a fireplace.
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice, color = fireplace)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)")
Question: What about this one?
It will display a scatterplot of salesprice versus area with points shaped differently based on whether or not the house has a fireplace.
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice, shape = fireplace)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)")
Question: This one?
It will display a scatterplot of salesprice versus area with points colored according to whether or not the house has a fireplace and sized according to the lot size.
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice, color = fireplace,
size = lotsize)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)")
Question: Are the above visualizations effective? Why or why not? How might you improve them?
Currently, none of these visualizations are particularly effective. There is a high density of houses with areas less than 3,000 square feet and sale prices under $500,000, so it is hard to see the relationship. Using triangles and circles to denote the presence or absence of a fireplace is a poor aesthetic choice, and sizing the points by lot size means all of the points run into each other. Considering the first plot: to improve it, you can consider adjusting the transparency (alpha), adding a geom_smooth(), adding a more informative title, and fixing the axis labels ($1,000,000 not 1000000).
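Two of these improvements can be sketched as follows: alpha = 0.5 makes overlapping points easier to see, and scales::label_dollar() formats the y-axis labels as dollar amounts. This is one possible approach rather than the course's prescribed solution. The data frame homes_sample is a made-up name holding five rows copied from the glimpse() output, and the scales package is assumed to be installed (it is installed alongside ggplot2).

```r
library(ggplot2)

# First five rows of area and salesprice, copied from the glimpse() output.
# homes_sample is a hypothetical name used only for this sketch.
homes_sample <- data.frame(
  area       = c(3937, 1440, 1835, 2016, 2004),
  salesprice = c(690467.0, 235571.7, 272507.7, 277767.5, 148324.1)
)

improved_plot <- ggplot(data = homes_sample,
                        mapping = aes(x = area, y = salesprice)) +
  # Semi-transparent points reduce the effect of overplotting
  geom_point(alpha = 0.5) +
  # Show axis labels as dollar amounts with thousands separators
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(title = "Sales price vs. area of homes in Minneapolis, MN",
       x = "Area (square feet)", y = "Sales Price (dollars)")

# The same labeller applied to a single value
formatted_price <- scales::label_dollar()(1000000)
```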
Question: What is the difference between the two plots below?
Use aes to describe how variables in your dataset are mapped to visual properties of the graph. If you want to do customization not based on a variable, use arguments in geom_xxx.
In the first plot below, color = "blue" is included in aes, so it is treated as a mapping between a variable and a visual property of the graph. “blue” is treated as a categorical variable that takes a single value “blue” and we give all of the values the same color. Change “blue” to something like “not-a-color” in the first plot. What do you notice?
To fix this, move color = "blue" outside of aes, as in the second plot.
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = area, y = salesprice, color = "blue"))
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = area, y = salesprice), color = "blue")
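One way to check the difference without looking at the plots is to inspect the colors that ggplot2 actually computes for each point, using layer_data(). The sketch below does this on five rows copied from the glimpse() output. The names homes_sample, mapped_plot, and set_plot are introduced here for illustration only.

```r
library(ggplot2)

# First five rows of area and salesprice, copied from the glimpse() output.
# homes_sample is a hypothetical name used only for this sketch.
homes_sample <- data.frame(
  area       = c(3937, 1440, 1835, 2016, 2004),
  salesprice = c(690467.0, 235571.7, 272507.7, 277767.5, 148324.1)
)

# "blue" inside aes(): treated as a one-level categorical variable
mapped_plot <- ggplot(data = homes_sample) +
  geom_point(mapping = aes(x = area, y = salesprice, color = "blue"))

# "blue" outside aes(): a fixed color applied to every point
set_plot <- ggplot(data = homes_sample) +
  geom_point(mapping = aes(x = area, y = salesprice), color = "blue")

# Colors ggplot2 assigns to the points in each plot
mapped_colours <- unique(layer_data(mapped_plot)$colour)
set_colours    <- unique(layer_data(set_plot)$colour)
```

In the first plot the points take a color from the default discrete palette, not the color named in the string, while in the second every point is drawn in blue.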
Mappings in the ggplot function are global, meaning they apply to every layer we add. Mappings in a particular geom_xxx function are local, meaning they apply only to that layer.
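The distinction can be sketched with a plot that maps x and y globally but maps color only inside geom_point(). The smoother then inherits x and y but not color, so it draws a single line for all homes. The data are five rows copied from the glimpse() output. The name homes_sample and the choice of factor(beds) as the color variable are assumptions made for illustration.

```r
library(ggplot2)

# First five rows of area, salesprice, and beds, copied from the
# glimpse() output. homes_sample is a hypothetical name for this sketch.
homes_sample <- data.frame(
  area       = c(3937, 1440, 1835, 2016, 2004),
  salesprice = c(690467.0, 235571.7, 272507.7, 277767.5, 148324.1),
  beds       = c(5, 2, 2, 3, 3)
)

local_plot <- ggplot(data = homes_sample,
                     mapping = aes(x = area, y = salesprice)) +  # global
  geom_point(mapping = aes(color = factor(beds))) +              # local
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Aesthetics declared at each level
global_aes <- names(local_plot$mapping)
point_aes  <- names(local_plot$layers[[1]]$mapping)
smooth_aes <- names(local_plot$layers[[2]]$mapping)
```

Moving color = factor(beds) into the ggplot() call would instead make it global, and geom_smooth() would attempt a separate fit for each number of bedrooms.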
Question: Create a scatterplot using variables of your choosing using the mn_homes data.
ggplot(data = mn_homes,
mapping = aes(x = lotsize, y = area)) +
geom_point()
Question: Modify your scatterplot above by coloring the points for each community.
ggplot(data = mn_homes,
mapping = aes(x = lotsize, y = area, color = community)) +
geom_point()
We can use smaller plots to display different subsets of the data using faceting. This is helpful to examine conditional relationships.
Let’s try a few simple examples of faceting. Note that these plots should be improved by careful consideration of labels, aesthetics, etc.
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)") +
facet_grid(. ~ beds)
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)") +
facet_grid(beds ~ .)
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)") +
facet_grid(beds ~ baths)
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)") +
facet_wrap(~ community)
facet_grid() creates a two-dimensional grid of plots based on all possible combinations of the specified rows and columns (rows ~ columns). It displays all plots even if some are empty. This function is helpful when we want to investigate a relationship for all possible combinations of two categorical variables. You can use a . in place of a variable to skip faceting on the rows or on the columns.
facet_wrap() creates a one-dimensional ribbon of plots (not a grid) and wraps them into two dimensions. This function is helpful when we have a single variable and want to investigate a relationship for all possible values of this variable.
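To see the difference in panel counts, the sketch below facets five rows copied from the glimpse() output both ways and counts the panels each layout produces. The name homes_sample is made up for this example, and reading the panel table from ggplot_build() relies on ggplot2 internals rather than anything the exercise requires.

```r
library(ggplot2)

# First five rows of area, salesprice, beds, and baths, copied from the
# glimpse() output. homes_sample is a hypothetical name for this sketch.
homes_sample <- data.frame(
  area       = c(3937, 1440, 1835, 2016, 2004),
  salesprice = c(690467.0, 235571.7, 272507.7, 277767.5, 148324.1),
  beds       = c(5, 2, 2, 3, 3),
  baths      = c(4, 1, 1, 2, 1)
)

base_plot <- ggplot(data = homes_sample,
                    mapping = aes(x = area, y = salesprice)) +
  geom_point()

# Grid: one panel per (beds, baths) combination, including empty ones
grid_plot <- base_plot + facet_grid(beds ~ baths)

# Wrap: one panel per value of a single variable
wrap_plot <- base_plot + facet_wrap(~ beds)

# Count the panels in each layout
n_grid_panels <- nrow(ggplot_build(grid_plot)$layout$layout)
n_wrap_panels <- nrow(ggplot_build(wrap_plot)$layout$layout)
```

With three distinct values each for beds and baths in this sample, the grid has nine panels even though only four combinations contain data, while the wrapped layout has three panels.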
All plots you develop in STA 199 should include an informative title, labeled axes, and any relevant annotations. You should also give careful consideration to aesthetic choices.
Code and narrative should not exceed the 80 character limit.
Use the alpha aesthetic to make the points more transparent.
When you are finished, remove eval = FALSE and knit the file to see the changes.
ggplot(data = mn_homes,
mapping = aes(x = lotsize, y = salesprice)) +
geom_point(color = "green", alpha = 0.5) +
labs(title = "Sales price versus lot size for Minneapolis Homes",
x = "Lot Size (square feet)", y = "Sales Price (USD)")
- Create a histogram of lotsize.
- Set fill = "green" inside the geom_histogram() function.
- Set color = "red" inside the geom_histogram() function.
When you are finished, remove eval = FALSE and knit the file to see the changes.
ggplot(data = mn_homes,
mapping = aes(x = lotsize)) +
geom_histogram(fill = "green", color = "red") +
labs(title = "Histogram of lot size for Minneapolis homes",
x = "Lot size (square feet)", y = "Count")
Question: What is the difference between the color and fill arguments?
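For bars, fill sets the interior color and color sets the outline. The sketch below checks this on five lot sizes copied from the glimpse() output by reading back the values ggplot2 assigns to each bar. The name homes_sample and the choice of bins = 3 are assumptions made for this small example.

```r
library(ggplot2)

# First five lot sizes, copied from the glimpse() output.
# homes_sample is a hypothetical name used only for this sketch.
homes_sample <- data.frame(
  lotsize = c(6192, 5160, 5040, 4875, 5060)
)

hist_plot <- ggplot(data = homes_sample,
                    mapping = aes(x = lotsize)) +
  # fill: interior of each bar; color: outline of each bar
  geom_histogram(bins = 3, fill = "green", color = "red")

# Values ggplot2 assigns to the bars
bar_data    <- layer_data(hist_plot)
bar_fill    <- unique(bar_data$fill)
bar_outline <- unique(bar_data$colour)
```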
ggplot(data = mn_homes,
aes(x = yearbuilt, fill = fireplace)) +
geom_density(alpha = 0.50) +
labs(x = "Year Built", y = "", fill = "Fireplace",
title = "Fireplaces through time by community") +
facet_wrap( ~ community)