Spatial Data and Visualization

Song of the Day

Eight Miles High - Hüsker Dü

Main Ideas

Spatial data is important
- exploratory data analysis
- detecting spatial patterns and trends
- understanding spatial data relationships
- analysis of spatial data should reflect spatial structure

Coming Up

Homework #02 due Thursday 2-11 at 11:59 PM
First team lab on Friday 2-12
Complete Lab Team Formation Survey by 2-09 11:59 PM

Hot Keys

Task / function	Windows & Linux	macOS
Insert R chunk	Ctrl+Alt+I	Command+Option+I
Knit document	Ctrl+Shift+K	Command+Shift+K
Run current line	Ctrl+Enter	Command+Enter
Run current chunk	Ctrl+Shift+Enter	Command+Shift+Enter
Run all chunks above	Ctrl+Alt+P	Command+Option+P
`<-`	Alt+-	Option+-
`%>%`	Ctrl+Shift+M	Command+Shift+M

Lecture Notes and Exercises

library(tidyverse)
library(sf)

Spatial data is different.

Our typical “tidy” data frame.

mpg

## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4         1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 audi         a4         1.8  1999     4 manual… f        21    29 p     comp…
##  3 audi         a4         2    2008     4 manual… f        20    31 p     comp…
##  4 audi         a4         2    2008     4 auto(a… f        21    30 p     comp…
##  5 audi         a4         2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 audi         a4         2.8  1999     6 manual… f        18    26 p     comp…
##  7 audi         a4         3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 audi         a4 quat…   1.8  1999     4 manual… 4        18    26 p     comp…
##  9 audi         a4 quat…   1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 audi         a4 quat…   2    2008     4 manual… 4        20    28 p     comp…
## # … with 224 more rows

A new simple feature object.

nc <- st_read("data/nc_covid.shp", quiet = TRUE)
nc

## Simple feature collection with 100 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27
## First 10 features:
##       county cases deaths case100 death100                       geometry
## 1   ALAMANCE 14163    173    8355      102 MULTIPOLYGON (((-79.24619 3...
## 2  ALEXANDER  3589     59    9571      157 MULTIPOLYGON (((-81.10889 3...
## 3  ALLEGHANY   844      4    7578       36 MULTIPOLYGON (((-81.23989 3...
## 4      ANSON  2104     44    8607      180 MULTIPOLYGON (((-79.91995 3...
## 5       ASHE  1706     34    6271      125 MULTIPOLYGON (((-81.47276 3...
## 6      AVERY  1690     16    9626       91 MULTIPOLYGON (((-81.94135 3...
## 7   BEAUFORT  3860     72    8214      153 MULTIPOLYGON (((-77.10377 3...
## 8     BERTIE  1545     35    8154      185 MULTIPOLYGON (((-76.78307 3...
## 9     BLADEN  2672     34    8166      104 MULTIPOLYGON (((-78.2615 34...
## 10 BRUNSWICK  6788    102    4753       71 MULTIPOLYGON (((-78.65572 3...

Question: What differences do you observe when comparing a typical tidy data frame to the new simple feature object?

Our typical “tidy” data frame mpg is a tibble where each observation is in a row, each variable is in a column, and each value is in its own cell. The tibble shows the type of each column and the number of rows and columns.

The new spatial data has a few additional elements. Above the data we see a header that begins “Simple feature collection…” and reports a geometry type, a dimension, a bounding box (bbox), and a coordinate reference system (CRS). In addition to the variables in columns, we have a new column, geometry, where each row has a MULTIPOLYGON label and some confusing-looking coordinates.

Simple features

A simple feature is a standard, formal way to describe how real-world spatial objects (country, building, tree, road, etc.) can be represented by a computer.

The package sf implements simple features and other spatial functionality using tidy principles. Simple features have a geometry type. Common choices are shown in the slides associated with today’s lecture.

Simple features are stored in a data frame, with the geographic information in a column called geometry. Simple features can contain both spatial and non-spatial data.
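
As a minimal sketch of this structure, the chunk below confirms that a simple feature object is an ordinary data frame with one extra geometry column. It assumes the example shapefile that ships with the sf package rather than the course file data/nc_covid.shp, so the columns (NAME, BIR74, SID74) differ from the ones used in lecture. The object name nc_demo is chosen only for illustration.

```r
library(sf)

# Assumption: use the North Carolina shapefile bundled with sf,
# not the course file data/nc_covid.shp
nc_demo <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# An sf object is still a data frame
class(nc_demo)

# Name of the column that holds the spatial information
attr(nc_demo, "sf_column")

# The geometry column on its own
st_geometry(nc_demo)

# The non-spatial variables on their own
head(st_drop_geometry(nc_demo)[, c("NAME", "BIR74", "SID74")])
```

The non-spatial columns behave like any other data frame columns; only the geometry column needs special handling.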

Functions in the sf package that operate on spatial data helpfully begin with st_.

`sf` and `ggplot`

To read simple features from a file or database, use the function st_read().

nc <- st_read("data/nc_covid.shp", quiet = TRUE)
nc

## Simple feature collection with 100 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27
## First 10 features:
##       county cases deaths case100 death100                       geometry
## 1   ALAMANCE 14163    173    8355      102 MULTIPOLYGON (((-79.24619 3...
## 2  ALEXANDER  3589     59    9571      157 MULTIPOLYGON (((-81.10889 3...
## 3  ALLEGHANY   844      4    7578       36 MULTIPOLYGON (((-81.23989 3...
## 4      ANSON  2104     44    8607      180 MULTIPOLYGON (((-79.91995 3...
## 5       ASHE  1706     34    6271      125 MULTIPOLYGON (((-81.47276 3...
## 6      AVERY  1690     16    9626       91 MULTIPOLYGON (((-81.94135 3...
## 7   BEAUFORT  3860     72    8214      153 MULTIPOLYGON (((-77.10377 3...
## 8     BERTIE  1545     35    8154      185 MULTIPOLYGON (((-76.78307 3...
## 9     BLADEN  2672     34    8166      104 MULTIPOLYGON (((-78.2615 34...
## 10 BRUNSWICK  6788    102    4753       71 MULTIPOLYGON (((-78.65572 3...

Notice nc contains both spatial and non-spatial information.

We can build up a visualization layer-by-layer beginning with ggplot. Let’s start by making a basic plot of North Carolina counties.

ggplot(nc) +
  geom_sf() +
  labs(title = "North Carolina counties")

Now adjust the theme with theme_bw().

ggplot(nc) +
  geom_sf() +
  labs(title = "North Carolina counties with theme") + 
  theme_bw()

Now adjust color in geom_sf to change the color of the county borders.

ggplot(nc) +
  geom_sf(color = "darkgreen") +
  labs(title = "North Carolina counties with theme and aesthetics") + 
  theme_bw()

Then increase the width of the county borders using size. (In ggplot2 3.4.0 and later, the border width of polygons is set with linewidth instead.)

ggplot(nc) +
  geom_sf(color = "darkgreen", size = 1.5) +
  labs(title = "North Carolina counties with theme and aesthetics") +
  theme_bw()

Fill the counties by specifying a fill argument.

ggplot(nc) +
  geom_sf(color = "darkgreen", size = 1.5, fill = "orange") +
  labs(title = "North Carolina counties with theme and aesthetics") +
  theme_bw()

Finally, adjust the transparency using alpha.

ggplot(nc) +
  geom_sf(color = "darkgreen", size = 1.5, fill = "orange", alpha = 0.50) +
  labs(title = "North Carolina counties with theme and aesthetics") +
  theme_bw()

Our current map is a bit much. Adjust color, size, fill, and alpha until you have a map that effectively displays the counties of North Carolina.

North Carolina COVID-19 Mapping

Now let’s use mapping = aes() to map variables in our dataset to visual properties of the spatial visualization.

The nc data was pulled from the New York Times COVID-19 Dashboard as of 02-04-2021.

The dataset contains the following variables on all North Carolina counties:

county: county name
cases: total number of COVID-19 cases
deaths: total number of COVID-19 deaths
case100: number of COVID-19 cases per 100,000
death100: number of COVID-19 deaths per 100,000

Let’s use the COVID-19 data to generate a choropleth map.

ggplot(nc) +
  geom_sf(aes(fill = cases)) + 
  labs(title = "Higher population counties have more COVID-19 cases",
       fill = "# cases") + 
  theme_bw()

It is best to choose your own color palette. A great resource is colorbrewer2.

One way to set fill colors is with scale_fill_gradient().

ggplot(nc) +
  geom_sf(aes(fill = cases)) +
  scale_fill_gradient(low = "#fee8c8", high = "#7f0000") +
  labs(title = "Higher population counties have more COVID-19 cases",
       fill = "# cases") + 
  theme_bw()

Question: Is the above visualization informative? Why or why not? Try to improve it using the code chunk provided below.

This visualization is not particularly effective. Raw case counts mostly track population size, so the map is basically just showing us which counties have the most residents.

Let’s improve it by plotting cases per 100,000 residents (case100) instead of the total number of cases (cases).

ggplot(nc) +
  geom_sf(aes(fill = case100)) +
  scale_fill_gradient(low = "#fff7f3", high = "#49006a") +
  labs(title = "COVID-19 cases in North Carolina",
       fill = "cases per 100k") +
  theme_bw()
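
The ColorBrewer palettes mentioned earlier can also be applied by name instead of copying hex codes. The sketch below uses scale_fill_distiller() from ggplot2 for a continuous fill. It assumes the shapefile bundled with sf in place of data/nc_covid.shp, and sid_rate is a column computed here only for illustration.

```r
library(ggplot2)
library(sf)

# Assumption: bundled example data stands in for data/nc_covid.shp
nc_demo <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Use a rate rather than a raw count so counties are comparable
nc_demo$sid_rate <- 1000 * nc_demo$SID74 / nc_demo$BIR74

p <- ggplot(nc_demo) +
  geom_sf(aes(fill = sid_rate)) +
  # "OrRd" is a sequential ColorBrewer palette; direction = 1 maps
  # light colors to low values and dark colors to high values
  scale_fill_distiller(palette = "OrRd", direction = 1) +
  labs(title = "SIDS rate in North Carolina (SID74 / BIR74)",
       fill = "per 1,000 births") +
  theme_bw()
p
```

A sequential palette suits a variable that runs from low to high; a diverging palette is a better fit when there is a meaningful middle value.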

Challenges

Different types of data exist (raster and vector).
The coordinate reference system (CRS) matters.
Manipulating spatial data objects is similar, but not identical, to manipulating data frames.
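
To see why the CRS matters, here is a small sketch that inspects the CRS and then reprojects the data. It assumes the shapefile bundled with sf, which, like the course data, is stored in NAD27. The choice of EPSG:32119 (NAD83 / North Carolina, in meters) as the target is an assumption made for illustration, not something the lecture prescribes.

```r
library(sf)

# Assumption: bundled example data stands in for data/nc_covid.shp
nc_demo <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Geographic CRS: coordinates are longitude and latitude in degrees
st_crs(nc_demo)$input
st_is_longlat(nc_demo)

# Reproject to EPSG:32119 (NAD83 / North Carolina, units in meters)
nc_proj <- st_transform(nc_demo, 32119)
st_is_longlat(nc_proj)

# The bounding box changes from degrees to meters
st_bbox(nc_demo)
st_bbox(nc_proj)
```

The attribute columns are unchanged by st_transform(); only the coordinates in the geometry column are recomputed.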

`dplyr`

The sf package plays nicely with our earlier data wrangling functions from dplyr.

`select()`

nc %>% 
  select(deaths, death100)

## Simple feature collection with 100 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27
## First 10 features:
##    deaths death100                       geometry
## 1     173      102 MULTIPOLYGON (((-79.24619 3...
## 2      59      157 MULTIPOLYGON (((-81.10889 3...
## 3       4       36 MULTIPOLYGON (((-81.23989 3...
## 4      44      180 MULTIPOLYGON (((-79.91995 3...
## 5      34      125 MULTIPOLYGON (((-81.47276 3...
## 6      16       91 MULTIPOLYGON (((-81.94135 3...
## 7      72      153 MULTIPOLYGON (((-77.10377 3...
## 8      35      185 MULTIPOLYGON (((-76.78307 3...
## 9      34      104 MULTIPOLYGON (((-78.2615 34...
## 10    102       71 MULTIPOLYGON (((-78.65572 3...

`filter()`

nc %>% 
  filter(deaths > 100)

## Simple feature collection with 34 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -82.88111 ymin: 33.88199 xmax: -76.62562 ymax: 36.56521
## geographic CRS: NAD27
## First 10 features:
##        county cases deaths case100 death100                       geometry
## 1    ALAMANCE 14163    173    8355      102 MULTIPOLYGON (((-79.24619 3...
## 2   BRUNSWICK  6788    102    4753       71 MULTIPOLYGON (((-78.65572 3...
## 3    BUNCOMBE 13774    266    5274      102 MULTIPOLYGON (((-82.2581 35...
## 4       BURKE  8590    112    9493      124 MULTIPOLYGON (((-81.81628 3...
## 5    CABARRUS 16577    201    7658       93 MULTIPOLYGON (((-80.50294 3...
## 6     CATAWBA 15980    246   10016      154 MULTIPOLYGON (((-80.96143 3...
## 7   CLEVELAND  9460    192    9658      196 MULTIPOLYGON (((-81.32282 3...
## 8    COLUMBUS  5277    123    9507      222 MULTIPOLYGON (((-78.65572 3...
## 9      CRAVEN  7322    108    7169      106 MULTIPOLYGON (((-76.89761 3...
## 10 CUMBERLAND 21049    218    6274       65 MULTIPOLYGON (((-78.49929 3...

`summarize()`

We can use summarize() to find the total deaths and cases in North Carolina, but note that the county geometries are combined into a single feature, so the geometry no longer carries county-level information.

nc %>% 
  summarize(total_deaths = sum(deaths),
            total_cases = sum(cases))

## Simple feature collection with 1 feature and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27
##   total_deaths total_cases                       geometry
## 1         9625      779727 MULTIPOLYGON (((-77.96073 3...
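
As a hedged sketch of what summarize() does to the geometry, the chunk below summarizes once over all counties and once by group. It assumes the shapefile bundled with sf rather than data/nc_covid.shp, and the grouping variable large is hypothetical, created only to have two groups.

```r
library(dplyr)
library(sf)

# Assumption: bundled example data stands in for data/nc_covid.shp
nc_demo <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Summarizing over all rows returns a single feature
state_total <- nc_demo %>%
  summarize(total_births = sum(BIR74))
nrow(state_total)

# Hypothetical grouping: counties above or below the median birth count
by_size <- nc_demo %>%
  mutate(large = BIR74 > median(BIR74)) %>%
  group_by(large) %>%
  summarize(total_births = sum(BIR74))
nrow(by_size)
```

Each output row gets one geometry built from the counties in its group, so a grouped summary can still be mapped with geom_sf().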

Geometries are “sticky”. They are kept until deliberately dropped using st_drop_geometry().

nc %>% 
  select(county, deaths) %>% 
  filter(deaths > 100) %>% 
  st_drop_geometry()

##         county deaths
## 1     ALAMANCE    173
## 2    BRUNSWICK    102
## 3     BUNCOMBE    266
## 4        BURKE    112
## 5     CABARRUS    201
## 6      CATAWBA    246
## 7    CLEVELAND    192
## 8     COLUMBUS    123
## 9       CRAVEN    108
## 10  CUMBERLAND    218
## 11    DAVIDSON    131
## 12      DUPLIN    116
## 13      DURHAM    189
## 14     FORSYTH    283
## 15      GASTON    328
## 16    GUILFORD    425
## 17     HARNETT    119
## 18   HENDERSON    127
## 19     IREDELL    163
## 20    JOHNSTON    168
## 21 MECKLENBURG    794
## 22       MOORE    131
## 23        NASH    143
## 24 NEW HANOVER    131
## 25      ONSLOW    115
## 26    RANDOLPH    186
## 27     ROBESON    183
## 28       ROWAN    252
## 29  RUTHERFORD    176
## 30       SURRY    112
## 31       UNION    159
## 32        WAKE    444
## 33       WAYNE    187
## 34      WILSON    141
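
The stickiness can be checked directly by comparing column names before and after dropping the geometry. This is a minimal sketch that assumes the shapefile bundled with sf, so the column names (NAME, SID74) differ from those in data/nc_covid.shp.

```r
library(dplyr)
library(sf)

# Assumption: bundled example data stands in for data/nc_covid.shp
nc_demo <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# select() keeps the geometry column even though we did not ask for it
kept <- nc_demo %>%
  select(NAME, SID74)
names(kept)

# After st_drop_geometry() the result is an ordinary data frame
dropped <- kept %>%
  st_drop_geometry()
names(dropped)
class(dropped)
```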

Practice

Construct an effective visualization investigating the spatial distribution of COVID-19 deaths in North Carolina. Carefully consider aesthetics and choose your own color palette using colorbrewer2.

Below we plot the COVID-19 deaths per 100,000 residents for North Carolina counties.

ggplot(nc) +
  geom_sf(aes(fill = death100)) +
  scale_fill_gradient(low = "#fff7f3", high = "#49006a") +
  labs(title = "COVID-19 deaths in North Carolina",
       fill = "deaths per 100k") +
  theme_bw()

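Let’s demonstrate the use of mid and midpoint in scale_fill_gradient2(). The mid argument sets the color at the center of the diverging scale, and midpoint sets the data value that receives that color; midpoint defaults to 0. Here we set the midpoint to the approximate number of cases per 100,000 in the entire United States.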

ggplot(nc) +
  geom_sf(aes(fill = case100)) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 8000) +
  labs(title = "COVID-19 cases in North Carolina",
       fill = "cases per 100k") +
  theme_bw()
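
Rather than typing in a reference value, the midpoint can also be computed from the data. The sketch below centers the scale on the state-wide rate. It assumes the shapefile bundled with sf in place of data/nc_covid.shp, and sid_rate and state_rate are names introduced only for this illustration.

```r
library(ggplot2)
library(sf)

# Assumption: bundled example data stands in for data/nc_covid.shp
nc_demo <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# County-level rate per 1,000 births
nc_demo$sid_rate <- 1000 * nc_demo$SID74 / nc_demo$BIR74

# State-wide rate, used as the reference value for the diverging scale
state_rate <- 1000 * sum(nc_demo$SID74) / sum(nc_demo$BIR74)

p <- ggplot(nc_demo) +
  geom_sf(aes(fill = sid_rate)) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = state_rate) +
  labs(title = "SIDS rate relative to the state-wide rate",
       fill = "per 1,000 births") +
  theme_bw()
p
```

Counties near the state-wide rate appear close to white, while counties above and below it shade toward red and blue.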

Which counties have a relatively large number of deaths given their case count? Which counties have a relatively small number of deaths given their case count? Construct an effective visualization to answer this question and carefully consider all aesthetic choices.

ggplot(nc) +
  geom_sf(aes(fill = deaths / cases)) +
  scale_fill_gradient(low = "#fff7f3", high = "#49006a") +
  labs(title = "COVID-19 case fatality ratio in NC",
       fill = "deaths / cases") +
  theme_bw()
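
A ranked table can complement a map like this one by naming the counties at each extreme. The sketch below computes a ratio with mutate(), sorts it, and drops the geometry for printing. It assumes the shapefile bundled with sf, so SID74 / BIR74 stands in for deaths / cases, and the column name ratio is chosen for illustration.

```r
library(dplyr)
library(sf)

# Assumption: bundled example data stands in for data/nc_covid.shp
nc_demo <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Compute the ratio, sort from largest to smallest, keep a few columns
ranked <- nc_demo %>%
  mutate(ratio = SID74 / BIR74) %>%
  arrange(desc(ratio)) %>%
  select(NAME, SID74, BIR74, ratio) %>%
  st_drop_geometry()

head(ranked, 5)  # counties with the largest ratios
tail(ranked, 5)  # counties with the smallest ratios
```

Keeping the numerator and denominator in the table makes it easier to spot ratios that come from very small counts.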

What are limitations of your visualizations above?

Additional Resources

Simple features in R
Coordinate reference systems
Geographic data in R
Leaflet
- Great resources for advanced spatial visualization