Main Ideas
- Spatial data is important
- exploratory data analysis
- detecting spatial patterns and trends
- understanding spatial data relationships
- analysis of spatial data should reflect spatial structure
Task / function | Windows & Linux | macOS |
---|---|---|
Insert R chunk | Ctrl+Alt+I | Command+Option+I |
Knit document | Ctrl+Shift+K | Command+Shift+K |
Run current line | Ctrl+Enter | Command+Enter |
Run current chunk | Ctrl+Shift+Enter | Command+Shift+Enter |
Run all chunks above | Ctrl+Alt+P | Command+Option+P |
<- | Alt+- | Option+- |
%>% | Ctrl+Shift+M | Command+Shift+M |
library(tidyverse)
library(sf)
Spatial data is different.
Our typical “tidy” dataframe.
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manual… f 21 29 p comp…
## 3 audi a4 2 2008 4 manual… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto(a… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto(l… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manual… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto(a… f 18 27 p comp…
## 8 audi a4 quat… 1.8 1999 4 manual… 4 18 26 p comp…
## 9 audi a4 quat… 1.8 1999 4 auto(l… 4 16 25 p comp…
## 10 audi a4 quat… 2 2008 4 manual… 4 20 28 p comp…
## # … with 224 more rows
A new simple feature object.
nc <- st_read("data/nc_covid.shp", quiet = TRUE)
nc
## Simple feature collection with 100 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27
## First 10 features:
## county cases deaths case100 death100 geometry
## 1 ALAMANCE 14163 173 8355 102 MULTIPOLYGON (((-79.24619 3...
## 2 ALEXANDER 3589 59 9571 157 MULTIPOLYGON (((-81.10889 3...
## 3 ALLEGHANY 844 4 7578 36 MULTIPOLYGON (((-81.23989 3...
## 4 ANSON 2104 44 8607 180 MULTIPOLYGON (((-79.91995 3...
## 5 ASHE 1706 34 6271 125 MULTIPOLYGON (((-81.47276 3...
## 6 AVERY 1690 16 9626 91 MULTIPOLYGON (((-81.94135 3...
## 7 BEAUFORT 3860 72 8214 153 MULTIPOLYGON (((-77.10377 3...
## 8 BERTIE 1545 35 8154 185 MULTIPOLYGON (((-76.78307 3...
## 9 BLADEN 2672 34 8166 104 MULTIPOLYGON (((-78.2615 34...
## 10 BRUNSWICK 6788 102 4753 71 MULTIPOLYGON (((-78.65572 3...
Question: What differences do you observe when comparing a typical tidy data frame to the new simple feature object?
Our typical “tidy” data frame mpg is a tibble where each observation is in a row, each variable is in a column, and each value is in its own cell. The tibble shows the type of each column and the number of rows and columns.
The new spatial data has a few additional elements. Above the data we see a header, “Simple feature collection…”, that reports a “geometry type”, a “dimension”, a bounding box “bbox”, and a coordinate reference system “CRS”. In addition to the variables in columns, there is a new column, geometry, where each row has a MULTIPOLYGON label and some confusing-looking coordinates.
A simple feature is a standard, formal way to describe how real-world spatial objects (country, building, tree, road, etc.) can be represented by a computer.
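As a minimal sketch of that idea, the chunk below builds a small simple feature object by hand from an ordinary data frame. The three place names and their longitude/latitude values are illustrative approximations and are not part of the lecture data; `st_as_sf()` with its `coords` and `crs` arguments comes from the sf package.

```r
library(sf)

# Hypothetical example data: approximate coordinates of three NC cities
places <- data.frame(
  name = c("Durham", "Raleigh", "Asheville"),
  lon  = c(-78.90, -78.64, -82.55),
  lat  = c(35.99, 35.78, 35.60)
)

# Turn the lon/lat columns into a point geometry column (WGS 84, EPSG:4326)
places_sf <- st_as_sf(places, coords = c("lon", "lat"), crs = 4326)

class(places_sf)             # an sf object is still a data frame
st_geometry_type(places_sf)  # each row carries one POINT geometry
```

Each real-world object (here, a city) becomes one row, with its location stored in the geometry column and its other attributes in regular columns.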
The package sf implements simple features and other spatial functionality using tidy principles. Simple features have a geometry type. Common choices are shown in the slides associated with today’s lecture.
Simple features are stored in a data frame, with the geographic information in a column called geometry. Simple features can contain both spatial and non-spatial data.
Functions in the sf package that operate on spatial data helpfully begin with st_.
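To make the pieces of a simple feature object concrete, here is a short sketch that inspects one with a few `st_` functions. It assumes the North Carolina shapefile that ships with the sf package (`shape/nc.shp`) as a stand-in for the lecture's COVID-19 file, so the attribute columns (`NAME`, `BIR74`) differ from those in `nc`.

```r
library(sf)

# Stand-in data: the North Carolina shapefile bundled with sf
nc_demo <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

st_geometry_type(nc_demo, by_geometry = FALSE)  # geometry type of the collection
st_bbox(nc_demo)                                # bounding box
st_crs(nc_demo)                                 # coordinate reference system

# The non-spatial columns on their own, as a plain data frame
head(st_drop_geometry(nc_demo)[, c("NAME", "BIR74")])
```

These calls return the same pieces of information that are printed in the header of a simple feature collection.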
sf and ggplot
To read simple features from a file or database, use the function st_read().
nc <- st_read("data/nc_covid.shp", quiet = TRUE)
nc
## Simple feature collection with 100 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27
## First 10 features:
## county cases deaths case100 death100 geometry
## 1 ALAMANCE 14163 173 8355 102 MULTIPOLYGON (((-79.24619 3...
## 2 ALEXANDER 3589 59 9571 157 MULTIPOLYGON (((-81.10889 3...
## 3 ALLEGHANY 844 4 7578 36 MULTIPOLYGON (((-81.23989 3...
## 4 ANSON 2104 44 8607 180 MULTIPOLYGON (((-79.91995 3...
## 5 ASHE 1706 34 6271 125 MULTIPOLYGON (((-81.47276 3...
## 6 AVERY 1690 16 9626 91 MULTIPOLYGON (((-81.94135 3...
## 7 BEAUFORT 3860 72 8214 153 MULTIPOLYGON (((-77.10377 3...
## 8 BERTIE 1545 35 8154 185 MULTIPOLYGON (((-76.78307 3...
## 9 BLADEN 2672 34 8166 104 MULTIPOLYGON (((-78.2615 34...
## 10 BRUNSWICK 6788 102 4753 71 MULTIPOLYGON (((-78.65572 3...
Notice that nc contains both spatial and non-spatial information.
We can build up a visualization layer by layer, beginning with ggplot(). Let’s start by making a basic plot of North Carolina counties.
ggplot(nc) +
  geom_sf() +
  labs(title = "North Carolina counties")
Now adjust the theme with theme_bw().
ggplot(nc) +
  geom_sf() +
  labs(title = "North Carolina counties with theme") +
  theme_bw()
Now adjust color in geom_sf() to change the color of the county borders.
ggplot(nc) +
  geom_sf(color = "darkgreen") +
  labs(title = "North Carolina counties with theme and aesthetics") +
  theme_bw()
Then increase the width of the county borders using size. (In ggplot2 3.4.0 and later, line widths are set with linewidth rather than size.)
ggplot(nc) +
  geom_sf(color = "darkgreen", size = 1.5) +
  labs(title = "North Carolina counties with theme and aesthetics") +
  theme_bw()
Fill the counties by specifying a fill argument.
ggplot(nc) +
  geom_sf(color = "darkgreen", size = 1.5, fill = "orange") +
  labs(title = "North Carolina counties with theme and aesthetics") +
  theme_bw()
Finally, adjust the transparency using alpha.
ggplot(nc) +
  geom_sf(color = "darkgreen", size = 1.5, fill = "orange", alpha = 0.50) +
  labs(title = "North Carolina counties with theme and aesthetics") +
  theme_bw()
Our current map is a bit much. Adjust color, size, fill, and alpha until you have a map that effectively displays the counties of North Carolina.
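One possible toned-down version is sketched below. It is only a starting point, not the intended answer: the gray tones are arbitrary choices, and the chunk assumes the North Carolina shapefile bundled with sf (`shape/nc.shp`) in place of `data/nc_covid.shp` so that it runs on its own.

```r
library(sf)
library(ggplot2)

# Stand-in data: the North Carolina shapefile bundled with sf
nc_demo <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

p <- ggplot(nc_demo) +
  geom_sf(color = "gray40", fill = "gray95") +  # muted borders, light fill
  labs(title = "North Carolina counties") +
  theme_bw()
p
```

Muted borders and a light fill keep the county outlines readable without drawing attention away from any data added later.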
Now let’s use mapping = aes() to map variables in our dataset to visual properties of the spatial visualization.
The nc data was pulled from the New York Times COVID-19 Dashboard as of 02-04-2021.
The dataset contains the following variables on all North Carolina counties:
- county: county name
- cases: total number of COVID-19 cases
- deaths: total number of COVID-19 deaths
- case100: number of COVID-19 cases per 100,000
- death100: number of COVID-19 deaths per 100,000
Let’s use the COVID-19 data to generate a choropleth map.
ggplot(nc) +
  geom_sf(aes(fill = cases)) +
  labs(title = "Higher population counties have more COVID-19 cases",
       fill = "# cases") +
  theme_bw()
It is best to choose your own color palette. A great resource is colorbrewer2.
One way to set fill colors is with scale_fill_gradient().
ggplot(nc) +
  geom_sf(aes(fill = cases)) +
  scale_fill_gradient(low = "#fee8c8", high = "#7f0000") +
  labs(title = "Higher population counties have more COVID-19 cases",
       fill = "# cases") +
  theme_bw()
Question: Is the above visualization informative? Why or why not? Try to improve it using the code chunk provided below.
This visualization is not particularly effective. It mostly just shows us which counties have large populations.
Let’s improve it by plotting cases per 100,000 residents (case100) instead of the raw number of cases (cases).
ggplot(nc) +
  geom_sf(aes(fill = case100)) +
  scale_fill_gradient(low = "#fff7f3", high = "#49006a") +
  labs(title = "COVID-19 cases in North Carolina",
       fill = "cases per 100k") +
  theme_bw()
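The `case100` column arrives precomputed in `nc`. As a rough sketch of how a count can be turned into a rate with `mutate()`, the chunk below uses the North Carolina shapefile bundled with sf as a stand-in. Its columns `SID74` (SIDS deaths) and `BIR74` (births) are different variables from the COVID-19 data, and the new column name `sid_per_1000` is made up for illustration.

```r
library(sf)
library(dplyr)

# Stand-in data: the North Carolina shapefile bundled with sf
nc_demo <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Convert a raw count into a rate so large and small counties are comparable
nc_rate <- nc_demo %>%
  mutate(sid_per_1000 = SID74 / BIR74 * 1000)

nc_rate %>%
  select(NAME, SID74, BIR74, sid_per_1000) %>%
  head()
```

Dividing by a measure of county size removes most of the "big counties have big numbers" effect seen in the map of raw counts.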
Different types of data exist (raster and vector).
The coordinate reference system (CRS) matters.
Manipulating spatial data objects is similar, but not identical, to manipulating data frames.
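To illustrate the point about coordinate reference systems, here is a brief sketch that checks the CRS of a simple feature object and reprojects it with `st_transform()`. It assumes the North Carolina shapefile bundled with sf as a stand-in for the lecture data; the target CRS, WGS 84 (EPSG:4326), is just one common choice.

```r
library(sf)

# Stand-in data: the North Carolina shapefile bundled with sf
nc_demo <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

st_crs(nc_demo)$input  # CRS the file was stored in

# Reproject to WGS 84 (EPSG:4326); the attribute columns are unchanged
nc_wgs84 <- st_transform(nc_demo, crs = 4326)

st_crs(nc_wgs84)$epsg
st_bbox(nc_demo)       # compare the two bounding boxes
st_bbox(nc_wgs84)
```

Layers that will be plotted or analyzed together should generally share a CRS, so checking `st_crs()` early is a useful habit.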
dplyr
The sf package plays nicely with our earlier data wrangling functions from dplyr.
select()
nc %>%
  select(deaths, death100)
## Simple feature collection with 100 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27
## First 10 features:
## deaths death100 geometry
## 1 173 102 MULTIPOLYGON (((-79.24619 3...
## 2 59 157 MULTIPOLYGON (((-81.10889 3...
## 3 4 36 MULTIPOLYGON (((-81.23989 3...
## 4 44 180 MULTIPOLYGON (((-79.91995 3...
## 5 34 125 MULTIPOLYGON (((-81.47276 3...
## 6 16 91 MULTIPOLYGON (((-81.94135 3...
## 7 72 153 MULTIPOLYGON (((-77.10377 3...
## 8 35 185 MULTIPOLYGON (((-76.78307 3...
## 9 34 104 MULTIPOLYGON (((-78.2615 34...
## 10 102 71 MULTIPOLYGON (((-78.65572 3...
filter()
nc %>%
  filter(deaths > 100)
## Simple feature collection with 34 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -82.88111 ymin: 33.88199 xmax: -76.62562 ymax: 36.56521
## geographic CRS: NAD27
## First 10 features:
## county cases deaths case100 death100 geometry
## 1 ALAMANCE 14163 173 8355 102 MULTIPOLYGON (((-79.24619 3...
## 2 BRUNSWICK 6788 102 4753 71 MULTIPOLYGON (((-78.65572 3...
## 3 BUNCOMBE 13774 266 5274 102 MULTIPOLYGON (((-82.2581 35...
## 4 BURKE 8590 112 9493 124 MULTIPOLYGON (((-81.81628 3...
## 5 CABARRUS 16577 201 7658 93 MULTIPOLYGON (((-80.50294 3...
## 6 CATAWBA 15980 246 10016 154 MULTIPOLYGON (((-80.96143 3...
## 7 CLEVELAND 9460 192 9658 196 MULTIPOLYGON (((-81.32282 3...
## 8 COLUMBUS 5277 123 9507 222 MULTIPOLYGON (((-78.65572 3...
## 9 CRAVEN 7322 108 7169 106 MULTIPOLYGON (((-76.89761 3...
## 10 CUMBERLAND 21049 218 6274 65 MULTIPOLYGON (((-78.49929 3...
summarize()
We can use summarize() to find the total deaths and cases in North Carolina, but note that the geometry is no longer meaningful at the county level: the county polygons are merged into a single feature.
nc %>%
  summarize(total_deaths = sum(deaths),
            total_cases = sum(cases))
## Simple feature collection with 1 feature and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27
## total_deaths total_cases geometry
## 1 9625 779727 MULTIPOLYGON (((-77.96073 3...
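The sketch below contrasts summarizing with and without the geometry column. It assumes the North Carolina shapefile bundled with sf as a stand-in, so it totals births (`BIR74`) rather than COVID-19 counts; `state_sf` and `state_df` are illustrative names.

```r
library(sf)
library(dplyr)

# Stand-in data: the North Carolina shapefile bundled with sf
nc_demo <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Keeping the geometry: the 100 county polygons are combined into one feature
state_sf <- nc_demo %>%
  summarize(total_births = sum(BIR74))

# Dropping the geometry first: the result is an ordinary one-row data frame
state_df <- nc_demo %>%
  st_drop_geometry() %>%
  summarize(total_births = sum(BIR74))

nrow(state_sf)
class(state_df)
```

If only the totals are needed, dropping the geometry first avoids the cost of combining the polygons.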
Geometries are “sticky”. They are kept until deliberately dropped using st_drop_geometry().
nc %>%
  select(county, deaths) %>%
  filter(deaths > 100) %>%
  st_drop_geometry()
## county deaths
## 1 ALAMANCE 173
## 2 BRUNSWICK 102
## 3 BUNCOMBE 266
## 4 BURKE 112
## 5 CABARRUS 201
## 6 CATAWBA 246
## 7 CLEVELAND 192
## 8 COLUMBUS 123
## 9 CRAVEN 108
## 10 CUMBERLAND 218
## 11 DAVIDSON 131
## 12 DUPLIN 116
## 13 DURHAM 189
## 14 FORSYTH 283
## 15 GASTON 328
## 16 GUILFORD 425
## 17 HARNETT 119
## 18 HENDERSON 127
## 19 IREDELL 163
## 20 JOHNSTON 168
## 21 MECKLENBURG 794
## 22 MOORE 131
## 23 NASH 143
## 24 NEW HANOVER 131
## 25 ONSLOW 115
## 26 RANDOLPH 186
## 27 ROBESON 183
## 28 ROWAN 252
## 29 RUTHERFORD 176
## 30 SURRY 112
## 31 UNION 159
## 32 WAKE 444
## 33 WAYNE 187
## 34 WILSON 141
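To see the "sticky" behavior directly, the following sketch compares column names before and after dropping the geometry. It assumes the North Carolina shapefile bundled with sf as a stand-in for `nc`, so the selected columns are `NAME` and `BIR74`.

```r
library(sf)
library(dplyr)

# Stand-in data: the North Carolina shapefile bundled with sf
nc_demo <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# select() asks for two columns, but the geometry column comes along too
kept <- nc_demo %>% select(NAME, BIR74)
names(kept)

# After st_drop_geometry() the result is a plain data frame
dropped <- kept %>% st_drop_geometry()
names(dropped)
class(dropped)
```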
Below we plot the COVID-19 deaths per 100,000 residents for North Carolina counties.
ggplot(nc) +
  geom_sf(aes(fill = death100)) +
  scale_fill_gradient(low = "#fff7f3", high = "#49006a") +
  labs(title = "COVID-19 deaths in North Carolina",
       fill = "deaths per 100k") +
  theme_bw()
Let’s demonstrate the use of mid in scale_fill_gradient2(). The midpoint defaults to 0. Here we specify that the midpoint is the approximate number of cases per 100,000 in the entire United States.
ggplot(nc) +
  geom_sf(aes(fill = case100)) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 8000) +
  labs(title = "COVID-19 cases in North Carolina",
       fill = "cases per 100k") +
  theme_bw()
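The midpoint can also be computed from the data instead of being typed in. The sketch below centers a diverging scale on a statewide rate. It assumes the North Carolina shapefile bundled with sf as a stand-in, so the rate is SIDS deaths per 1,000 births (`SID74`, `BIR74`) rather than COVID-19 cases, and `state_rate` and `sid_per_1000` are illustrative names.

```r
library(sf)
library(dplyr)
library(ggplot2)

# Stand-in data: the North Carolina shapefile bundled with sf
nc_demo <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE) %>%
  mutate(sid_per_1000 = SID74 / BIR74 * 1000)

# Statewide rate, used as the neutral (white) point of the diverging scale
state_rate <- sum(nc_demo$SID74) / sum(nc_demo$BIR74) * 1000

p <- ggplot(nc_demo) +
  geom_sf(aes(fill = sid_per_1000)) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = state_rate) +
  labs(fill = "SIDS per 1,000 births") +
  theme_bw()
p
```

Counties below the reference rate shade toward blue and counties above it shade toward red, which is the reading a diverging scale is meant to support.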
ggplot(nc) +
  geom_sf(aes(fill = deaths / cases)) +
  scale_fill_gradient(low = "#fff7f3", high = "#49006a") +
  labs(title = "COVID-19 case fatality ratio in NC",
       fill = "deaths / cases") +
  theme_bw()