Main Ideas
- An increasing amount of data is available on the web.
- These data are often in an unstructured format that is tedious to obtain manually.
- Webscraping refers to the process of automating information extraction from unstructured sources and transforming it to a structured dataset.
- We will examine two different methods to scrape the web:
- Web APIs (application programming interfaces): useful when a website allows for structured requests that return JSON or XML files
- Screen scraping: extracting data from the source code of a website using an HTML parser or regular expressions
Coming Up
- Homework / lab solutions available on GitHub
- Exam #01 Friday
- Fill out the form linked here by 5:00 PM on 2-16 if you are interested in joining a student study group
Lecture Notes and Exercises
We will use the packages below.
library(tidyverse)
library(stringr)
library(robotstxt)
library(rvest)
library(httr)
Option #1: Pulling data using an API
APIs (application programming interfaces) are software interfaces that allow two applications to communicate. Website developers build APIs to make their data easily obtainable. HTTP (hypertext transfer protocol) underlies APIs, and the R package httr (loaded above) helps us use this tool.
Basically, you send a request to the website you want data from, and it sends a response back.
This is an extremely quick introduction to help you get started. For more detail, check out the additional resources below.
A list of publicly available APIs is available here.
The website omdbapi.com makes movie data from the Internet Movie Database (IMDb) available online.
First, enter the API key you obtained before class in my_api_key.
my_api_key <- "______"
Let’s use the API to pull information from the 1990 Arnold Schwarzenegger classic Total Recall.
We obtain the URL by searching for Total Recall at omdbapi.com.
The default response format is JSON (JavaScript Object Notation), a standard data format for APIs.
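For instance, the body of the response for Total Recall looks roughly like this (an abbreviated sketch; the real response contains many more fields, and every value arrives as a string):
{
  "Title": "Total Recall",
  "Year": "1990",
  "imdbRating": "7.5",
  "Plot": "When a man goes in to have virtual vacation memories of the planet Mars implanted in his mind, ..."
}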
url <- str_c("http://www.omdbapi.com/?t=Total+Recall&apikey=", my_api_key)
mars <- GET(url)
mars
## Response [http://www.omdbapi.com/?t=Total+Recall&apikey=934b95b4]
## Date: 2021-02-24 12:37
## Status: 200
## Content-Type: application/json; charset=utf-8
## Size: 1.31 kB
details <- content(mars, "parsed") # parse the JSON body into an R list
details$Year
## [1] "1990"
details$imdbRating
## [1] "7.5"
details$Plot
## [1] "When a man goes in to have virtual vacation memories of the planet Mars implanted in his mind, an unexpected and harrowing series of events forces him to go to the planet for real - or is he?"
Let’s build a dataset containing information on classic action films from the 1980s and early ’90s using the API.
# make a vector of movies
movies <- c("Total+Recall", "Predator", "Commando", "The+Running+Man",
            "True+Lies", "Robocop")
# Set up empty tibble
omdb <- tibble(title       = character(),
               rated       = character(),
               genre       = character(),
               actors      = character(),
               metascore   = double(),
               imdb_rating = double(),
               box_office  = double())
# Use for loop to run through API request process 6 times,
# each time filling the next row in the tibble
# - can do max of 1000 GETs per day
for (i in 1:6) {
  url <- str_c("http://www.omdbapi.com/?t=", movies[i],
               "&apikey=", my_api_key)
  onemovie <- GET(url)
  details <- content(onemovie, "parsed") # parse the JSON response
  omdb[i, 1] <- details$Title
  omdb[i, 2] <- details$Rated
  omdb[i, 3] <- details$Genre
  omdb[i, 4] <- details$Actors
  omdb[i, 5] <- parse_number(details$Metascore)
  omdb[i, 6] <- parse_number(details$imdbRating)
  omdb[i, 7] <- parse_number(details$BoxOffice)
}
glimpse(omdb)
## Rows: 6
## Columns: 7
## $ title <chr> "Total Recall", "Predator", "Commando", "The Running Man",…
## $ rated <chr> "R", "R", "R", "R", "R", "R"
## $ genre <chr> "Action, Sci-Fi, Thriller", "Action, Adventure, Sci-Fi, Th…
## $ actors <chr> "Arnold Schwarzenegger, Rachel Ticotin, Sharon Stone, Ronn…
## $ metascore <dbl> 57, 45, 51, 45, 63, 67
## $ imdb_rating <dbl> 7.5, 7.8, 6.7, 6.7, 7.2, 7.5
## $ box_office <dbl> 119412921, 59735548, 35100000, 38122105, 146282411, 534246…
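As an aside, instead of filling a pre-built tibble row by row, we could write a small helper that returns a one-row tibble per movie and bind the rows together with map_dfr() from purrr (loaded with the tidyverse). A minimal sketch, not run here; get_movie() is a hypothetical helper and assumes the movies vector and my_api_key from above:
# hypothetical helper: request one movie, return a one-row tibble
get_movie <- function(movie) {
  url <- str_c("http://www.omdbapi.com/?t=", movie, "&apikey=", my_api_key)
  details <- content(GET(url), "parsed")
  tibble(title       = details$Title,
         rated       = details$Rated,
         genre       = details$Genre,
         actors      = details$Actors,
         metascore   = parse_number(details$Metascore),
         imdb_rating = parse_number(details$imdbRating),
         box_office  = parse_number(details$BoxOffice))
}
# one row per movie, bound into a single tibble
omdb_alt <- map_dfr(movies, get_movie)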
Question: What does parse_number() do in the code chunk above?
The function parse_number() drops non-numeric characters from a number, so RoboCop’s box office of “$53,424,681” becomes 53424681.
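For example:
parse_number("$53,424,681")
## [1] 53424681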
Option #2: Webscraping using rvest
Unfortunately, not all websites have an API. But we can sometimes acquire data by finding content inside the HTML (hypertext markup language) code used to create web pages and web applications.
HTML describes the structure of a web page. Your browser interprets the structure and contents and displays the results.
The basic building blocks are elements, tags, and attributes.
- An element is a component of an HTML document.
- Elements are delimited by tags (a start tag and an end tag).
- Attributes provide additional information about HTML elements.
<a href = "contact.html">Contact us</a>
Say we have access to a simple HTML document like simple.html. How can we extract information and get it in a structured format suitable for analysis (including visualization, wrangling, etc.)?
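We do not reproduce the full file here, but based on the output we will see below, a minimal simple.html might look something like this (a sketch, not the actual file):
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<h1>Using rvest</h1>
<p>To get started...</p>
</body>
</html>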
rvest
The rvest package makes processing and manipulating HTML data straightforward. It’s designed to work with our standard data-wrangling tools, including the pipe %>%.
The core rvest functions are provided below. The three primary functions are listed first, and a quick example follows the list.
- read_html(): read HTML data from a URL or character string
- html_nodes(): select specified nodes from HTML document
- html_text(): extract tag pairs’ content
- html_node(): select a specified node from HTML document
- html_table(): parse an HTML table into a data frame
- html_name(): extract tags’ names
- html_attrs(): extract all of each tag’s attributes
- html_attr(): extract tags’ attribute value by name
Remember simple.html? Let’s read it in using read_html() and store it in an object named page.
page <- read_html("simple.html")
page
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n<h1>Using rvest</h1>\n<p>To get started...</p>\n</body>
This looks like an HTML document. It’s a bit closer to something we can use for data analysis but we’re not there yet.
Let’s extract "<h1>Using rvest</h1>" using html_nodes(). Start with page, and then pass page to html_nodes(), specifying css = "h1". This goes through page, finds all the h1 tags, and pulls out those elements.
Examining h1_nodes shows that we have a node containing "<h1>Using rvest</h1>".
h1_nodes <- page %>%
  html_nodes(css = "h1")
h1_nodes
## {xml_nodeset (1)}
## [1] <h1>Using rvest</h1>
Now extract the contents (“Using rvest”) and the tag name (“h1”). We use html_text() to extract the text from the node and html_name() to extract “h1”.
h1_nodes %>%
  html_text()
## [1] "Using rvest"
h1_nodes %>%
  html_name()
## [1] "h1"
So easy! But this was a very simple case. Most HTML documents are quite a bit more complicated. There may be tables, many links, paragraphs of text, etc.
- How do we handle larger HTML documents?
- How do we know what to provide to html_nodes() to obtain the desired information in a more realistic example?
- Are the functions in rvest vectorized? That is, can we obtain all the content with a particular tag? (See the quick check below.)
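Here is that quick check: a two-paragraph fragment (built with minimal_html(), as above) shows that html_nodes() grabs every matching tag and html_text() extracts the contents of all of them at once:
minimal_html("<p>one</p><p>two</p>") %>%
  html_nodes("p") %>%
  html_text()
## [1] "one" "two"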
In Chrome you can view the HTML document associated with a webpage at “View” -> “Developer” -> “View Source”.
SelectorGadget
SelectorGadget is an open-source tool that allows for easy CSS selector generation and discovery. It is easiest to use as a Chrome extension, but you can also add it as a bookmark.
To use SelectorGadget, navigate to a website of interest (we will use the website https://www.imdb.com/), then click the SelectorGadget bookmark. A box will open in the bottom right corner of the website.
Click on a page element. It will turn green and SelectorGadget will generate a minimal CSS selector for that element, and it will highlight in yellow everything matched by that selector.
Click on a yellow-highlighted element to remove it from the selector; it will then be highlighted in red. Or click an unhighlighted element to add it to the selector.
Through an iterative process of selection and rejection, SelectorGadget will help you discover the appropriate CSS selector.
Top 250 IMDb Movies
We will scrape information from IMDb.
Let’s first check whether scraping is allowed, using paths_allowed() from the robotstxt package.
paths_allowed("http://www.imdb.com")
## 
##  www.imdb.com
## [1] TRUE
paths_allowed("http://www.facebook.com")
## 
##  www.facebook.com
## [1] FALSE
page_top_movies <- read_html("http://www.imdb.com/chart/top")
titles <- page_top_movies %>%
  html_nodes(".titleColumn a") %>%
  html_text()

years <- page_top_movies %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_replace("\\(", "") %>% # remove (
  str_replace("\\)", "") %>% # remove )
  as.numeric()

scores <- page_top_movies %>%
  html_nodes("#main strong") %>%
  html_text() %>%
  as.numeric()
imdb_top_250 <- tibble(
  title = titles,
  year = years,
  score = scores)
Question: What is a limitation of this method of scraping? Hint: consider what will happen if there is a missing value.
Because we scrape each variable separately and combine the vectors by position, a missing value on the page would make the vectors different lengths (or shift them out of alignment), silently mismatching titles, years, and scores.
Data will often require quite a bit of cleaning. The functions from stringr will come in handy here.
glimpse(imdb_top_250)
## Rows: 250
## Columns: 3
## $ title <chr> "The Shawshank Redemption", "The Godfather", "The Godfather: Par…
## $ year <dbl> 1994, 1972, 1974, 2008, 1957, 1993, 2003, 1994, 1966, 2001, 1999…
## $ score <dbl> 9.2, 9.1, 9.0, 9.0, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8.8, 8.8, 8.7,…
Let’s add a rank column using mutate().
imdb_top_250 <- imdb_top_250 %>%
  mutate(rank = row_number())
Here’s another quick example. The website already has the data in table form, so we can use html_table().
url <- "https://www.ssa.gov/oact/babynames/decades/names2000s.html"
paths_allowed(url)
## 
##  www.ssa.gov
## [1] TRUE
top_names <- read_html(url)
tables <- html_nodes(top_names, css = "table")
tables
## {xml_nodeset (1)}
## [1] <table class="t-stripe">\n<thead>\n<div class="fw6 m-pt2 fs3 ta-c">Popula ...
# find the right table
top_names_table <- html_table(tables[[1]], header = TRUE, fill = TRUE)
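A quick sanity check that the parsed data frame looks right (output omitted here):
glimpse(top_names_table)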
Practice
- Which 1995 movies are in the top 250 IMDb movies of all time?
imdb_top_250 %>%
  filter(year == 1995)
## # A tibble: 8 x 4
## title year score rank
## <chr> <dbl> <dbl> <int>
## 1 Se7en 1995 8.6 20
## 2 The Usual Suspects 1995 8.5 33
## 3 Braveheart 1995 8.3 79
## 4 Toy Story 1995 8.3 81
## 5 Heat 1995 8.2 123
## 6 Casino 1995 8.2 138
## 7 Before Sunrise 1995 8.1 188
## 8 La Haine 1995 8 224
- What years have the most movies on the list?
imdb_top_250 %>%
  count(year) %>%
  arrange(desc(n)) %>%
  slice(1:3)
## # A tibble: 3 x 2
## year n
## <dbl> <int>
## 1 1995 8
## 2 2019 8
## 3 1957 6
- Visualize the average yearly score for movies that made it on the top 250 list over time.
imdb_top_250 %>%
  group_by(year) %>%
  summarize(mean_score = mean(score)) %>%
  ggplot(aes(x = year, y = mean_score)) +
  geom_point() +
  geom_line() +
  labs(title = "IMDb scores over time",
       x = "Year", y = "Mean Score") +
  theme_bw()
- Modify the code chunk below to scrape the year, title, and rating of the top 100 most popular TV shows.
page_top_shows <- read_html("http://www.imdb.com/chart/tvmeter")

years <- page_top_shows %>%
  html_nodes(".secondaryInfo:nth-child(2)") %>%
  html_text() %>%
  parse_number()

scores <- page_top_shows %>%
  html_nodes(".imdbRating") %>%
  html_text() %>%
  str_trim() %>%
  parse_number()

names <- page_top_shows %>%
  html_nodes(".titleColumn a") %>%
  html_text()

tvshows <- tibble(
  rank = 1:100,
  year = years,
  score = scores,
  name = names
)