String Manipulation

Song of the Day

It’s All Coming Back to Me Now - Meat Loaf

Main Ideas

Working with string data is essential for a number of data science tasks, including data cleaning, data preparation, and text analysis.
The stringr package in R (part of the tidyverse) contains useful tools for working with character strings.

Coming Up

Exam #01 next week
Homework #02 due Thursday 2-11 at 11:59 PM
“y versus x”
Study group formation
First team lab on Friday 2-12
- lab team assignments have been posted to Sakai
- make sure you have the most recent version of Zoom before lab
- make sure you know your team number and who is on your team

Lecture Notes and Exercises

In addition to the tidyverse, we will use the stringr package.

library(tidyverse)
library(stringr)

stringr provides tools to work with character strings. Functions in stringr have consistent, memorable names.

All begin with str_ (str_count(), str_detect(), str_trim(), etc).
All take a vector of strings as their first argument.
We only have time to explore the basics. I encourage you to explore on your own using the additional resources below.

Preliminaries

Character strings in R are defined by double quotation marks (single quotation marks also work). These can include numbers, letters, punctuation, whitespace, etc.

string1 <- "STA 199 is my favorite class"
string1

## [1] "STA 199 is my favorite class"

You can combine character strings in a vector.

string2 <- c("STA 199", "Data Science", "Duke")
string2

## [1] "STA 199"      "Data Science" "Duke"

Question: What if we want to include a quotation in a string? Why doesn’t the code below work?

string3 <- "I said "Hello" to my class"

To include a double quote in a string, escape it using a backslash. Try it now in the code chunk below and name your string string4.

string4 <- "I said \"Hello\" to my class"

If you want to include an actual backslash, escape it as shown below. This may seem tedious but it will be important later.

string5 <- "\\"

The function writeLines() shows the content of the strings not including escapes. Try it for string1, string2, string4, and string5 in the code chunk below (string3 was never created because its code does not run).

writeLines(string1)

## STA 199 is my favorite class

writeLines(string2)

## STA 199
## Data Science
## Duke

writeLines(string4)

## I said "Hello" to my class

writeLines(string5)

## \
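
The difference between how a string is typed and what it contains can be checked directly. The sketch below is illustrative only; the variable names are made up for this example and are not part of the lecture code.

```r
library(stringr)

# Single quotes can delimit a string that contains double quotes,
# so no escaping is needed in the second version.
quoted_escaped <- "I said \"Hello\" to my class"
quoted_single  <- 'I said "Hello" to my class'
identical(quoted_escaped, quoted_single)  # TRUE

# print() shows the escaped form; writeLines() shows the raw contents.
print(quoted_escaped)
writeLines(quoted_escaped)

# The escaping backslash is not itself a character in the string.
backslash <- "\\"
str_length(backslash)       # 1
str_length(quoted_escaped)  # 26
```

Both spellings produce the same string, which is why writeLines() prints no backslashes for string4.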

U.S. States

To demonstrate the basic functions from stringr we will use a vector of all 50 U.S. states.

states <- c("alabama", "alaska", "arizona", "arkansas", "california", 
            "colorado", "connecticut", "delaware", "florida", "georgia", 
            "hawaii", "idaho", "illinois", "indiana", "iowa", "kansas", 
            "kentucky", "louisiana", "maine", "maryland", "massachusetts", 
            "michigan", "minnesota", "mississippi", "missouri", "montana", 
            "nebraska", "nevada", "new hampshire", "new jersey", 
            "new mexico", "new york", "north carolina", "north dakota", "ohio", 
            "oklahoma", "oregon", "pennsylvania", "rhode island",
            "south carolina", "south dakota", "tennessee", "texas", "utah", 
            "vermont", "virginia", "washington", "west virginia", "wisconsin",
            "wyoming")

`str_length()`

Given a string, return the number of characters.

string1

## [1] "STA 199 is my favorite class"

str_length(string1)

## [1] 28

Given a vector of strings, return the number of characters in each string.

str_length(states)

##  [1]  7  6  7  8 10  8 11  8  7  7  6  5  8  7  4  6  8  9  5  8 13  8  9 11  8
## [26]  7  8  6 13 10 10  8 14 12  4  8  6 12 12 14 12  9  5  4  7  8 10 13  9  7

`str_c()`

Combine two (or more) strings.

str_c("STA 199", "is", "my", "favorite", "class")

## [1] "STA 199ismyfavoriteclass"

Use sep to specify how the strings are separated.

str_c("STA 199", "is", "my", "favorite", "class", sep = " ")

## [1] "STA 199 is my favorite class"
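
str_c() also works element by element on vectors, and its collapse argument joins the elements of a single vector. A minimal sketch, using made-up inputs chosen only for illustration:

```r
library(stringr)

# A length-one string is recycled against the longer vector.
greetings <- str_c("Hello", c("Duke", "Durham"), sep = ", ")
greetings  # "Hello, Duke" "Hello, Durham"

# collapse combines the elements of one vector into a single string.
one_string <- str_c(c("Data", "Science"), collapse = " ")
one_string  # "Data Science"
```

In short, sep goes between the arguments and collapse goes between the elements of the result.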

`str_to_lower()` and `str_to_upper()`

Convert the case of a string from lower to upper or vice versa.

str_to_upper(states)

##  [1] "ALABAMA"        "ALASKA"         "ARIZONA"        "ARKANSAS"      
##  [5] "CALIFORNIA"     "COLORADO"       "CONNECTICUT"    "DELAWARE"      
##  [9] "FLORIDA"        "GEORGIA"        "HAWAII"         "IDAHO"         
## [13] "ILLINOIS"       "INDIANA"        "IOWA"           "KANSAS"        
## [17] "KENTUCKY"       "LOUISIANA"      "MAINE"          "MARYLAND"      
## [21] "MASSACHUSETTS"  "MICHIGAN"       "MINNESOTA"      "MISSISSIPPI"   
## [25] "MISSOURI"       "MONTANA"        "NEBRASKA"       "NEVADA"        
## [29] "NEW HAMPSHIRE"  "NEW JERSEY"     "NEW MEXICO"     "NEW YORK"      
## [33] "NORTH CAROLINA" "NORTH DAKOTA"   "OHIO"           "OKLAHOMA"      
## [37] "OREGON"         "PENNSYLVANIA"   "RHODE ISLAND"   "SOUTH CAROLINA"
## [41] "SOUTH DAKOTA"   "TENNESSEE"      "TEXAS"          "UTAH"          
## [45] "VERMONT"        "VIRGINIA"       "WASHINGTON"     "WEST VIRGINIA" 
## [49] "WISCONSIN"      "WYOMING"

`str_sub()`

Extract parts of a string from start to end, inclusive.

str_sub(states, 1, 4)

##  [1] "alab" "alas" "ariz" "arka" "cali" "colo" "conn" "dela" "flor" "geor"
## [11] "hawa" "idah" "illi" "indi" "iowa" "kans" "kent" "loui" "main" "mary"
## [21] "mass" "mich" "minn" "miss" "miss" "mont" "nebr" "neva" "new " "new "
## [31] "new " "new " "nort" "nort" "ohio" "okla" "oreg" "penn" "rhod" "sout"
## [41] "sout" "tenn" "texa" "utah" "verm" "virg" "wash" "west" "wisc" "wyom"

str_sub(states, -4, -1)

##  [1] "bama" "aska" "zona" "nsas" "rnia" "rado" "icut" "ware" "rida" "rgia"
## [11] "waii" "daho" "nois" "iana" "iowa" "nsas" "ucky" "iana" "aine" "land"
## [21] "etts" "igan" "sota" "ippi" "ouri" "tana" "aska" "vada" "hire" "rsey"
## [31] "xico" "york" "lina" "kota" "ohio" "homa" "egon" "ania" "land" "lina"
## [41] "kota" "ssee" "exas" "utah" "mont" "inia" "gton" "inia" "nsin" "ming"

Practice: Combine str_sub() and str_to_upper() to capitalize the first letter of each state (you can ignore two-word states).

str_sub(states, 1, 1) <- str_to_upper(str_sub(states, 1, 1))
states

##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New hampshire"  "New jersey"     "New mexico"     "New york"      
## [33] "North carolina" "North dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode island"   "South carolina"
## [41] "South dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West virginia" 
## [49] "Wisconsin"      "Wyoming"
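
The two-word states end up only partly capitalized with this approach. One possible way to handle them, not covered in the lecture, is stringr's str_to_title(), which capitalizes the first letter of every word. A small sketch on a few of the two-word states:

```r
library(stringr)

two_word <- c("new hampshire", "north carolina", "rhode island")

# str_to_title() capitalizes the first letter of each word.
titled <- str_to_title(two_word)
titled  # "New Hampshire" "North Carolina" "Rhode Island"
```

Note that str_to_title() also changes the case of the remaining letters, so it is best suited to input like this that starts out in lower case.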

`str_sort()`

Sort a character vector. Below we sort in decreasing alphabetical order.

str_sort(states, decreasing = TRUE)

##  [1] "Wyoming"        "Wisconsin"      "West virginia"  "Washington"    
##  [5] "Virginia"       "Vermont"        "Utah"           "Texas"         
##  [9] "Tennessee"      "South dakota"   "South carolina" "Rhode island"  
## [13] "Pennsylvania"   "Oregon"         "Oklahoma"       "Ohio"          
## [17] "North dakota"   "North carolina" "New york"       "New mexico"    
## [21] "New jersey"     "New hampshire"  "Nevada"         "Nebraska"      
## [25] "Montana"        "Missouri"       "Mississippi"    "Minnesota"     
## [29] "Michigan"       "Massachusetts"  "Maryland"       "Maine"         
## [33] "Louisiana"      "Kentucky"       "Kansas"         "Iowa"          
## [37] "Indiana"        "Illinois"       "Idaho"          "Hawaii"        
## [41] "Georgia"        "Florida"        "Delaware"       "Connecticut"   
## [45] "Colorado"       "California"     "Arkansas"       "Arizona"       
## [49] "Alaska"         "Alabama"
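
Sorting is done character by character, which can be surprising when strings contain numbers. The sketch below uses made-up labels and the numeric argument of str_sort(), which is assumed to be available in your installed version of stringr.

```r
library(stringr)

labs <- c("lab 10", "lab 2", "lab 1")

# Default: compared character by character, so "10" sorts before "2".
default_order <- str_sort(labs)
default_order  # "lab 1" "lab 10" "lab 2"

# numeric = TRUE compares runs of digits as numbers.
numeric_order <- str_sort(labs, numeric = TRUE)
numeric_order  # "lab 1" "lab 2" "lab 10"
```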

Regular Expressions

A regular expression is a sequence of characters that allows you to describe string patterns. We use them to search for patterns.

Examples of usage include the following data science tasks:

extract a phone number from text data
determine if an email address is valid
determine if a password has some specified number of letters, characters, numbers, etc
count the number of times “statistics” occurs in a corpus of text

To demonstrate regular expressions, we will use a vector of North Carolina and the states bordering it.

nc_states <- c("North Carolina", "South Carolina", "Virginia", "Tennessee", 
               "Georgia")

Basic Match

We can match exactly using a basic match.

str_view_all(nc_states, "in")

We can match any character using .

str_view_all(nc_states, ".a")

Question: What if we want to match a period .?

Escape it using a backslash: \. is the regular expression.

But we represent regular expressions using strings, and \ is also an escape symbol in strings.

Escape again!

To create the regular expression \., use the string "\\."

str_view_all(c("a.c", "abc", "def"), "a\\.c")
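
The effect of the double escape can also be checked with str_detect(), which returns TRUE or FALSE instead of highlighting matches. This is a small sketch; fixed() is an alternative from stringr that treats the pattern as literal text rather than as a regular expression.

```r
library(stringr)

x <- c("a.c", "abc", "def")

# Unescaped: . matches any character, so "abc" matches too.
any_char <- str_detect(x, "a.c")
any_char  # TRUE TRUE FALSE

# Escaped: only a literal period matches.
literal_dot <- str_detect(x, "a\\.c")
literal_dot  # TRUE FALSE FALSE

# writeLines() shows the regular expression that the string contains.
writeLines("a\\.c")  # prints a\.c

# fixed() matches the text exactly, with no escaping needed.
fixed_dot <- str_detect(x, fixed("a.c"))
fixed_dot  # TRUE FALSE FALSE
```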

Anchors

Match the start of a string using ^.

str_view_all(nc_states, "^G")

Match the end of a string using $.

str_view_all(nc_states, "a$")
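
Using both anchors together forces the pattern to match the entire string. The vector below is made up for illustration and is not part of the lecture data.

```r
library(stringr)

places <- c("Virginia", "West Virginia", "Virginia Beach")

anywhere <- str_detect(places, "Virginia")    # TRUE TRUE TRUE
at_start <- str_detect(places, "^Virginia")   # TRUE FALSE TRUE
at_end   <- str_detect(places, "Virginia$")   # TRUE TRUE FALSE
whole    <- str_detect(places, "^Virginia$")  # TRUE FALSE FALSE
```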

`str_detect()`

Determine whether each string in a character vector matches a pattern.

nc_states

## [1] "North Carolina" "South Carolina" "Virginia"       "Tennessee"     
## [5] "Georgia"

str_detect(nc_states, "a")

## [1]  TRUE  TRUE  TRUE FALSE  TRUE
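
Because str_detect() returns a logical vector, it combines naturally with sum() and mean(). A minimal sketch; the negate argument is assumed to be available in your installed version of stringr.

```r
library(stringr)

nc_states <- c("North Carolina", "South Carolina", "Virginia", "Tennessee",
               "Georgia")

has_a <- str_detect(nc_states, "a")

# TRUE counts as 1 and FALSE as 0.
n_with_a    <- sum(has_a)   # 4
prop_with_a <- mean(has_a)  # 0.8

# negate = TRUE flags the strings that do NOT match.
no_a <- str_detect(nc_states, "a", negate = TRUE)
no_a  # FALSE FALSE FALSE TRUE FALSE
```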

`str_subset()`

Return only the strings that match a pattern.

nc_states

## [1] "North Carolina" "South Carolina" "Virginia"       "Tennessee"     
## [5] "Georgia"

str_subset(nc_states, "e$")

## [1] "Tennessee"
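
One way to think about str_subset() is as a shortcut for indexing a vector with the result of str_detect(). The sketch below checks that equivalence and also shows str_which(), which returns the positions of the matches rather than the strings.

```r
library(stringr)

nc_states <- c("North Carolina", "South Carolina", "Virginia", "Tennessee",
               "Georgia")

via_subset <- str_subset(nc_states, "e$")
via_detect <- nc_states[str_detect(nc_states, "e$")]
identical(via_subset, via_detect)  # TRUE

# Positions of the matching strings.
match_pos <- str_which(nc_states, "e$")
match_pos  # 4
```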

`str_count()`

Determine how many matches there are in a string.

nc_states

## [1] "North Carolina" "South Carolina" "Virginia"       "Tennessee"     
## [5] "Georgia"

str_count(nc_states, "a")

## [1] 2 2 1 0 1
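
Two details are worth keeping in mind: matches are case sensitive by default, and they do not overlap. A short sketch with inputs chosen for illustration; regex(..., ignore_case = TRUE) is stringr's way of requesting case-insensitive matching.

```r
library(stringr)

# Only lower-case "a" is counted by default.
lower_only <- str_count("Alabama", "a")                             # 3
any_case   <- str_count("Alabama", regex("a", ignore_case = TRUE))  # 4

# Matches do not overlap: after "ana" matches at the start of "banana",
# the search resumes after that match.
non_overlap <- str_count("banana", "ana")  # 1
```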

`str_replace()` and `str_replace_all()`

Replace matches with new strings.

str_replace(nc_states, "a", "-")

## [1] "North C-rolina" "South C-rolina" "Virgini-"       "Tennessee"     
## [5] "Georgi-"

Use str_replace_all() to replace all matches with new strings.

str_replace_all(nc_states, "a", "-")

## [1] "North C-rolin-" "South C-rolin-" "Virgini-"       "Tennessee"     
## [5] "Georgi-"
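
str_replace_all() can also apply several replacements at once when given a named vector, where each name is a pattern and each value is its replacement. The abbreviations below are made up for illustration. str_remove_all() is a related stringr helper that deletes matches.

```r
library(stringr)

# Names are patterns, values are replacements.
abbreviated <- str_replace_all("North Carolina",
                               c("North" = "N.", "Carolina" = "C."))
abbreviated  # "N. C."

# Removing matches is the same as replacing them with an empty string.
removed <- str_remove_all("Tennessee", "e")
removed  # "Tnnss"
identical(removed, str_replace_all("Tennessee", "e", ""))  # TRUE
```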

Many Matches

The regular expressions below match more than one character.

Match any digit using \d or [[:digit:]]
Match any whitespace using \s or [[:space:]]
Match f, g, or h using [fgh]
Match anything but f, g, or h using [^fgh]
Match lower-case letters using [a-z] or [[:lower:]]
Match upper-case letters using [A-Z] or [[:upper:]]
Match alphabetic characters using [A-Za-z] or [[:alpha:]]

Remember these are regular expressions! To match digits you’ll need to escape the \, so use "\\d", not "\d".
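
The character classes above can be tried out with str_count() and str_detect(). This is a small sketch on a single string from earlier in the notes. The last two lines illustrate a caveat: a range written as [A-z] spans the characters that sit between Z and a in the character table, including the underscore.

```r
library(stringr)

x <- "STA 199"

n_digits  <- str_count(x, "\\d")          # 3
n_digits2 <- str_count(x, "[[:digit:]]")  # 3
n_space   <- str_count(x, "\\s")          # 1
n_upper   <- str_count(x, "[[:upper:]]")  # 3
n_other   <- str_count(x, "[^A-Z]")       # 4 (one space and three digits)

# The underscore is not a letter, but it falls inside the range A to z.
wide_range  <- str_detect("_", "[A-z]")     # TRUE
letter_only <- str_detect("_", "[A-Za-z]")  # FALSE
```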

Practice

To practice manipulating strings we will use question and answer data from two recent seasons (2008 - 2009) of the television game show Jeopardy!.

jeopardy <- read_csv("data/questions.csv")

category: category of question
value: value of question in dollars
question: text of question
answer: text of question answer
year: year episode aired

glimpse(jeopardy)

## Rows: 40,865
## Columns: 5
## $ category <chr> "OLD FOLKS IN THEIR 30s", "MOVIES & TV", "A STATE OF COLLEGE…
## $ value    <dbl> 200, 200, 200, 200, 200, 200, 400, 400, 400, 400, 400, 400, …
## $ question <chr> "goop.com is a lifestyles website from this Oscar-winning ac…
## $ answer   <chr> "Gwyneth Paltrow", "Jay Leno", "Texas", "a pride", "a bunny …
## $ year     <dbl> 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, …

Use a single code pipeline and a function from stringr to return all rows where the answer contains the word “Durham”.

jeopardy %>%
  filter(str_detect(answer, "Durham"))

## # A tibble: 3 x 5
##   category     value question                                   answer      year
##   <chr>        <dbl> <chr>                                      <chr>      <dbl>
## 1 BULL          2000 "\"Bull City\", this place's nickname, is… Durham      2009
## 2 BASEBRAWL     1000 "In 1995 10 players were ejected for a br… the Durha…  2009
## 3 MOVIES BY Q…   800 "Crash: \"Man, that ball got out of here … Bull Durh…  2009

Use a single code pipeline and stringr to find the length of all of the answers, sort by decreasing length, and return the five longest answers.

jeopardy %>%
  mutate(answer_length = str_length(answer)) %>%
  arrange(desc(answer_length)) %>%
  select(answer, answer_length) %>% 
  slice(1:5)

## # A tibble: 5 x 2
##   answer                                                           answer_length
##   <chr>                                                                    <int>
## 1 a microphone & the masks of comedy & tragedy (a TV set, a movie…            86
## 2 hiding your light under a bushel (keeping your light underneath…            82
## 3 International Talk Like a Pirate Day (National Talk Like a Pira…            79
## 4 (any of) the (St. Louis) Rams, the Oakland Raiders, or the San …            77
## 5 to take the number that's between 3 and 5 (averaging the 2 midd…            74

What answer has the most digits?

jeopardy %>% 
  mutate(answer_digits = str_count(answer, "\\d")) %>%
  arrange(desc(answer_digits)) %>%
  select(answer, answer_digits) %>%
  slice(1:3)

## # A tibble: 3 x 2
##   answer         answer_digits
##   <chr>                  <int>
## 1 1939 (or 1942)             8
## 2 1952 & 1956                8
## 3 867-5309                   7

Return all rows where the category has a period.

jeopardy %>%
  filter(str_detect(category, "\\."))

## # A tibble: 1,249 x 5
##    category      value question                                answer       year
##    <chr>         <dbl> <chr>                                   <chr>       <dbl>
##  1 I LOVE L.A. …   400 "Kobe called it \"idiotic criticism\" … Shaquille …  2009
##  2 I LOVE L.A. …   800 "A wizard at passing the ball, this La… Magic John…  2009
##  3 I LOVE L.A. …  1200 "This Laker giant was nicknamed \"The … Wilt Chamb…  2009
##  4 I LOVE L.A. …  1600 "This Hall-of-Fame guard & former Lake… Jerry West   2009
##  5 I LOVE L.A. …  2000 "This flashy Lakers forward was nickna… James Wort…  2009
##  6 IT'S AN L.A.…   200 "Wanna live in this city, 90210? in Ju… Beverly Hi…  2009
##  7 IT'S AN L.A.…   400 "Originally the letters in this landma… the Hollyw…  2009
##  8 IT'S AN L.A.…   600 "Good times are Bruin in this district… Westwood     2009
##  9 IT'S AN L.A.…   800 "You can hit the Comedy Store, House o… Sunset Str…  2009
## 10 IT'S AN L.A.…  1000 "Originally called \"Nuestro Pueblo\" … the Watts …  2009
## # … with 1,239 more rows

Using a single code pipeline, return all rows where the question contains a (numeric) year between 1800 and 1999.

jeopardy %>%
  filter(str_detect(question, "1[89]\\d\\d")) %>%
  select(question)

## # A tibble: 6,749 x 1
##    question                                                                     
##    <chr>                                                                        
##  1 "During the War Of 1812, this \"Rip Van Winkle\" author wrote biographies of…
##  2 "(<a href=\"http://www.j-archive.com/media/2009-05-08_DJ_28.jpg\" target=\"_…
##  3 "He reviewed films & TV for the New Republic before his first book, \"Goodby…
##  4 "While he was in Spain in 1959, he wrote \"The Dangerous Summer\", a story a…
##  5 "In 1884 she moved to Red Cloud, Nebraska & later fictionalized it as the to…
##  6 "1980: \"Regular Folks\""                                                    
##  7 "In 1986 Mexico scored as the first country to host this international sport…
##  8 "1932: \"Magnificent Inn\""                                                  
##  9 "1976: \"A Single Colorado Mountain\""                                       
## 10 "1954: \"Dockside\""                                                         
## # … with 6,739 more rows
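
The pattern 1[89]\d\d also matches four such digits inside a longer number. One possible refinement, not used in the lecture, is the word-boundary anchor \b. The sketch below uses a short vector in place of the Jeopardy! data; the second and third strings are made up for illustration.

```r
library(stringr)

questions <- c("In 1884 she moved to Red Cloud",
               "Part number 21850 shipped late",  # hypothetical text
               "Released in 2001")                # hypothetical text

# Without boundaries, "1850" inside "21850" counts as a match.
loose <- str_detect(questions, "1[89]\\d\\d")
loose  # TRUE TRUE FALSE

# \b requires the four digits to stand alone as a word.
strict <- str_detect(questions, "\\b1[89]\\d\\d\\b")
strict  # TRUE FALSE FALSE
```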

Using a single code pipeline, return all rows with answers that begin with three vowels.

jeopardy %>%
  filter(str_detect(answer, "^[AEIOUaeiou][AEIOUaeiou][AEIOUaeiou]")) %>%
  select(answer)

## # A tibble: 7 x 1
##   answer   
##   <chr>    
## 1 Ouija    
## 2 AAA      
## 3 Aeolus   
## 4 Aeon Flux
## 5 Aeolus   
## 6 aioli    
## 7 Ouija
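
The repeated character class can be written more compactly. The sketch below goes slightly beyond the lecture: {3} is a quantifier meaning "exactly three times", and regex(..., ignore_case = TRUE) removes the need to list both cases. The test vector mixes answers from the output above with two made-up non-matches.

```r
library(stringr)

answers <- c("Ouija", "AAA", "aioli", "Aardvark", "Oreo")

long_form  <- str_detect(answers, "^[AEIOUaeiou][AEIOUaeiou][AEIOUaeiou]")
short_form <- str_detect(answers, regex("^[aeiou]{3}", ignore_case = TRUE))

long_form                         # TRUE TRUE TRUE FALSE FALSE
identical(long_form, short_form)  # TRUE
```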

Using a single code pipeline, return all answers that end with ugh but not ough.

jeopardy %>%
  filter(str_detect(answer, "[^o]ugh$")) %>%
  select(answer)

## # A tibble: 5 x 1
##   answer         
##   <chr>          
## 1 (Rush) Limbaugh
## 2 laugh          
## 3 Evelyn Waugh   
## 4 Rush Limbaugh  
## 5 Rush Limbaugh
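
One subtlety: [^o] must match some character, so an answer consisting of only "ugh" would be missed. A possible alternative, not covered in the lecture, is a negative lookbehind, (?<!o), which checks that the preceding character is not o without requiring a character to be there. The vector below is made up apart from the two answers taken from the output above.

```r
library(stringr)

answers <- c("laugh", "through", "ugh", "Evelyn Waugh")

# Requires some non-o character before "ugh".
char_class <- str_detect(answers, "[^o]ugh$")
char_class  # TRUE FALSE FALSE TRUE

# Only requires that "ugh" is not preceded by "o".
lookbehind <- str_detect(answers, "(?<!o)ugh$")
lookbehind  # TRUE FALSE TRUE TRUE
```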

Use a single code pipeline to create a new variable prop_vowel that is the proportion of all letters in each answer that are vowels. What is the highest? Lowest?

jeopardy %>%
  mutate(vowels = str_count(answer, "[AEIOUaeiou]"),
         letters = str_count(answer, "[[:alpha:]]"),
         prop_vowel = vowels / letters) %>%
  select(answer, vowels, letters, prop_vowel) %>%
  arrange(desc(prop_vowel)) %>%
  filter(letters > 5, 
         !is.na(prop_vowel)) %>%
  slice(1:3, (n() - 2):n())

## # A tibble: 6 x 4
##   answer         vowels letters prop_vowel
##   <chr>           <int>   <int>      <dbl>
## 1 a queue             5       6      0.833
## 2 a lei & a lee       6       8      0.75 
## 3 queue / cue         6       8      0.75 
## 4 Lynyrd Skynyrd      0      13      0    
## 5 Lynyrd Skynyrd      0      13      0    
## 6 rhythms             0       7      0
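
The !is.na(prop_vowel) filter matters because an answer with no letters gives 0 / 0, which R reports as NaN. A minimal sketch on a plain vector rather than the Jeopardy! data; the answer "1939" is a simplified, hypothetical stand-in for an all-digit answer.

```r
library(stringr)

answers <- c("a queue", "rhythms", "1939")

n_vowels  <- str_count(answers, "[AEIOUaeiou]")  # 5 0 0
n_letters <- str_count(answers, "[[:alpha:]]")   # 6 7 0
prop_vowel <- n_vowels / n_letters

# The all-digit answer has no letters, so its proportion is NaN.
prop_vowel         # 0.8333333 0.0000000 NaN
is.na(prop_vowel)  # FALSE FALSE TRUE
```

Note that y is not counted as a vowel here, which is why "rhythms" has a proportion of 0.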

String Manipulation

2-11-2020
