Lecture 21: Intro to regular expressions

Last time: scraping and wrangling Taskmaster data

What we ultimately want:

   Task  Description     episode episode_name air_date contestant score series
 1 1     Prize: Best th… 1       "It's not y… 18 Marc… Charlotte… 1         11
 2 1     Prize: Best th… 1       "It's not y… 18 Marc… Jamali Ma… 2         11
 3 1     Prize: Best th… 1       "It's not y… 18 Marc… Lee Mack   4         11
 4 1     Prize: Best th… 1       "It's not y… 18 Marc… Mike Wozn… 5         11
 5 1     Prize: Best th… 1       "It's not y… 18 Marc… Sarah Ken… 3         11
 6 2     Do the most im… 1       "It's not y… 18 Marc… Charlotte… 2         11
 7 2     Do the most im… 1       "It's not y… 18 Marc… Jamali Ma… 3         11
 8 2     Do the most im… 1       "It's not y… 18 Marc… Lee Mack   3         11
 9 2     Do the most im… 1       "It's not y… 18 Marc… Mike Wozn… 5         11
10 2     Do the most im… 1       "It's not y… 18 Marc… Sarah Ken… 4         11

Last time: scraping and wrangling Taskmaster data

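library(tidyverse)  # dplyr and tidyr verbs used below
library(rvest)      # read_html(), html_element(), html_table()

results <- read_html("https://taskmaster.fandom.com/wiki/Series_11") |>
  html_element(".tmtable") |>
  html_table() |>
  # episode header rows have "Episode ..." in the Task column; copy it to a new column
  mutate(episode = ifelse(startsWith(Task, "Episode"), Task, NA)) |>
  # carry each episode label down to the task rows below it
  fill(episode, .direction = "down") |>
  # drop the episode header rows and the totals rows
  filter(!startsWith(Task, "Episode"),
         !(Task %in% c("Total", "Grand Total"))) |>
  # one row per task and contestant
  pivot_longer(cols = -c(Task, Description, episode),
               names_to = "contestant",
               values_to = "score") |>
  mutate(series = 11)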

What we have so far

   Task  Description         episode    contestant score series
   1     Prize: Best thing…  Episode 1… Charlotte… 1         11
   1     Prize: Best thing…  Episode 1… Jamali Ma… 2         11
   1     Prize: Best thing…  Episode 1… Lee Mack   4         11
   1     Prize: Best thing…  Episode 1… Mike Wozn… 5         11
   1     Prize: Best thing…  Episode 1… Sarah Ken… 3         11
   2     Do the most…        Episode 1… Charlotte… 2         11
   2     Do the most…        Episode 1… Jamali Ma… 3[1]      11
   2     Do the most…        Episode 1… Lee Mack   3         11
   2     Do the most…        Episode 1… Mike Wozn… 5         11
   2     Do the most…        Episode 1… Sarah Ken… 4         11

Currently, the episode column contains entries like

"Episode 1: It's not your fault. (18 March 2021)"

Next steps

  1. Separate episode info into episode number, episode name, and air date columns
  2. Clean up the score column
  3. Combine data from multiple series

Goal for today: start learning some tools for steps 1 and 2.

Cleaning the score column

table(results$score)

   –    ✔    ✘    0    1    2    3 3[1] 3[2]    4 4[2]    5   DQ 
   7    1    1   11   37   42   48    1    3   50    1   55   13 

How do we want to clean these scores? How should the scores be stored?

Extracting numeric information

Suppose we have the following string:

"3[1]"

And we want to extract just the number “3”:

str_extract("3[1]", "3")
[1] "3"

What if we don’t know which number to extract? The pattern "\\d" matches any single digit:

str_extract("3[1]", "\\d")
[1] "3"
str_extract("4[1]", "\\d")
[1] "4"
str_extract("DQ", "\\d")
[1] NA
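
One way to apply this to the whole score column is sketched below. The vector scores is a hypothetical sample of the values seen in table(results$score), and storing entries with no digit (such as DQ) as NA is an assumption for illustration, not necessarily how we will end up storing them.

```r
library(stringr)

# Hypothetical sample of values from the score column
scores <- c("0", "3", "3[1]", "4[2]", "5", "DQ")

# Keep the first digit (which drops footnote markers such as [1]),
# then convert to integer. Entries with no digit, like "DQ", become NA.
clean_scores <- as.integer(str_extract(scores, "\\d"))
clean_scores
```

Whether DQ should really become NA, or something else such as 0, is exactly the question posed above.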

Regular expressions

A regular expression is a pattern used to find matches in text.

The simplest regular expressions match a specific character or sequence of characters:

str_extract("My cat is 3 years old", "cat")
[1] "cat"
str_extract("My cat is 3 years old", "3")
[1] "3"

Matching multiple options

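We can also provide multiple options for the match, separated by |. str_extract() returns only the first match in each string; str_extract_all() returns every match.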

str_extract("My cat is 3 years old", "cat|dog")
[1] "cat"
str_extract("My dog is 10 years old", "cat|dog")
[1] "dog"
str_extract("My dog is 10 years old, my cat is 3 years old", 
            "cat|dog")
[1] "dog"
str_extract_all("My dog is 10 years old, my cat is 3 years old", 
                "cat|dog")
[[1]]
[1] "dog" "cat"

Matching groups of characters

What if I want to extract a number?

str_extract("My cat is 3 years old", "\\d")
[1] "3"

What do you think will happen when I run the following code?

str_extract("My dog is 10 years old", "\\d")
[1] "1"

"\\d" matches exactly one digit, so only the first digit of 10 is returned.

Matching groups of characters

The + symbol in a regular expression means “the preceding element, repeated one or more times”, so "\\d+" matches a whole run of digits:

str_extract("My dog is 10 years old", "\\d+")
[1] "10"

Extracting from multiple strings

strings <- c("My cat is 3 years old", "My dog is 10 years old")
str_extract(strings, "\\d+")
[1] "3"  "10"
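
Combining "\\d+" with str_extract_all() from earlier returns every number in a string rather than only the first. A small sketch, reusing the example sentence from above (the variable names are just for illustration):

```r
library(stringr)

sentence <- "My dog is 10 years old, my cat is 3 years old"

# With the +, each run of digits is one match
all_numbers <- str_extract_all(sentence, "\\d+")[[1]]
all_numbers

# Without the +, every digit is a separate match
all_digits <- str_extract_all(sentence, "\\d")[[1]]
all_digits

# The matches are strings; convert them when numbers are needed
as.numeric(all_numbers)
```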

Extracting episode information

Currently, the episode column contains entries like:

"Episode 2: The pie whisperer. (4 August 2015)"

How would I extract just the episode number?

str_extract("Episode 2: The pie whisperer. (4 August 2015)", "\\d+")
[1] "2"

How would I extract the episode name?

Pattern to match: anything that starts with a : and ends with a .

Note: The . character in a regex matches any single character (by default, anything except a newline).

str_extract("Episode 2: The pie whisperer. (4 August 2015)", ".")
[1] "E"
str_extract("Episode 2: The pie whisperer. (4 August 2015)", ".+")
[1] "Episode 2: The pie whisperer. (4 August 2015)"

To match a literal period, we escape it with a backslash. Inside an R string the backslash itself has to be doubled, so the pattern is written "\\.":

str_extract("Episode 2: The pie whisperer. (4 August 2015)", "\\.")
[1] "."

Extracting episode information

Getting everything from the : through the .

str_extract("Episode 2: The pie whisperer. (4 August 2015)", 
            ":.+\\.")
[1] ": The pie whisperer."

Extracting episode information

Getting only what is between the : and the . (not the : and . themselves), using a lookbehind and a lookahead:

str_extract("Episode 2: The pie whisperer. (4 August 2015)", 
            "(?<=: ).+(?=\\.)")
[1] "The pie whisperer"
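
One thing to watch: .+ matches as much as it can, so the pattern above stops at the last period in the string. That is fine for this episode, but it could over-match if a period ever appeared in the date. The sketch below uses a made-up episode string (not from the wiki) to show the issue, along with one possible fix: requiring the period to be followed by a space and an opening parenthesis.

```r
library(stringr)

# Hypothetical episode string with extra periods, for illustration only
tricky <- "Episode 9: Hello. World. (1 Jan. 2020)"

# Stops at the last period in the string, which is inside the date
loose <- str_extract(tricky, "(?<=: ).+(?=\\.)")
loose

# Stops at the period that is followed by " ("
strict <- str_extract(tricky, "(?<=: ).+(?=\\. \\()")
strict
```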

Lookbehinds

(?<=...) is a positive lookbehind. It matches only at positions that are preceded by the pattern inside the parentheses, and that preceding text is not included in the match.

str_extract("Episode 2: The pie whisperer. (4 August 2015)", 
            "(?<=: ).+")
[1] "The pie whisperer. (4 August 2015)"
str_extract("Episode 2: The pie whisperer. (4 August 2015)", 
            "(?<=\\. ).+")
[1] "(4 August 2015)"

Lookaheads

(?=...) is a positive lookahead. It matches only at positions that are followed by the pattern inside the parentheses, and that following text is not included in the match.

str_extract("Episode 2: The pie whisperer. (4 August 2015)", 
            ".+(?=\\.)")
[1] "Episode 2: The pie whisperer"
str_extract("Episode 2: The pie whisperer. (4 August 2015)", 
            ".+(?=:)")
[1] "Episode 2"
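
Lookarounds are also handy for the score column: the footnote marker in an entry like "3[1]" is the number between the square brackets. A minimal sketch, assuming footnotes always look like [digits]; the scores vector is a hypothetical sample, and the square brackets are escaped because they are special characters in a regex.

```r
library(stringr)

# Hypothetical sample of values from the score column
scores <- c("3", "3[1]", "4[2]", "DQ")

# Digits preceded by "[" and followed by "]";
# the brackets themselves are not part of the match
footnote <- str_extract(scores, "(?<=\\[)\\d+(?=\\])")
footnote
```

Entries without a footnote come back as NA.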

Extracting air date

I want to extract just the air date. What pattern do I want to match?

str_extract("Episode 2: The pie whisperer. (4 August 2015)", )

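Extracting air date

One answer: match everything after the opening parenthesis and before the closing parenthesis. Both are escaped because ( and ) are special characters in a regex.

str_extract("Episode 2: The pie whisperer. (4 August 2015)",
            "(?<=\\().+(?=\\))")
[1] "4 August 2015"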

Wrangling the episode info

Currently:

# A tibble: 270 × 1
   episode                                        
   <chr>                                          
 1 Episode 1: It's not your fault. (18 March 2021)
 2 Episode 1: It's not your fault. (18 March 2021)
 3 Episode 1: It's not your fault. (18 March 2021)
 4 Episode 1: It's not your fault. (18 March 2021)
 5 Episode 1: It's not your fault. (18 March 2021)
 6 Episode 1: It's not your fault. (18 March 2021)
 7 Episode 1: It's not your fault. (18 March 2021)
 8 Episode 1: It's not your fault. (18 March 2021)
 9 Episode 1: It's not your fault. (18 March 2021)
10 Episode 1: It's not your fault. (18 March 2021)
# ℹ 260 more rows

Wrangling the episode info

One option:

results |>
  mutate(episode_name = str_extract(episode,
                                    "(?<=: ).+(?=\\.)"),
         air_date = str_extract(episode, "(?<=\\().+(?=\\))"),
         episode = str_extract(episode, "\\d+"))
# A tibble: 270 × 3
   episode episode_name        air_date     
   <chr>   <chr>               <chr>        
 1 1       It's not your fault 18 March 2021
 2 1       It's not your fault 18 March 2021
 3 1       It's not your fault 18 March 2021
 4 1       It's not your fault 18 March 2021
 5 1       It's not your fault 18 March 2021
 6 1       It's not your fault 18 March 2021
 7 1       It's not your fault 18 March 2021
 8 1       It's not your fault 18 March 2021
 9 1       It's not your fault 18 March 2021
10 1       It's not your fault 18 March 2021
# ℹ 260 more rows

Wrangling the episode info

Another option:

results |>
  separate_wider_regex(episode, 
                       patterns = c(".+ ", 
                                    episode = "\\d+", 
                                    ": ", 
                                    episode_name = ".+", 
                                    "\\. \\(", 
                                    air_date = ".+", 
                                    "\\)"))
# A tibble: 270 × 3
   episode episode_name        air_date     
   <chr>   <chr>               <chr>        
 1 1       It's not your fault 18 March 2021
 2 1       It's not your fault 18 March 2021
 3 1       It's not your fault 18 March 2021
 4 1       It's not your fault 18 March 2021
 5 1       It's not your fault 18 March 2021
 6 1       It's not your fault 18 March 2021
 7 1       It's not your fault 18 March 2021
 8 1       It's not your fault 18 March 2021
 9 1       It's not your fault 18 March 2021
10 1       It's not your fault 18 March 2021
# ℹ 260 more rows
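
To try both options without scraping the wiki, here is a self-contained sketch. The tibble toy is a hypothetical stand-in for results, holding only the two episode strings that appear in these slides; the patterns are the same ones used above.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical stand-in for the scraped `results` data
toy <- tibble(
  episode = c("Episode 1: It's not your fault. (18 March 2021)",
              "Episode 2: The pie whisperer. (4 August 2015)")
)

# Option 1: one str_extract() call per new column.
# episode is overwritten last, so the first two calls still see the full string.
by_extract <- toy |>
  mutate(episode_name = str_extract(episode, "(?<=: ).+(?=\\.)"),
         air_date = str_extract(episode, "(?<=\\().+(?=\\))"),
         episode = str_extract(episode, "\\d+"))

# Option 2: describe the whole string piece by piece.
# Named pieces become columns; unnamed pieces are matched and dropped.
by_separate <- toy |>
  separate_wider_regex(episode,
                       patterns = c(".+ ",
                                    episode = "\\d+",
                                    ": ",
                                    episode_name = ".+",
                                    "\\. \\(",
                                    air_date = ".+",
                                    "\\)"))

by_extract
by_separate
```

Both versions leave episode as a character column; wrapping it in as.integer() would be a natural next step if we want to sort or compare episode numbers.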