   Task Description     episode episode_name air_date contestant score series
 1 1    Prize: Best th… 1       "It's not y… 18 Marc… Charlotte… 1     11
 2 1    Prize: Best th… 1       "It's not y… 18 Marc… Jamali Ma… 2     11
 3 1    Prize: Best th… 1       "It's not y… 18 Marc… Lee Mack   4     11
 4 1    Prize: Best th… 1       "It's not y… 18 Marc… Mike Wozn… 5     11
 5 1    Prize: Best th… 1       "It's not y… 18 Marc… Sarah Ken… 3     11
 6 2    Do the most im… 1       "It's not y… 18 Marc… Charlotte… 2     11
 7 2    Do the most im… 1       "It's not y… 18 Marc… Jamali Ma… 3     11
 8 2    Do the most im… 1       "It's not y… 18 Marc… Lee Mack   3     11
 9 2    Do the most im… 1       "It's not y… 18 Marc… Mike Wozn… 5     11
10 2    Do the most im… 1       "It's not y… 18 Marc… Sarah Ken… 4     11
Task Description        episode    contestant score series
1    Prize: Best thing… Episode 1… Charlotte… 1     11
1    Prize: Best thing… Episode 1… Jamali Ma… 2     11
1    Prize: Best thing… Episode 1… Lee Mack   4     11
1    Prize: Best thing… Episode 1… Mike Wozn… 5     11
1    Prize: Best thing… Episode 1… Sarah Ken… 3     11
2    Do the most…       Episode 1… Charlotte… 2     11
2    Do the most…       Episode 1… Jamali Ma… 3[1]  11
2    Do the most…       Episode 1… Lee Mack   3     11
2    Do the most…       Episode 1… Mike Wozn… 5     11
2    Do the most…       Episode 1… Sarah Ken… 4     11
Currently, the episode column contains entries like:
"Episode 1: It's not your fault. (18 March 2021)"
Next steps
1. Separate episode info into episode number, episode name, and air date columns
2. Clean up the score column
3. Combine data from multiple series
Goal for today: start learning some tools for steps 1 and 2.
How do we want to clean these scores? How should the scores be stored?
Extracting numeric information
Suppose we have the following string:
"3[1]"
And we want to extract just the number “3”:
str_extract("3[1]", "3")
[1] "3"
What if we don’t know which number to extract?
str_extract("3[1]", "\\d")
[1] "3"
str_extract("4[1]", "\\d")
[1] "4"
str_extract("DQ", "\\d")
[1] NA
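The same call can clean a whole vector of scores at once. The sketch below is only an illustration: `raw_scores` is a made-up vector in the style of the score column, and turning non-numeric entries such as "DQ" into NA is one possible choice, not necessarily the right one for the analysis.

```r
library(stringr)

# Hypothetical raw scores in the style of the score column
raw_scores <- c("1", "3[1]", "DQ")

# Keep the first run of digits; entries with no digit become NA
scores <- as.integer(str_extract(raw_scores, "\\d+"))
scores
# [1]  1  3 NA
```

Storing the result as an integer vector makes later summaries (sums, means) straightforward, at the cost of losing the footnote marker and the "DQ" label.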
Regular expressions
A regular expression is a pattern used to find matches in text.
The simplest regular expressions match a specific character or sequence of characters:
str_extract("My cat is 3 years old", "cat")
[1] "cat"
str_extract("My cat is 3 years old", "3")
[1] "3"
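Two details of literal matching are worth seeing once: matching is case-sensitive, and a pattern that does not occur returns NA. A small sketch (the variable names are illustrative; `str_detect()` is the stringr function that returns TRUE/FALSE instead of the matched text):

```r
library(stringr)

sentence <- "My cat is 3 years old"

# Case matters: "Cat" does not occur in the sentence
no_match <- str_extract(sentence, "Cat")
no_match
# [1] NA

# A longer literal sequence is matched exactly as written
phrase <- str_extract(sentence, "years old")
phrase
# [1] "years old"

# str_detect() only reports whether there is a match
has_cat <- str_detect(sentence, "cat")
has_cat
# [1] TRUE
```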
Matching multiple options
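We can also provide multiple options for the match, separated by |. Note that str_extract returns only the first match in each string; str_extract_all returns all of them.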
str_extract("My cat is 3 years old", "cat|dog")
[1] "cat"
str_extract("My dog is 10 years old", "cat|dog")
[1] "dog"
str_extract("My dog is 10 years old, my cat is 3 years old", "cat|dog")
[1] "dog"
str_extract_all("My dog is 10 years old, my cat is 3 years old", "cat|dog")
[[1]]
[1] "dog" "cat"
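As a sketch of how alternation behaves on several strings, and how parentheses limit what the | applies to (the example sentences and variable names here are illustrative):

```r
library(stringr)

sentences <- c(
  "My dog is 10 years old, my cat is 3 years old",
  "My cat is 3 years old"
)

# str_extract_all() returns a list with one character vector per input string
pets <- str_extract_all(sentences, "cat|dog")
lengths(pets)
# [1] 2 1

# Parentheses group the options, so the trailing "s" applies to either one
plural <- str_extract("I have two dogs", "(cat|dog)s")
plural
# [1] "dogs"
```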
Matching groups of characters
What if I want to extract a number?
str_extract("My cat is 3 years old", "\\d")
[1] "3"
What do you think will happen when I run the following code?
str_extract("My dog is 10 years old", "\\d")
[1] "1"
The + symbol in a regular expression means “repeated one or more times”
str_extract("My dog is 10 years old", "\\d+")
[1] "10"
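Besides +, a pattern can ask for an exact number of repetitions with {n}. A short sketch using the episode string that appears later in these slides:

```r
library(stringr)

x <- "Episode 2: The pie whisperer. (4 August 2015)"

# {4} means "exactly four times": the first run of four digits is the year
year <- str_extract(x, "\\d{4}")
year
# [1] "2015"

# With +, str_extract_all() returns every run of digits in the string
all_numbers <- str_extract_all(x, "\\d+")[[1]]
all_numbers
# [1] "2"    "4"    "2015"
```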
Extracting from multiple strings
strings <- c("My cat is 3 years old", "My dog is 10 years old")
str_extract(strings, "\\d+")
[1] "3" "10"
Extracting episode information
Currently, the episode column contains entries like:
"Episode 2: The pie whisperer. (4 August 2015)"
How would I extract just the episode number?
str_extract("Episode 2: The pie whisperer. (4 August 2015)", "\\d+")
[1] "2"
How would I extract the episode name?
Pattern to match: anything that starts with a colon (:) and ends with a period (.)
Note: The . character in a regex means “any single character”
str_extract("Episode 2: The pie whisperer. (4 August 2015)", ".")
[1] "E"
str_extract("Episode 2: The pie whisperer. (4 August 2015)", ".+")
[1] "Episode 2: The pie whisperer. (4 August 2015)"
To match a literal period, we escape it with a backslash. Inside an R string the backslash must itself be escaped, so the pattern is written "\\.":
str_extract("Episode 2: The pie whisperer. (4 August 2015)", "\\.")
[1] "."
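The double backslash can be confusing, so here is a small sketch of what the regex engine actually receives, along with two alternatives that avoid backslashes. `fixed()` is stringr's helper for matching text literally; putting the period in a character class is a general regex idiom.

```r
library(stringr)

x <- "Episode 2: The pie whisperer. (4 August 2015)"

# The R string "\\." contains two characters: a backslash and a period
writeLines("\\.")
# \.

escaped  <- str_extract(x, "\\.")        # escape the period
literal  <- str_extract(x, fixed("."))   # treat the whole pattern as plain text
in_class <- str_extract(x, "[.]")        # inside [ ], a period is just a period
```

All three calls return ".".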
Getting everything from the : through the . (both are included in the match):
str_extract("Episode 2: The pie whisperer. (4 August 2015)", ":.+\\.")
[1] ": The pie whisperer."
To get only the text between the : and the ., without the delimiters themselves, use lookarounds:
str_extract("Episode 2: The pie whisperer. (4 August 2015)", "(?<=: ).+(?=\\.)")
[1] "The pie whisperer"
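Because str_extract is vectorized, the same pattern works on several episode strings at once. A sketch using the two episode strings shown in these slides (`episodes` and `episode_names` are illustrative names):

```r
library(stringr)

episodes <- c(
  "Episode 1: It's not your fault. (18 March 2021)",
  "Episode 2: The pie whisperer. (4 August 2015)"
)

# Text after ": " and before the last period in each string
episode_names <- str_extract(episodes, "(?<=: ).+(?=\\.)")
episode_names
# [1] "It's not your fault" "The pie whisperer"
```

Note that .+ is greedy, so the match runs up to the last period in the string. That is what we want here, but it would overshoot if the text after the episode name also contained a period.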
Lookbehinds
(?<=...) is a positive lookbehind. It matches only text that is immediately preceded by the given pattern; the preceding pattern itself is not part of the match.
str_extract("Episode 2: The pie whisperer. (4 August 2015)", "(?<=: ).+")
[1] "The pie whisperer. (4 August 2015)"
str_extract("Episode 2: The pie whisperer. (4 August 2015)", "(?<=\\. ).+")
[1] "(4 August 2015)"
Lookaheads
(?=...) is a positive lookahead. It matches only text that is immediately followed by the given pattern; the following pattern itself is not part of the match.
str_extract("Episode 2: The pie whisperer. (4 August 2015)", ".+(?=\\.)")
[1] "Episode 2: The pie whisperer"
str_extract("Episode 2: The pie whisperer. (4 August 2015)", ".+(?=:)")
[1] "Episode 2"
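Lookarounds are not the only way to keep part of a match. As an alternative sketch (a different technique from the one used in these slides), stringr's str_match() returns the text captured by each pair of parentheses as a column of a character matrix:

```r
library(stringr)

x <- "Episode 2: The pie whisperer. (4 August 2015)"

# Column 1 is the full match; columns 2-4 are the three capture groups
parts <- str_match(x, "Episode (\\d+): (.+)\\. \\((.+)\\)")

episode_number <- parts[1, 2]
episode_name   <- parts[1, 3]
air_date       <- parts[1, 4]
```

Here episode_number is "2", episode_name is "The pie whisperer", and air_date is "4 August 2015".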
Extracting air date
I want to extract just the air date. What pattern do I want to match?
str_extract("Episode 2: The pie whisperer. (4 August 2015)", )
str_extract("Episode 2: The pie whisperer. (4 August 2015)", "(?<=\\().+(?=\\))")
[1] "4 August 2015"
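The extracted air date is still a character string. If we want to sort or filter by date, one option is to convert it with base R's as.Date(). This is a sketch under one assumption: the "%B" format code reads full month names in the current locale, so it expects an English locale for "August".

```r
library(stringr)

x <- "Episode 2: The pie whisperer. (4 August 2015)"

# Everything between the opening and closing parentheses
air_date_text <- str_extract(x, "(?<=\\().+(?=\\))")

# %d = day of month, %B = full month name (locale-dependent), %Y = 4-digit year
air_date <- as.Date(air_date_text, format = "%d %B %Y")
```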
Wrangling the episode info
Currently:
# A tibble: 270 × 1
episode
<chr>
1 Episode 1: It's not your fault. (18 March 2021)
2 Episode 1: It's not your fault. (18 March 2021)
3 Episode 1: It's not your fault. (18 March 2021)
4 Episode 1: It's not your fault. (18 March 2021)
5 Episode 1: It's not your fault. (18 March 2021)
6 Episode 1: It's not your fault. (18 March 2021)
7 Episode 1: It's not your fault. (18 March 2021)
8 Episode 1: It's not your fault. (18 March 2021)
9 Episode 1: It's not your fault. (18 March 2021)
10 Episode 1: It's not your fault. (18 March 2021)
# ℹ 260 more rows
# A tibble: 270 × 3
episode episode_name air_date
<chr> <chr> <chr>
1 1 It's not your fault 18 March 2021
2 1 It's not your fault 18 March 2021
3 1 It's not your fault 18 March 2021
4 1 It's not your fault 18 March 2021
5 1 It's not your fault 18 March 2021
6 1 It's not your fault 18 March 2021
7 1 It's not your fault 18 March 2021
8 1 It's not your fault 18 March 2021
9 1 It's not your fault 18 March 2021
10 1 It's not your fault 18 March 2021
# ℹ 260 more rows