read_html("https://taskmaster.fandom.com/wiki/Series_11") |>
html_element(".tmtable") |>
html_table()
# A tibble: 75 × 7
Task Description `Charlotte Ritchie` `Jamali Maddix` `Lee Mack`
<chr> <chr> <chr> <chr> <chr>
1 Episode 1: It's n… Episode 1:… Episode 1: It's no… Episode 1: It'… Episode 1…
2 1 Prize: Bes… 1 2 4
3 2 Do the mos… 2 3[1] 3
4 3 Catch the … DQ 1 5
5 4 Deliver al… 2 1 5
6 5 Live: Stac… 0 0 0
7 Total Total 5 7 17
8 Episode 2: The Lu… Episode 2:… Episode 2: The Lur… Episode 2: The… Episode 2…
9 1 Prize: Bes… 5 1 2
10 2 Make the b… 0 5 0
# ℹ 65 more rows
# ℹ 2 more variables: `Mike Wozniak` <chr>, `Sarah Kendall` <chr>
Here’s what we have so far:
Task Description `Charlotte Ritchie` `Jamali Maddix` `Lee Mack` ...
Episode 1: It's… Episode 1:… Episode 1: It's no… Episode 1: It'… Episode 1…
1 Prize: Bes… 1 2 4
2 Do the mos… 2 3[1] 3
3 Catch the … DQ 1 5
4 Deliver al… 2 1 5
5 Live: Stac… 0 0 0
Total Total 5 7 17
Episode 2: The … Episode 2:… Episode 2: The Lur… Episode 2: The… Episode 2…
1 Prize: Bes… 5 1 2
2 Make the b… 0 5 0
What changes do you think we should make to this format?
What we ultimately want:
Task Description episode episode_name air_date contestant score series
1 1 Prize: Best th… 1 "It's not y… 18 Marc… Charlotte… 1 11
2 1 Prize: Best th… 1 "It's not y… 18 Marc… Jamali Ma… 2 11
3 1 Prize: Best th… 1 "It's not y… 18 Marc… Lee Mack 4 11
4 1 Prize: Best th… 1 "It's not y… 18 Marc… Mike Wozn… 5 11
5 1 Prize: Best th… 1 "It's not y… 18 Marc… Sarah Ken… 3 11
6 2 Do the most im… 1 "It's not y… 18 Marc… Charlotte… 2 11
7 2 Do the most im… 1 "It's not y… 18 Marc… Jamali Ma… 3[1] 11
8 2 Do the most im… 1 "It's not y… 18 Marc… Lee Mack 3 11
9 2 Do the most im… 1 "It's not y… 18 Marc… Mike Wozn… 5 11
10 2 Do the most im… 1 "It's not y… 18 Marc… Sarah Ken… 4 11
colnames: Task, Description, episode, episode_name, air_date, contestant, score, series
Intermediate step:
Task Description episode contestant score series
1 Prize: Best thing… Episode 1… Charlotte… 1 11
1 Prize: Best thing… Episode 1… Jamali Ma… 2 11
1 Prize: Best thing… Episode 1… Lee Mack 4 11
1 Prize: Best thing… Episode 1… Mike Wozn… 5 11
1 Prize: Best thing… Episode 1… Sarah Ken… 3 11
2 Do the most… Episode 1… Charlotte… 2 11
2 Do the most… Episode 1… Jamali Ma… 3 11
2 Do the most… Episode 1… Lee Mack 3 11
2 Do the most… Episode 1… Mike Wozn… 5 11
2 Do the most… Episode 1… Sarah Ken… 4 11
Here’s what we have so far:
Task Description `Charlotte Ritchie` `Jamali Maddix` `Lee Mack` ...
Episode 1: It's… Episode 1:… Episode 1: It's no… Episode 1: It'… Episode 1…
1 Prize: Bes… 1 2 4
2 Do the mos… 2 3[1] 3
3 Catch the … DQ 1 5
4 Deliver al… 2 1 5
5 Live: Stac… 0 0 0
Total Total 5 7 17
Episode 2: The … Episode 2:… Episode 2: The Lur… Episode 2: The… Episode 2…
1 Prize: Bes… 5 1 2
2 Make the b… 0 5 0
What wrangling steps do we need to take?
Step 1: create a separate column for episode:
# A tibble: 75 × 2
Task episode
<chr> <chr>
1 Episode 1: It's not your fault. (18 March 2021) Episode 1: It's …
2 1 <NA>
3 2 <NA>
4 3 <NA>
5 4 <NA>
6 5 <NA>
7 Total <NA>
8 Episode 2: The Lure of the Treacle Puppies. (25 March 2021) Episode 2: The L…
9 1 <NA>
10 2 <NA>
# ℹ 65 more rows
Step 2: fill in the episodes
Step 3: remove the “Total” and “Episode” rows in the Task column
What R code would we use to remove these rows?
Step 3: remove the “Total” and “Episode” rows in the Task column
read_html("https://taskmaster.fandom.com/wiki/Series_11") |>
html_element(".tmtable") |>
html_table() |>
mutate(episode = ifelse(startsWith(Task, "Episode"), Task, NA)) |>
fill(episode, .direction = "down") |>
filter(!startsWith(Task, "Episode"),
!(Task %in% c("Total", "Grand Total")))
# A tibble: 54 × 8
Task Description `Charlotte Ritchie` `Jamali Maddix` `Lee Mack`
<chr> <chr> <chr> <chr> <chr>
1 1 Prize: Best thing you c… 1 2 4
2 2 Do the most impressive … 2 3[1] 3
3 3 Catch the rat. You must… DQ 1 5
4 4 Deliver all the plates … 2 1 5
5 5 Live: Stack your bucket… 0 0 0
6 1 Prize: Best drinking ve… 5 1 2
7 2 Make the balloon hover … 0 5 0
8 3 Team: Have an argument.… 2 2 5
9 4 Make the house haunted. 3 3 4
10 5 Live: Correctly guess y… 2 2 5
# ℹ 44 more rows
# ℹ 3 more variables: `Mike Wozniak` <chr>, `Sarah Kendall` <chr>,
# episode <chr>
Step 4: Pivot
# A tibble: 54 × 8
Task Description `Charlotte Ritchie` `Jamali Maddix` `Lee Mack`
<chr> <chr> <chr> <chr> <chr>
1 1 Prize: Best thing you c… 1 2 4
2 2 Do the most impressive … 2 3[1] 3
3 3 Catch the rat. You must… DQ 1 5
4 4 Deliver all the plates … 2 1 5
5 5 Live: Stack your bucket… 0 0 0
6 1 Prize: Best drinking ve… 5 1 2
7 2 Make the balloon hover … 0 5 0
8 3 Team: Have an argument.… 2 2 5
9 4 Make the house haunted. 3 3 4
10 5 Live: Correctly guess y… 2 2 5
# ℹ 44 more rows
# ℹ 3 more variables: `Mike Wozniak` <chr>, `Sarah Kendall` <chr>,
# episode <chr>
How should we pivot this data?
Step 4: Pivot
read_html("https://taskmaster.fandom.com/wiki/Series_11") |>
html_element(".tmtable") |>
html_table() |>
mutate(episode = ifelse(startsWith(Task, "Episode"), Task, NA)) |>
fill(episode, .direction = "down") |>
filter(!startsWith(Task, "Episode"),
!(Task %in% c("Total", "Grand Total"))) |>
pivot_longer(cols = -c(Task, Description, episode),
names_to = "contestant",
values_to = "score") |>
mutate(series = 11)
# A tibble: 270 × 6
Task Description episode contestant score series
<chr> <chr> <chr> <chr> <chr> <dbl>
1 1 Prize: Best thing you can carry, but o… Episod… Charlotte… 1 11
2 1 Prize: Best thing you can carry, but o… Episod… Jamali Ma… 2 11
3 1 Prize: Best thing you can carry, but o… Episod… Lee Mack 4 11
4 1 Prize: Best thing you can carry, but o… Episod… Mike Wozn… 5 11
5 1 Prize: Best thing you can carry, but o… Episod… Sarah Ken… 3 11
6 2 Do the most impressive thing under the… Episod… Charlotte… 2 11
7 2 Do the most impressive thing under the… Episod… Jamali Ma… 3[1] 11
8 2 Do the most impressive thing under the… Episod… Lee Mack 3 11
9 2 Do the most impressive thing under the… Episod… Mike Wozn… 5 11
10 2 Do the most impressive thing under the… Episod… Sarah Ken… 4 11
# ℹ 260 more rows