Homework 9
Due: Friday, November 10, 11:00am on Canvas
Instructions:
- Download the HW 9 template, and open the template (a Quarto document) in RStudio.
- Put your name in the file header
- Click Render
- Type all code and answers in the document (using ### for section headings and #### for question headings)
- Render early and often to catch any errors!
- When you are finished, submit the final rendered HTML to Canvas
Resources: In addition to the class notes and activities, I recommend reading the following resources:
- Chapter 16 in R for Data Science (2nd edition)
- Chapter 25 in R for Data Science (2nd edition)
The Great British Bake Off
The Great British Bake Off (called the Great British Baking Show in the US because of trademark issues with Pillsbury – yes, really) is a British competition baking show. Each episode involves three challenges: a signature bake, a technical challenge, and a showstopper, all centered around a theme (bread week, cake week, pastry week, etc.). The participant who performs worst is eliminated (with a couple rare exceptions), and the participant who performs best is awarded “star baker” for the week.
The goal of this assignment is to use web scraping and data wrangling (including working with strings) to collect and analyze data about the show. We will scrape the data from Wikipedia articles about the show.
Getting the episode names
We will begin with series 2 (series 1 had a slightly different format, so we will ignore it for now). The Wikipedia article on series 2 can be found at:
https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_(series_2)
If you scroll down, you will notice that there is an “Episodes” section, which contains the headings “Episode 1: Cakes”, “Episode 2: Tarts”, etc.
Question 1
Use the Inspect tool in Chrome to find the selector for these episode titles, then use the read_html, html_elements, and html_text2 functions to create a vector of the episode titles. The output should look like this:
## [1] "Episode 1: Cakes" "Episode 2: Tarts"
## [3] "Episode 3: Bread" "Episode 4: Biscuits"
## [5] "Episode 5: Pies" "Episode 6: Desserts (Quarterfinals)"
## [7] "Episode 7: Pâtisserie (Semi-final)" "Episode 8: Final"
Question 2
Create a data frame which contains the episode information in the following format:
## # A tibble: 8 × 2
## episode episode_name
## <chr> <chr>
## 1 1 "Cakes"
## 2 2 "Tarts"
## 3 3 "Bread"
## 4 4 "Biscuits"
## 5 5 "Pies"
## 6 6 "Desserts "
## 7 7 "Pâtisserie "
## 8 8 "Final"
Follow the example of using separate_wider_regex from the slides.
Hint: the too_few = "align_start" option in separate_wider_regex is useful for handling optional patterns at the end of a string, like parentheses…
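To see what the hint means, here is a minimal sketch on made-up strings (the data frame df, the column item, and the pattern names are all hypothetical, and this is not the episode data). Only the second string has the optional parenthesised part, so with too_few = "align_start" the shorter string is matched against the leading patterns and the missing piece is padded with NA instead of raising an error.

```r
library(tibble)
library(tidyr)

# Made-up strings: the note in parentheses is optional
df <- tibble(item = c("apple 3", "pear 5 (bruised)"))

result <- separate_wider_regex(
  df,
  item,
  patterns = c(
    name = "[a-z]+",   # named patterns become columns
    " ",               # unnamed patterns are matched but dropped
    count = "[0-9]+",
    " \\(",
    note = "[a-z]+",
    "\\)"
  ),
  too_few = "align_start"  # pad rows that stop early with NA
)

result
```

Without the too_few argument, the same call should stop with an error on the first row, because "apple 3" does not match the full pattern.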
Question 3
Now try using your code from question 2 to scrape the episode information for series 4. You should get an error.
Explain what goes wrong, and why you get that error (look at the Wikipedia page!).
To fix the error from question 3, we will get rid of the “Episode” titles from the masterclass. This can be done with a little help from the str_detect function, which returns TRUE if a string contains a match to the specified pattern.
For example:
## [1] TRUE TRUE FALSE
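The call that produced this output is not shown here. As a minimal sketch, with strings and a pattern that are made up for illustration, a call like the following gives a result of the same shape:

```r
library(stringr)

# Made-up strings, for illustration only
fruits <- c("apple", "banana", "cherry")

# TRUE wherever the string contains the pattern "a"
result <- str_detect(fruits, "a")
result
## [1]  TRUE  TRUE FALSE

# A logical vector like this can be used to keep or drop elements
fruits[!result]
## [1] "cherry"
```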
Question 4
Use the str_detect function to help remove the “Episode” titles which were causing the error in question 3. Then create the table of episode information:
## # A tibble: 10 × 2
## episode episode_name
## <chr> <chr>
## 1 1 Cakes
## 2 2 Bread
## 3 3 Desserts
## 4 4 Pies and Tarts
## 5 5 Biscuits and Traybakes
## 6 6 Sweet Dough
## 7 7 Pastry
## 8 8 Alternative Ingredients
## 9 9 French week
## 10 10 Final
Question 5
Adapting the code from question 4, iterate over series 2 – 13, and combine the episode information for all series into a single data frame that looks like this:
## episode episode_name series
## 1: 1 Cakes 2
## 2: 2 Tarts 2
## 3: 3 Bread 2
## 4: 4 Biscuits 2
## 5: 5 Pies 2
## ---
## 114: 6 Halloween 13
## 115: 7 Custard 13
## 116: 8 Pastry 13
## 117: 9 Pâtisserie 13
## 118: 10 Final 13
Hint: as in previous assignments, the paste0 and rbindlist functions may be useful.
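As a reminder of how these two functions fit together, here is a minimal sketch on toy data. The base URL and the helper make_table are hypothetical placeholders, and no scraping happens here: paste0 builds one string per series number, and rbindlist (from the data.table package) stacks a list of tables into one.

```r
library(data.table)

# Hypothetical base URL, for illustration only
base_url <- "https://example.com/series_"

# Hypothetical helper: builds a one-row table for series i
make_table <- function(i) {
  data.table(url = paste0(base_url, i), series = i)
}

# One table per series number, collected in a list
tables <- lapply(2:4, make_table)

# Stack the list of tables into a single table
combined <- rbindlist(tables)
combined
```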
Using your data from question 5, answer the following questions about the weekly themes across the series.
Question 6
How many episode themes have appeared only once?
Question 7
Which episode themes appear in every series?
Note: looking at the episode information, you will see that “Cakes” appears 7 times, and “Cake” appears 5 times. So, “Cake week” actually happens every series! The issue here is that “Cake” vs. “Cakes” looks different, but they are really the same theme (just singular vs. plural). How should we handle the issue with pluralization?
There are some other issues: should we count “Biscuits” the same as “Biscuits and Traybakes”? Are “Pies” the same as “Pies and Tarts”? And depending on how you wrote the regular expressions, some of the names might have trailing white space, which makes “Alternative Ingredients ” (with a trailing space) look different from “Alternative Ingredients”.
Handling the extra white space is straightforward: the trimws function in base R will do that for us, or we can modify our regular expressions. The other issues are more complicated; we may return to them at a later date.
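A minimal sketch of the white space fix, using a two-element vector typed by hand rather than the scraped data (the variable names are just for illustration):

```r
# The first string has a trailing space, the second does not
themes <- c("Alternative Ingredients ", "Alternative Ingredients")

# The two strings are not equal as written
themes[1] == themes[2]
## [1] FALSE

# trimws removes leading and trailing white space
trimmed <- trimws(themes)
trimmed[1] == trimmed[2]
## [1] TRUE
```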
Getting the contestants
Now let’s scrape some of the tabular data. Each series has a Wikipedia page, and these pages contain several tables. The first table is information about each baker, and then there are tables with the results for each episode.
Question 8
Use the Inspect tool in Chrome to find the selector for these tables on the series 2 page. There should be one selector that will identify all the tables we want: using the html_elements function with this selector, and then the html_table function, will produce a list of data frames. There should be 9 tables for series 2 (one table for the contestants, and 8 tables for the episodes).
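To illustrate how html_elements and html_table combine, here is a minimal offline sketch on a small hand-written HTML snippet. The class name demo and the table contents are made up, and the selector for the Wikipedia tables will be different. The snippet is parsed with rvest's minimal_html function, so no web page is needed.

```r
library(rvest)

# Hand-written HTML: two tables with class "demo", one with another class
html <- '
<table class="demo"><tr><th>name</th><th>score</th></tr><tr><td>A</td><td>1</td></tr></table>
<table class="demo"><tr><th>name</th><th>score</th></tr><tr><td>B</td><td>2</td></tr></table>
<table class="other"><tr><th>skip</th></tr><tr><td>x</td></tr></table>
'
page <- minimal_html(html)

# One selector picks out every table with the class we want
nodes <- html_elements(page, "table.demo")

# html_table on a set of nodes returns a list with one data frame per table
tables <- html_table(nodes)
length(tables)
tables[[1]]
```

Individual tables can then be pulled out of the list with double brackets, as in tables[[1]].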
Question 9
Look at the first table, which contains contestant information. How many contestants appeared on series 2?
Question 10
Repeat question 8 for series 5. What do you notice about the name of the “Baker” column?
Footnotes like this can cause a problem when we want to combine information across the series. Let’s sanitize the names of the tables to make sure the information can be combined.
Question 11
Use the rename and starts_with functions to rename the Baker[3] column to baker in series 5.
Question 12
Iterate over series 2 – 13, extracting the contestant table for each series, and combine the results into a single data frame (you can remove the Links column). The results should look like this:
## baker age occupation
## 1: Ben Frazer 31 Graphic Designer
## 2: Holly Bell 31 Advertising executive
## 3: Ian Vallance 40 Fundraiser for English Heritage
## 4: Janet Basu 63 Teacher of Modern Languages
## 5: Jason White 19 Civil Engineering Student
## ---
## 142: Marie-Therese "Maxy" Maligisa 29 Architectural assistant
## 143: Rebecca "Rebs" Lightbody 23 Masters student
## 144: Nelsandro "Sandro" Farmhouse 30 Nanny
## 145: Syabira Yusoff 32 Cardiovascular research associate
## 146: Will Hawkins 45 Former charity director
## hometown series
## 1: Northampton 2
## 2: Leicester 2
## 3: Dunstable, Bedfordshire 2
## 4: Formby, Liverpool 2
## 5: Croydon 2
## ---
## 142: London 13
## 143: County Antrim 13
## 144: London 13
## 145: London 13
## 146: London 13
Using your data from question 12, answer the following questions about the contestants over the course of the show.
Question 13
How many contestants participated in each series?
Question 14
Have the contestant ages changed over the show? Calculate relevant summary statistics and create a plot to address the question.