Homework 9

Due: Friday, November 10, 11:00am on Canvas

Instructions:

Download the HW 9 template, and open the template (a Quarto document) in RStudio.
Put your name in the file header.
Click Render.
Type all code and answers in the document (using ### for section headings and #### for question headings).
Render early and often to catch any errors!
When you are finished, submit the final rendered HTML to Canvas.

Resources: In addition to the class notes and activities, I recommend reading the following:

Chapter 16 in R for Data Science (2nd edition)
Chapter 25 in R for Data Science (2nd edition)

The Great British Bake Off

The Great British Bake Off (called the Great British Baking Show in the US because of trademark issues with Pillsbury – yes, really) is a British baking competition show. Each episode involves three challenges: a signature bake, a technical challenge, and a showstopper, all centered around a theme (bread week, cake week, pastry week, etc.). The participant who performs worst is eliminated (with a couple of rare exceptions), and the participant who performs best is awarded “star baker” for the week.

The goal of this assignment is to use web scraping and data wrangling (including working with strings) to collect and analyze data about the show. We will scrape the data from Wikipedia articles about the show.

Getting the episode names

We will begin with series 2 (series 1 had a slightly different format, so we will ignore it for now). The Wikipedia article on series 2 can be found at:

https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_(series_2)

If you scroll down, you will notice that there is an “Episodes” section, which contains the headings “Episode 1: Cakes”, “Episode 2: Tarts”, etc.

Question 1

Use the Inspect tool in Chrome to find the selector for these episode titles, then use the read_html, html_elements, and html_text2 functions to create a vector of the episode titles. The output should look like this:

## [1] "Episode 1: Cakes"                    "Episode 2: Tarts"                   
## [3] "Episode 3: Bread"                    "Episode 4: Biscuits"                
## [5] "Episode 5: Pies"                     "Episode 6: Desserts (Quarterfinals)"
## [7] "Episode 7: Pâtisserie (Semi-final)"  "Episode 8: Final"
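As a warm-up, here is a minimal sketch of how html_elements and html_text2 fit together. It runs on a made-up HTML snippet built with rvest's minimal_html rather than on the Wikipedia page, and the h2 selector is only an assumption for this toy document; the selector for the real episode titles is still yours to find with the Inspect tool.

```r
library(rvest)

# Toy HTML document (NOT the Wikipedia page); the tags and text are made up
page <- minimal_html('
  <h1>Report</h1>
  <h2>Part 1: Intro</h2>
  <p>Some text.</p>
  <h2>Part 2: Methods</h2>
')

# Select every <h2> element, then extract its text as a character vector
titles <- page |>
  html_elements("h2") |>
  html_text2()

titles
```

For a live page, read_html(url) takes the place of minimal_html; the rest of the pipeline has the same shape.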

Question 2

Create a data frame which contains the episode information in the following format:

## # A tibble: 8 × 2
##   episode episode_name 
##   <chr>   <chr>        
## 1 1       "Cakes"      
## 2 2       "Tarts"      
## 3 3       "Bread"      
## 4 4       "Biscuits"   
## 5 5       "Pies"       
## 6 6       "Desserts "  
## 7 7       "Pâtisserie "
## 8 8       "Final"

Follow the example of using separate_wider_regex from the slides.

Hint: the too_few = "align_start" option in separate_wider_regex is useful for handling optional patterns at the end of a string, like parentheses…
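To illustrate the hint, here is a small sketch on made-up strings (not the episode titles). The column names week and topic and the patterns themselves are assumptions chosen for this toy example, and it assumes a version of tidyr that provides separate_wider_regex (1.3.0 or later).

```r
library(tidyr)
library(tibble)

# Made-up strings: only the second one ends with a parenthetical
toy <- tibble(title = c("Week 1: Apples", "Week 2: Pears (Special)"))

toy_split <- toy |>
  separate_wider_regex(
    title,
    patterns = c(
      "Week ",           # unnamed pieces are matched, then dropped
      week = "\\d+",     # named pieces become columns
      ": ",
      topic = "[^(]+",   # everything up to (but not including) a "("
      "\\(.*\\)"         # the parenthetical, absent in the first string
    ),
    too_few = "align_start"  # keep rows that run out before the last pattern
  )

toy_split
```

Without too_few = "align_start", the first string would trigger an error because it has nothing to match the final pattern. Note also that the topic pattern keeps the space in front of the parenthesis, so the second topic comes out as "Pears " with trailing white space.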

Question 3

Now try using your code from question 2 to scrape the episode information for series 4. You should get an error.

Explain what goes wrong, and why you get that error (look at the Wikipedia page!).

To fix the error from question 3, we will remove the “Episode” headings that come from the masterclass. This can be done with a little help from the str_detect function, which returns TRUE if a string contains a match to the specified pattern.

For example:

str_detect(c("ab: c", "xy: z", "ab"), ":")

## [1]  TRUE  TRUE FALSE
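Because str_detect returns a logical vector, it can be used directly to subset. A brief sketch on the same toy vector (the variable names with_colon and without_colon are just for illustration):

```r
library(stringr)

x <- c("ab: c", "xy: z", "ab")

# Keep only the elements that contain a colon
with_colon <- x[str_detect(x, ":")]

# Keep only the elements that do NOT contain a colon, via the negate argument
without_colon <- x[str_detect(x, ":", negate = TRUE)]

with_colon
without_colon
```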

Question 4

Use the str_detect function to help remove the “Episode” titles which were causing the error in question 3. Then create the table of episode information:

## # A tibble: 10 × 2
##    episode episode_name           
##    <chr>   <chr>                  
##  1 1       Cakes                  
##  2 2       Bread                  
##  3 3       Desserts               
##  4 4       Pies and Tarts         
##  5 5       Biscuits and Traybakes 
##  6 6       Sweet Dough            
##  7 7       Pastry                 
##  8 8       Alternative Ingredients
##  9 9       French week            
## 10 10      Final

Question 5

Adapting the code from question 4, iterate over series 2 – 13, and combine the episode information for all series into a single data frame that looks like this:

##      episode episode_name series
##   1:       1        Cakes      2
##   2:       2        Tarts      2
##   3:       3        Bread      2
##   4:       4     Biscuits      2
##   5:       5         Pies      2
##  ---                            
## 114:       6    Halloween     13
## 115:       7      Custard     13
## 116:       8       Pastry     13
## 117:       9   Pâtisserie     13
## 118:      10        Final     13

Hint: as in previous assignments, the paste0 and rbindlist functions may be useful.
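As a reminder of how those two functions behave, here is a sketch that builds several URLs with paste0 and stacks a list of data frames with rbindlist. The helper make_piece is hypothetical and returns placeholder rows; it stands in for whatever scraping code you write, and no web requests are made here.

```r
library(data.table)

series_ids <- 2:4

# paste0 is vectorized, so this builds one URL per series number
urls <- paste0(
  "https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_(series_",
  series_ids, ")"
)

# Hypothetical stand-in for the scraping step: returns a small
# placeholder data frame tagged with its series number
make_piece <- function(s) {
  data.frame(episode = c("1", "2"), episode_name = c("A", "B"), series = s)
}

# One data frame per series, then stacked row-wise into a single table
pieces <- lapply(series_ids, make_piece)
combined <- rbindlist(pieces)

combined
```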

Using your data from question 5, answer the following questions about the weekly themes across the series.

Question 6

How many episode themes have appeared only once?

Question 7

Which episode themes appear in every series?

Note: looking at the episode information, you will see that “Cakes” appears 7 times, and “Cake” appears 5 times. So, “Cake week” actually happens every series! The issue here is that “Cake” vs. “Cakes” looks different, but they are really the same theme (just singular vs. plural). How should we handle the issue with pluralization?

There are some other issues: should we count “Biscuits” the same as “Biscuits and Traybakes”? Are “Pies” the same as “Pies and Tarts”? And depending on how you wrote the regular expressions, some of the names might have trailing white space, which makes “Alternative Ingredients ” (with a trailing space) look different from “Alternative Ingredients” (without one).

Handling the extra white space is straightforward: the trimws function in base R will do that for us, or we can modify our regular expressions. The other issues are more complicated; we may return to them at a later date.
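A quick sketch of the white space issue and the trimws fix, using a small hand-typed vector rather than the scraped data:

```r
# Hand-typed examples: the first has a trailing space, the third a leading one
themes <- c("Alternative Ingredients ", "Alternative Ingredients", " Pastry")

# The first two strings differ only by the trailing space
same_before <- themes[1] == themes[2]

# trimws removes leading and trailing white space (both sides by default)
cleaned <- trimws(themes)
same_after <- cleaned[1] == cleaned[2]

same_before
same_after
```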

Getting the contestants

Now let’s scrape some of the tabular data. Each series has a Wikipedia page, and these pages contain several tables. The first table is information about each baker, and then there are tables with the results for each episode.

Question 8

Use the Inspect tool in Chrome to find the selector for these tables on the series 2 page. There should be one selector that will identify all the tables we want: using the html_elements function with this selector, and then the html_table function, will produce a list of data frames. There should be 9 tables for series 2 (one table for the contestants, and 8 tables for the episodes).
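The sketch below shows the general shape of this step on a made-up HTML snippet with two small tables. The class name demo, the selector table.demo, and the table contents are all invented for illustration; the selector for the Wikipedia tables is something you will need to find yourself.

```r
library(rvest)

# Toy HTML with two tables (made-up class name and contents)
page <- minimal_html('
  <table class="demo">
    <tr><th>Baker</th><th>Age</th></tr>
    <tr><td>Alex</td><td>31</td></tr>
  </table>
  <table class="demo">
    <tr><th>Baker</th><th>Signature</th></tr>
    <tr><td>Alex</td><td>Lemon cake</td></tr>
  </table>
')

# html_elements returns every matching node; html_table then converts
# each node to a data frame, giving a list with one entry per table
tables <- page |>
  html_elements("table.demo") |>
  html_table()

length(tables)
tables[[1]]
```

Individual tables can then be pulled out of the list with double brackets, as in tables[[1]].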

Question 9

Look at the first table, which contains contestant information. How many contestants appeared on series 2?

Question 10

Repeat question 8 for series 5. What do you notice about the name of the “Baker” column?

Footnotes like this can cause a problem when we want to combine information across the series. Let’s sanitize the column names of the tables to make sure the information can be combined.

Question 11

Use the rename and starts_with functions to rename the Baker[3] column to baker for series 5.
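For reference, here is how rename and starts_with combine on a toy tibble. The column Name[1] and its contents are made up; the point is only that starts_with lets you pick a column by its prefix without typing the footnote marker.

```r
library(dplyr)
library(tibble)

# Toy tibble with a footnote-style column name (made up for illustration)
toy <- tibble(`Name[1]` = c("Alex", "Sam"), Age = c(31, 45))

# new_name = <selection>: starts_with("Name") matches the single
# column whose name begins with "Name"
renamed <- toy |>
  rename(name = starts_with("Name"))

names(renamed)
```

One advantage of selecting by prefix is that the same code should keep working if the footnote number differs from one page to the next.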

Question 12

Iterate over series 2 – 13, extracting the contestant table for each series, and combine the results into a single data frame (you can remove the Links column). The results should look like this:

##                              baker age                        occupation
##   1:                    Ben Frazer  31                  Graphic Designer
##   2:                    Holly Bell  31             Advertising executive
##   3:                  Ian Vallance  40   Fundraiser for English Heritage
##   4:                    Janet Basu  63       Teacher of Modern Languages
##   5:                   Jason White  19         Civil Engineering Student
##  ---                                                                    
## 142: Marie-Therese "Maxy" Maligisa  29           Architectural assistant
## 143:      Rebecca "Rebs" Lightbody  23                   Masters student
## 144:  Nelsandro "Sandro" Farmhouse  30                             Nanny
## 145:                Syabira Yusoff  32 Cardiovascular research associate
## 146:                  Will Hawkins  45           Former charity director
##                     hometown series
##   1:             Northampton      2
##   2:               Leicester      2
##   3: Dunstable, Bedfordshire      2
##   4:       Formby, Liverpool      2
##   5:                 Croydon      2
##  ---                               
## 142:                  London     13
## 143:           County Antrim     13
## 144:                  London     13
## 145:                  London     13
## 146:                  London     13

Using your data from question 12, answer the following questions about the contestants over the course of the show.

Question 13

How many contestants participated in each series?

Question 14

Have the contestant ages changed over the show? Calculate relevant summary statistics and create a plot to address the question.