Lecture 19: Intro to web scraping

Motivation: Taskmaster

Show information

  • 15 full series (currently on series 16)
  • Each series involves 5 contestants, competing over 5-10 episodes
  • Each episode involves approximately 5 tasks
  • Contestants are scored from 1-5 (roughly) on each task

Taskmaster data

https://taskmaster.fandom.com/wiki/Series_11

Goal and required steps

Goal: Explore the Taskmaster data across all completed series. Which contestants did worst? Which contestants did best? Did the scoring change over the series?

  • Scrape data from each series from the website
  • Combine, clean, and transform
  • Do statistics!

Scraping the data

library(rvest)
library(tidyverse)

tm <- read_html("https://taskmaster.fandom.com/wiki/Series_11")
tm
{html_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...
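
The goal covers every completed series, so the same read_html() call eventually has to be repeated for each series page. Below is a minimal sketch, assuming the other series pages follow the same URL pattern as Series_11 (an assumption; only the Series 11 URL appears in these notes). The helper name read_all_series and the one-second pause are hypothetical choices.

```r
library(rvest)

# Assumption: series pages are named Series_1, ..., Series_15
series_urls <- paste0("https://taskmaster.fandom.com/wiki/Series_", 1:15)

# Hypothetical helper: download and parse each page, pausing between requests
read_all_series <- function(urls, pause = 1) {
  lapply(urls, function(u) {
    Sys.sleep(pause)  # wait a moment so the server is not hit too quickly
    read_html(u)
  })
}

# Not run here because it needs network access:
# pages <- read_all_series(series_urls)
```

Each element of the resulting list would be a parsed page like tm above.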

HTML basics

Here is a basic HTML page:

<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id='first'>A heading</h1>
  <p>Some text &amp; <b>some bold text.</b></p>
  <img src='myimg.png' width='100' height='100'>
</body>
</html>

Some HTML elements

  • <html>: start of the HTML page
  • <head>: header information (metadata about the page)
  • <body>: everything that is on the page
  • <p>: paragraphs
  • <b>: bold
  • <table>: table

Extracting HTML elements

The Taskmaster data we want looks like it is stored in a table. How can we extract it?

tm |>
  html_elements("table")
{xml_nodeset (4)}
[1] <table style="width: 100%; text-align: center; border: 1px solid #891100; ...
[2] <table class="toccolours" align="center" style="background: #891100; colo ...
[3] <table class="pi-horizontal-group">\n<caption class="pi-header pi-seconda ...
[4] <table class="tmtable"><tbody>\n<tr class="tmtableheader">\n<th>Task\n</t ...

html_elements() returns all the elements matching the CSS selector (here, four tables); html_element() returns only the first match.

How do we know which table we want?

Finding the right selectors

  1. Open the webpage in Chrome
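  2. Right-click on the element you want, and click “Inspect”
  3. In the Elements panel, note the tag name and attributes (such as class) of the highlighted element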
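
The same information can also be checked from R by listing the class attribute of each table. Below is a minimal sketch on a made-up two-table page; the page contents are hypothetical stand-ins built with minimal_html() from rvest, not the real wiki page.

```r
library(rvest)

# Hypothetical stand-in for a page with several tables
page <- minimal_html("
  <table class='toccolours'><tr><td>Navigation</td></tr></table>
  <table class='tmtable'>
    <tr><th>Task</th><th>Alice</th></tr>
    <tr><td>1</td><td>5</td></tr>
  </table>
")

# Grab every table, then read each table's class attribute
tables <- page |> html_elements("table")
table_classes <- tables |> html_attr("class")
```

Here table_classes holds one class string per table, which makes it easier to spot a distinctive class to use in a selector.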

Finding the right selector

tm |> 
  html_element("[class='tmtable']") |> 
  html_table()
# A tibble: 75 × 7
   Task               Description `Charlotte Ritchie` `Jamali Maddix` `Lee Mack`
   <chr>              <chr>       <chr>               <chr>           <chr>     
 1 Episode 1: It's n… Episode 1:… Episode 1: It's no… Episode 1: It'… Episode 1…
 2 1                  Prize: Bes… 1                   2               4         
 3 2                  Do the mos… 2                   3[1]            3         
 4 3                  Catch the … DQ                  1               5         
 5 4                  Deliver al… 2                   1               5         
 6 5                  Live: Stac… 0                   0               0         
 7 Total              Total       5                   7               17        
 8 Episode 2: The Lu… Episode 2:… Episode 2: The Lur… Episode 2: The… Episode 2…
 9 1                  Prize: Bes… 5                   1               2         
10 2                  Make the b… 0                   5               0         
# ℹ 65 more rows
# ℹ 2 more variables: `Mike Wozniak` <chr>, `Sarah Kendall` <chr>
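
The score columns come back as character vectors because of entries like "3[1]" (a footnote marker) and "DQ". Below is a minimal cleaning sketch on a made-up table. The column name "Contestant A" is hypothetical, and treating "DQ" as missing is an assumption, not a rule from the show.

```r
library(rvest)

# Hypothetical stand-in for a small scores table
page <- minimal_html("
  <table class='tmtable'>
    <tr><th>Task</th><th>Contestant A</th></tr>
    <tr><td>1</td><td>2</td></tr>
    <tr><td>2</td><td>3[1]</td></tr>
    <tr><td>3</td><td>DQ</td></tr>
  </table>
")

raw <- page |> html_element(".tmtable") |> html_table()

# Drop a trailing footnote marker such as "[1]", then convert to numbers.
# Non-numeric entries such as "DQ" become NA (assumption: DQ is treated as missing).
scores_chr <- sub("\\[[0-9]+\\]$", "", raw[["Contestant A"]])
scores <- suppressWarnings(as.numeric(scores_chr))
```

On the real table, the episode header rows and "Total" rows would also need to be filtered out before any analysis.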

The class selector ".tmtable" is a shorter way to pick out the same table; the result is identical to the tibble shown above:

tm |>
  html_element(".tmtable") |>
  html_table()
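
The two selectors agree on this page but are not equivalent in general: "[class='tmtable']" requires the class attribute to be exactly "tmtable", while ".tmtable" matches any element that has tmtable among its classes. Below is a small sketch on a made-up page; the second table and its extra class "wide" are hypothetical.

```r
library(rvest)

# Hypothetical page: one table with a single class, one with two classes
page <- minimal_html("
  <table class='tmtable'><tr><th>Task</th></tr><tr><td>1</td></tr></table>
  <table class='tmtable wide'><tr><th>Task</th></tr><tr><td>2</td></tr></table>
")

# Exact attribute match: only the first table qualifies
n_exact <- page |> html_elements("[class='tmtable']") |> length()

# Class selector: both tables qualify
n_class <- page |> html_elements(".tmtable") |> length()
```

The class selector is usually the more robust choice, since it keeps working if the site adds a second class to the table.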

Extracting non-tabular data

https://taskmaster.fandom.com/wiki/Charlotte_Ritchie

How would we scrape Charlotte Ritchie’s birthday?

Identifying the right selector

read_html("https://taskmaster.fandom.com/wiki/Charlotte_Ritchie") |>
  html_element("[data-source='born']")
{html_node}
<div class="pi-item pi-data pi-item-spacing pi-border-color" data-source="born">
[1] <h3 class="pi-data-label pi-secondary-font">Born</h3>
[2] <div class="pi-data-value pi-font">29 August 1989</div>

read_html("https://taskmaster.fandom.com/wiki/Charlotte_Ritchie") |>
  html_element("[data-source='born'] > .pi-font")
{html_node}
<div class="pi-data-value pi-font">

read_html("https://taskmaster.fandom.com/wiki/Charlotte_Ritchie") |>
  html_element("[data-source='born'] > .pi-font") |>
  html_text2()
[1] "29 August 1989"
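
To reuse this for several contestants, the selector and text extraction can be wrapped in a function that takes a parsed page. Below is a minimal sketch; the helper name get_born is hypothetical, and the test page is a cut-down stand-in that copies only the infobox markup shown above.

```r
library(rvest)

# Hypothetical helper: pull the "Born" value out of a parsed contestant page
get_born <- function(page) {
  page |>
    html_element("[data-source='born'] > .pi-font") |>
    html_text2()
}

# Stand-in for a contestant page (markup copied from the output above)
page <- minimal_html("
  <div class='pi-item pi-data' data-source='born'>
    <h3 class='pi-data-label pi-secondary-font'>Born</h3>
    <div class='pi-data-value pi-font'>29 August 1989</div>
  </div>
")

born <- get_born(page)
```

With network access, the same helper could be applied to read_html() of each contestant URL, assuming other contestant pages use the same infobox layout.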

Accessing HTML attributes

read_html("https://taskmaster.fandom.com/wiki/Series_11") |>
  html_elements("td")
{xml_nodeset (436)}
 [1] <td>\n<table class="toccolours" align="center" style="background: #89110 ...
 [2] <td>\n<a href="/wiki/Series_10" title="Series 10"><span style="color: #F ...
 [3] <td align="center">\n<span style="font-family: Veteran Typewriter;"><a h ...
 [4] <td class="pi-horizontal-group-item pi-data-value pi-font pi-border-colo ...
 [5] <td class="pi-horizontal-group-item pi-data-value pi-font pi-border-colo ...
 [6] <td colspan="7">Episode 1: <span style="font-family: Veteran Typewriter; ...
 [7] <td>\n<a href="/wiki/Best_thing_you_can_carry,_but_only_just" title="Bes ...
 [8] <td>\n<b>Prize:</b> Best thing you can carry, but only just.\n</td>
 [9] <td>1\n</td>
[10] <td>2\n</td>
[11] <td>4\n</td>
[12] <td>\n<b>5</b>\n</td>
[13] <td>3\n</td>
[14] <td>\n<a href="/wiki/Do_the_most_impressive_thing_under_the_table_with_o ...
[15] <td>Do the most impressive thing under the table with one hand. You must ...
[16] <td>2\n</td>
[17] <td>3<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[1]</ ...
[18] <td>3\n</td>
[19] <td>\n<b>5</b>\n</td>
[20] <td>4\n</td>
...

read_html("https://taskmaster.fandom.com/wiki/Series_11") |>
  html_elements("td[align='center']")
{xml_nodeset (1)}
[1] <td align="center">\n<span style="font-family: Veteran Typewriter;"><a hr ...

read_html("https://taskmaster.fandom.com/wiki/Series_11") |>
  html_elements("td[align='center'] > span")
{xml_nodeset (10)}
 [1] <span style="font-family: Veteran Typewriter;"><a href="/wiki/It%27s_not ...
 [2] <span style="font-family: Veteran Typewriter;"><a href="/wiki/The_Lure_o ...
 [3] <span style="font-family: Veteran Typewriter;"><a href="/wiki/Run_up_a_t ...
 [4] <span style="font-family: Veteran Typewriter;"><a href="/wiki/Premature_ ...
 [5] <span style="font-family: Veteran Typewriter;"><a href="/wiki/Slap_and_t ...
 [6] <span style="font-family: Veteran Typewriter;"><a href="/wiki/Absolute_c ...
 [7] <span style="font-family: Veteran Typewriter;"><a href="/wiki/You%27ve_g ...
 [8] <span style="font-family: Veteran Typewriter;"><a href="/wiki/An_orderly ...
 [9] <span style="font-family: Veteran Typewriter;"><a href="/wiki/Mr_Octopus ...
[10] <span style="font-family: Veteran Typewriter;"><a href="/wiki/Activate_J ...

read_html("https://taskmaster.fandom.com/wiki/Series_11") |>
  html_elements("td[align='center'] > span") |>
  html_element("a")
{xml_nodeset (10)}
 [1] <a href="/wiki/It%27s_not_your_fault." title="It's not your fault.">It's ...
 [2] <a href="/wiki/The_Lure_of_the_Treacle_Puppies." title="The Lure of the  ...
 [3] <a href="/wiki/Run_up_a_tree_to_the_moon." title="Run up a tree to the m ...
 [4] <a href="/wiki/Premature_conker." title="Premature conker.">Premature co ...
 [5] <a href="/wiki/Slap_and_tong." title="Slap and tong.">Slap and tong.</a>
 [6] <a href="/wiki/Absolute_casserole." title="Absolute casserole.">Absolute ...
 [7] <a href="/wiki/You%27ve_got_no_chutzpah." title="You've got no chutzpah. ...
 [8] <a href="/wiki/An_orderly_species." title="An orderly species.">An order ...
 [9] <a href="/wiki/Mr_Octopus_and_Pottyhands." title="Mr Octopus and Pottyha ...
[10] <a href="/wiki/Activate_Jamali." title="Activate Jamali.">Activate Jamal ...

read_html("https://taskmaster.fandom.com/wiki/Series_11") |>
  html_elements("td[align='center'] > span") |>
  html_element("a") |>
  html_attr("href")
 [1] "/wiki/It%27s_not_your_fault."          
 [2] "/wiki/The_Lure_of_the_Treacle_Puppies."
 [3] "/wiki/Run_up_a_tree_to_the_moon."      
 [4] "/wiki/Premature_conker."               
 [5] "/wiki/Slap_and_tong."                  
 [6] "/wiki/Absolute_casserole."             
 [7] "/wiki/You%27ve_got_no_chutzpah."       
 [8] "/wiki/An_orderly_species."             
 [9] "/wiki/Mr_Octopus_and_Pottyhands."      
[10] "/wiki/Activate_Jamali."                
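
The href values are relative paths, so they need the site address in front before they can be passed to read_html(). Below is a minimal sketch on a made-up fragment that mimics the structure above (two of the episode links inside a centered cell). Pasting the base URL on with paste0() is one simple option; xml2::url_absolute() is an alternative.

```r
library(rvest)

# Stand-in fragment: a centered cell holding two episode links
page <- minimal_html("
  <table><tr><td align='center'>
    <span><a href='/wiki/Slap_and_tong.' title='Slap and tong.'>Slap and tong.</a></span>
    <span><a href='/wiki/Premature_conker.' title='Premature conker.'>Premature conker.</a></span>
  </td></tr></table>
")

links <- page |>
  html_elements("td[align='center'] > span") |>
  html_element("a")

# html_attr() can read any attribute, not just href
hrefs  <- links |> html_attr("href")
titles <- links |> html_attr("title")

# Turn the relative paths into full URLs
full_urls <- paste0("https://taskmaster.fandom.com", hrefs)
```

html_attr() returns NA for an element that lacks the requested attribute, so missing links show up as NA rather than causing an error.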

Class activity

https://sta279-f23.github.io/class_activities/ca_lecture_19.html