Class activity

Beatles songs!

In this class activity, you will work with data on Beatles songs. Our goal is to analyze patterns in the songwriters, vocals, and themes.

The data is available in a table at

http://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles

  1. Scrape the table of Beatles songs from the Wikipedia article, and store it as a data frame called songs in R. You can ignore the “Other released songs” on the Wikipedia page.

  2. Now let’s clean up the names of some of the columns. Rename the columns and select only the song name, album, writers, vocals, and year columns, so the output looks something like this:

songs
## # A tibble: 213 × 5
##    Song                            album                 writers    vocals  Year
##    <chr>                           <chr>                 <chr>      <chr>  <int>
##  1 "\"Across the Universe\"[e]"    Let It BePast Masters LennonMcC… Lennon  1969
##  2 "\"Act Naturally\""             Help!                 Johnny Ru… Starr   1965
##  3 "\"All I've Got to Do\""        With the Beatles      LennonMcC… Lennon  1963
##  4 "\"All My Loving\""             With the Beatles      LennonMcC… McCar…  1963
##  5 "\"All Together Now\""          Yellow Submarine      LennonMcC… McCar…  1969
##  6 "\"All You Need Is Love\"[f] #" Magical Mystery Tour  LennonMcC… Lennon  1967
##  7 "\"And I Love Her\""            A Hard Day's Night    LennonMcC… McCar…  1964
##  8 "\"And Your Bird Can Sing\""    Revolver              LennonMcC… Lennon  1966
##  9 "\"Anna (Go to Him)\""          Please Please Me      Arthur Al… Lennon  1963
## 10 "\"Another Girl\""              Help!                 LennonMcC… McCar…  1965
## # ℹ 203 more rows
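The scrape, rename, and select steps can be tried out on a small scale first. The following is a minimal sketch, assuming the rvest and dplyr packages are installed; the HTML table, its headers, and the names toy_page and toy_songs are all invented for illustration and do not match the Wikipedia page.

```r
library(rvest)
library(dplyr)

# Hypothetical stand-in for a web page: a tiny HTML table with invented headers.
toy_page <- minimal_html('
  <table>
    <tr><th>Song</th><th>Album(s)</th><th>Year</th><th>Notes</th></tr>
    <tr><td>Act Naturally</td><td>Help!</td><td>1965</td><td>x</td></tr>
    <tr><td>All My Loving</td><td>With the Beatles</td><td>1963</td><td>y</td></tr>
  </table>
')

toy_songs <- toy_page |>
  html_element("table") |>        # first <table> in the document
  html_table() |>                 # parse it into a tibble
  rename(album = `Album(s)`) |>   # rename uses new_name = old_name
  select(Song, album, Year)       # keep only the columns of interest

toy_songs
```

For a real page, read_html() with the page URL would take the place of minimal_html(). A page can contain several tables, so it is worth inspecting the result to check which one was picked up.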

One of our goals is to examine the Beatles song titles. Before we do, however, we need to clean up the song titles. Right now, our titles look like this:

head(songs$Song)
## [1] "\"Across the Universe\"[e]"    "\"Act Naturally\""            
## [3] "\"All I've Got to Do\""        "\"All My Loving\""            
## [5] "\"All Together Now\""          "\"All You Need Is Love\"[f] #"

We want to remove the quotes (the \") and the footnotes and links ([f], #, etc.).

  3. Use the str_remove_all function to remove all of the quotes (\") from the song titles. Then use the str_extract function (with a judiciously chosen regular expression) to extract the portion of each song title you want to keep (omitting the footnotes, links, etc.). Think carefully about what to keep in the titles: generally, the words in parentheses are part of the title!

To create the correct regular expression, you may find that .+? is useful. In a regular expression, .+ is greedy: it will take the longest substring that matches. In contrast, .+? is reluctant: it will take the shortest substring that matches. For example:

str_extract("I want you (she's so heavy) (5)", ".+?(?= \\()")
## [1] "I want you"
str_extract("I want you (she's so heavy) (5)", ".+(?= \\()")
## [1] "I want you (she's so heavy)"
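Here is a further sketch of the same building blocks on made-up strings (not song titles), assuming the stringr package is loaded. It also shows one thing to watch for: a lookahead that never matches makes the whole extraction return NA, and allowing the end of the string ($) as an alternative stopping point avoids that.

```r
library(stringr)

# str_remove_all deletes every match of the pattern (here, each double quote)
no_quotes <- str_remove_all("\"Hello\" \"World\"", "\"")

toy <- c("alpha[1] beta[2]", "gamma")  # invented strings for illustration

# Reluctant: stop just before the first "["
lazy <- str_extract(toy, ".+?(?=\\[)")

# Greedy: stop just before the last "["
greedy <- str_extract(toy, ".+(?=\\[)")

# Reluctant, but also allowed to stop at the end of the string
lazy_or_end <- str_extract(toy, ".+?(?=\\[|$)")

no_quotes     # "Hello World"
lazy          # "alpha" NA
greedy        # "alpha[1] beta" NA
lazy_or_end   # "alpha" "gamma"
```

The second string contains no "[", so the first two patterns have nothing to stop at and return NA for it.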

Now that we have cleaned up the song titles, let’s look at the words in those titles. The first thing we want to do is extract each of the individual words from each title. This is called tokenizing, and can be done with the unnest_tokens function in the tidytext package.

  4. Run the following code to extract each word from each song title. What does a row in the resulting data frame represent?

library(tidytext)
# split each title into words; drop = FALSE keeps the original Song column
songs |>
  unnest_tokens(word, Song, drop = FALSE)
## # A tibble: 722 × 6
##    Song                album                 writers          vocals  Year word 
##    <chr>               <chr>                 <chr>            <chr>  <int> <chr>
##  1 Across the Universe Let It BePast Masters LennonMcCartney  Lennon  1969 acro…
##  2 Across the Universe Let It BePast Masters LennonMcCartney  Lennon  1969 the  
##  3 Across the Universe Let It BePast Masters LennonMcCartney  Lennon  1969 univ…
##  4 Act Naturally       Help!                 Johnny RussellV… Starr   1965 act  
##  5 Act Naturally       Help!                 Johnny RussellV… Starr   1965 natu…
##  6 All I've Got to Do  With the Beatles      LennonMcCartney  Lennon  1963 all  
##  7 All I've Got to Do  With the Beatles      LennonMcCartney  Lennon  1963 i've 
##  8 All I've Got to Do  With the Beatles      LennonMcCartney  Lennon  1963 got  
##  9 All I've Got to Do  With the Beatles      LennonMcCartney  Lennon  1963 to   
## 10 All I've Got to Do  With the Beatles      LennonMcCartney  Lennon  1963 do   
## # ℹ 712 more rows
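To see what tokenizing does to capitalization and punctuation, here is a minimal sketch on a two-row toy data frame, assuming the tibble and tidytext packages are installed. The names toy_titles and toy_tokens are made up for this example.

```r
library(tibble)
library(tidytext)

toy_titles <- tibble(title = c("Hello, Goodbye", "Help!"))

toy_tokens <- toy_titles |>
  # output column is `word`, input column is `title`;
  # drop = FALSE keeps the original title next to each word
  unnest_tokens(word, title, drop = FALSE)

toy_tokens
```

With the default settings, the words come back in lowercase and the comma and exclamation mark are stripped, so the word column holds hello, goodbye, and help.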

Once we’ve tokenized the song names, we notice that there are a lot of stop words in the titles, like the, a, and, to, etc. To remove these stop words, we will use the get_stopwords function in the tidytext package, which loads a set of stop words that we can ignore using an anti-join.

  5. Run the following code to remove stop words from the tokenized song titles:

library(dplyr)
song_words <- songs |>
  unnest_tokens(word, Song, drop = FALSE) |>
  # keep only the rows whose word is not in the stop word list
  anti_join(get_stopwords(), by = join_by(word))

song_words
## # A tibble: 423 × 6
##    Song                 album                 writers         vocals  Year word 
##    <chr>                <chr>                 <chr>           <chr>  <int> <chr>
##  1 Across the Universe  Let It BePast Masters LennonMcCartney Lennon  1969 acro…
##  2 Across the Universe  Let It BePast Masters LennonMcCartney Lennon  1969 univ…
##  3 Act Naturally        Help!                 Johnny Russell… Starr   1965 act  
##  4 Act Naturally        Help!                 Johnny Russell… Starr   1965 natu…
##  5 All I've Got to Do   With the Beatles      LennonMcCartney Lennon  1963 got  
##  6 All My Loving        With the Beatles      LennonMcCartney McCar…  1963 lovi…
##  7 All Together Now     Yellow Submarine      LennonMcCartney McCar…  1969 toge…
##  8 All Together Now     Yellow Submarine      LennonMcCartney McCar…  1969 now  
##  9 All You Need Is Love Magical Mystery Tour  LennonMcCartney Lennon  1967 need 
## 10 All You Need Is Love Magical Mystery Tour  LennonMcCartney Lennon  1967 love 
## # ℹ 413 more rows
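The anti-join used above can be illustrated in isolation. This is a minimal sketch, assuming dplyr (version 1.1.0 or later, for join_by) and tibble are installed. It uses a hand-made three-word stop list rather than the one returned by get_stopwords(), and the names toy_words, toy_stops, and kept are invented.

```r
library(dplyr)
library(tibble)

toy_words <- tibble(word = c("the", "long", "and", "winding", "road"))
toy_stops <- tibble(word = c("the", "and", "a"))  # hand-made list, for illustration only

# anti_join keeps the rows of the first table that have NO match in the second
kept <- anti_join(toy_words, toy_stops, by = join_by(word))

kept$word   # "long" "winding" "road"
```

An anti-join only filters rows: it never adds columns from the second table, which is why song_words still has the same six columns after the stop words are removed.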

OK, now we’ve got the words we want!

  6. What are the most common words that appear in Beatles song titles? Can you get a sense of what themes they tend to sing about?

We can do more analysis than just looking at the words. One approach to text analysis is sentiment analysis. In a sentiment analysis, each word is given a sentiment score (e.g., how positive or negative that word is).

The key to sentiment analysis is a lexicon: a list of English words, with each word assigned a numeric value identifying how positive or negative that word is. The lexicon we will use is the AFINN lexicon. To load this lexicon into R, use the following:

library(textdata)
afinn <- get_sentiments("afinn")
head(afinn)
## # A tibble: 6 × 2
##   word       value
##   <chr>      <dbl>
## 1 abandon       -2
## 2 abandoned     -2
## 3 abandons      -2
## 4 abducted      -2
## 5 abduction     -2
## 6 abductions    -2

The AFINN lexicon assigns words a score between -5 and 5. Negative scores indicate a negative sentiment and positive scores indicate a positive sentiment. The closer to 0 that a word is scored, the more neutral the word.
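Looking up scores in a lexicon is a join on the word column. The sketch below uses a tiny invented lexicon so that it runs without downloading anything; the scores are made up and are not actual AFINN values, and the names toy_lexicon, toy_words, scored, and scored_all are hypothetical. It assumes dplyr (1.1.0 or later) and tibble are installed.

```r
library(dplyr)
library(tibble)

# Invented scores for illustration; NOT the real AFINN values
toy_lexicon <- tibble(word = c("happy", "sad"), value = c(3, -2))

toy_words <- tibble(
  title = c("Happy Days", "Happy Days", "Sad Song", "Sad Song"),
  word  = c("happy", "days", "sad", "song")
)

# inner_join keeps only the words that appear in the lexicon
scored <- inner_join(toy_words, toy_lexicon, by = join_by(word))

# left_join keeps every word; words missing from the lexicon get NA
scored_all <- left_join(toy_words, toy_lexicon, by = join_by(word))

scored
scored_all
```

The choice of join matters: with an inner join, words that the lexicon does not cover simply disappear, while a left join keeps them with a missing score that has to be handled before summing or averaging.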

  7. Use a join to combine the afinn lexicon with your data frame containing all the song title words.

  8. On average, are Beatles song titles positive or negative?

  9. Calculate a sentiment score for each song by summing the sentiment scores for the words in the song title. Then plot the distribution of these song-level sentiment scores.

  10. Which songs have the most positive titles? The most negative?

If you happen to be familiar with the Beatles discography: do songs with positive titles also have positive lyrics?