In this class activity, you will work with data on Beatles songs. Our goal is to analyze patterns in the songwriters, vocals, and themes.
The data is available in a table at
http://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles
Scrape the table of Beatles songs from the Wikipedia article, and store it as a data frame called songs in R. You can ignore the “Other released songs” section on the Wikipedia page.
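One possible way to approach the scraping step is with the rvest package. The sketch below runs on a tiny inline HTML table (a made-up stand-in, so it works offline); for the activity you would pass the Wikipedia URL to read_html() instead. Which table on the live page holds the main song list is an assumption you should check by inspecting the result.

```r
library(rvest)

# Tiny inline stand-in for the Wikipedia page, so the sketch runs offline
page <- minimal_html('
  <table class="wikitable">
    <tr><th>Song</th><th>Year</th></tr>
    <tr><td>"Act Naturally"</td><td>1965</td></tr>
    <tr><td>"Another Girl"</td><td>1965</td></tr>
  </table>
')

# For the activity, read the real page instead:
# page <- read_html("http://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles")

# html_table() returns one tibble per <table> element found on the page
tables <- page |>
  html_elements("table") |>
  html_table()

# Assumption: the table you want is not guaranteed to be the first one on
# the live page; inspect `tables` and pick the right index
songs <- tables[[1]]
songs
```

On the real page, length(tables) and names(tables[[i]]) are quick ways to find the table containing the main song list.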
Now let’s clean up the names of some of the columns. Rename the columns and select only the song name, album, writers, vocals, and year columns, so the output looks something like this:
## # A tibble: 213 × 5
## Song album writers vocals Year
## <chr> <chr> <chr> <chr> <int>
## 1 "\"Across the Universe\"[e]" Let It BePast Masters LennonMcC… Lennon 1969
## 2 "\"Act Naturally\"" Help! Johnny Ru… Starr 1965
## 3 "\"All I've Got to Do\"" With the Beatles LennonMcC… Lennon 1963
## 4 "\"All My Loving\"" With the Beatles LennonMcC… McCar… 1963
## 5 "\"All Together Now\"" Yellow Submarine LennonMcC… McCar… 1969
## 6 "\"All You Need Is Love\"[f] #" Magical Mystery Tour LennonMcC… Lennon 1967
## 7 "\"And I Love Her\"" A Hard Day's Night LennonMcC… McCar… 1964
## 8 "\"And Your Bird Can Sing\"" Revolver LennonMcC… Lennon 1966
## 9 "\"Anna (Go to Him)\"" Please Please Me Arthur Al… Lennon 1963
## 10 "\"Another Girl\"" Help! LennonMcC… McCar… 1965
## # ℹ 203 more rows
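A minimal sketch of the rename-and-select step, using dplyr. The toy data frame and its original column names are assumptions about what the scraped table looks like; run names() on your own scrape and adjust accordingly.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the scraped table. The column names here are assumptions
# about the Wikipedia headers, not taken from the activity.
songs_raw <- tibble(
  Song = c("\"Act Naturally\"", "\"Another Girl\""),
  `Core catalogue release(s)` = c("Help!", "Help!"),
  `Songwriter(s)` = c("Johnny Russell", "LennonMcCartney"),
  `Lead vocal(s)` = c("Starr", "McCartney"),
  Year = c(1965L, 1965L),
  `Ref(s)` = c("[1]", "[2]")
)

# select() can rename while it selects, using new_name = old_name
songs <- songs_raw |>
  select(
    Song,
    album = `Core catalogue release(s)`,
    writers = `Songwriter(s)`,
    vocals = `Lead vocal(s)`,
    Year
  )
songs
```

Renaming inside select() avoids a separate rename() call; columns that are not listed (here the hypothetical reference column) are dropped.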
One of our goals is to examine the Beatles song titles. Before we do, however, we need to clean up the song titles. Right now, our titles look like this:
## [1] "\"Across the Universe\"[e]" "\"Act Naturally\""
## [3] "\"All I've Got to Do\"" "\"All My Loving\""
## [5] "\"All Together Now\"" "\"All You Need Is Love\"[f] #"
We want to remove the quotes (the \") and the footnotes and links ([f], #, etc.).
Use the str_remove_all function to remove all of the quotes (\") from the song titles. Then use the str_extract_all function (with a judiciously chosen regular expression) to extract the portion of each song title you want to keep (omitting the footnotes, links, etc.). Think carefully about what to keep in the titles: generally, the words in parentheses are part of the title!
To create the correct regular expression, you may find that .+? is useful. In a regular expression, .+ is greedy: it will take the longest substring that still lets the rest of the pattern match. In contrast, .+? is reluctant: it will take the shortest substring that still lets the rest of the pattern match. For example:
## [1] "I want you"
## [1] "I want you (she's so heavy)"
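The code that produced the two outputs above is not shown. The following pair of calls is one way to reproduce them with stringr; the lookahead pattern is an assumption chosen for illustration, not necessarily the pattern the activity intends.

```r
library(stringr)

title <- "I want you (she's so heavy)"

# Reluctant: .+? stops at the first position where the lookahead succeeds,
# which is just before " ("
reluctant <- str_extract(title, ".+?(?= \\(|$)")
reluctant

# Greedy: .+ grabs as much as it can, so the match runs to the end of the string
greedy <- str_extract(title, ".+(?= \\(|$)")
greedy

# Removing every double quote from a raw title
unquoted <- str_remove_all("\"Act Naturally\"", "\"")
unquoted
```

The only difference between the first two patterns is the ? after .+, which is what switches the quantifier from greedy to reluctant.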
Now that we have cleaned up the song titles, let’s look at the words in those titles. The first thing we want to do is extract each of the individual words from each title. This is called tokenizing, and can be done with the unnest_tokens function in the tidytext package.
## # A tibble: 722 × 6
## Song album writers vocals Year word
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 Across the Universe Let It BePast Masters LennonMcCartney Lennon 1969 acro…
## 2 Across the Universe Let It BePast Masters LennonMcCartney Lennon 1969 the
## 3 Across the Universe Let It BePast Masters LennonMcCartney Lennon 1969 univ…
## 4 Act Naturally Help! Johnny RussellV… Starr 1965 act
## 5 Act Naturally Help! Johnny RussellV… Starr 1965 natu…
## 6 All I've Got to Do With the Beatles LennonMcCartney Lennon 1963 all
## 7 All I've Got to Do With the Beatles LennonMcCartney Lennon 1963 i've
## 8 All I've Got to Do With the Beatles LennonMcCartney Lennon 1963 got
## 9 All I've Got to Do With the Beatles LennonMcCartney Lennon 1963 to
## 10 All I've Got to Do With the Beatles LennonMcCartney Lennon 1963 do
## # ℹ 712 more rows
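Output like the table above could be produced by a call along these lines. This is a sketch on a two-row toy data frame (an assumption standing in for the cleaned songs data); drop = FALSE keeps the original Song column next to the new word column.

```r
library(dplyr)
library(tidytext)
library(tibble)

# Toy stand-in for the cleaned songs data frame
songs <- tibble(
  Song = c("Across the Universe", "Act Naturally"),
  Year = c(1969L, 1965L)
)

# One row per word; unnest_tokens() lowercases the tokens by default
song_tokens <- songs |>
  unnest_tokens(word, Song, drop = FALSE)
song_tokens
```

Each title is repeated once per word, which is why the tokenized data frame has many more rows than the original.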
Once we’ve tokenized the song names, we notice that there are a lot of stop words in the titles, like the, a, and, to, etc. To remove these stop words, we will use the get_stopwords function in the tidytext package, which loads a set of stop words that we can ignore using an anti-join.
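library(dplyr)
library(tidytext)

# Tokenize the titles, then drop any word that appears in the stop word list
song_words <- songs |>
  unnest_tokens(word, Song, drop = FALSE) |>
  anti_join(get_stopwords(), by = join_by(word))
song_words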
## # A tibble: 423 × 6
## Song album writers vocals Year word
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 Across the Universe Let It BePast Masters LennonMcCartney Lennon 1969 acro…
## 2 Across the Universe Let It BePast Masters LennonMcCartney Lennon 1969 univ…
## 3 Act Naturally Help! Johnny Russell… Starr 1965 act
## 4 Act Naturally Help! Johnny Russell… Starr 1965 natu…
## 5 All I've Got to Do With the Beatles LennonMcCartney Lennon 1963 got
## 6 All My Loving With the Beatles LennonMcCartney McCar… 1963 lovi…
## 7 All Together Now Yellow Submarine LennonMcCartney McCar… 1969 toge…
## 8 All Together Now Yellow Submarine LennonMcCartney McCar… 1969 now
## 9 All You Need Is Love Magical Mystery Tour LennonMcCartney Lennon 1967 need
## 10 All You Need Is Love Magical Mystery Tour LennonMcCartney Lennon 1967 love
## # ℹ 413 more rows
Ok, now we’ve got the words we want!
We can do more analysis than just looking at the words. One approach to text analysis is sentiment analysis. In a sentiment analysis, each word is given a sentiment score (e.g., how positive or negative that word is).
The key ingredient of a sentiment analysis is a lexicon: a list of English words, with each word assigned a numeric value identifying how positive or negative that word is. The lexicon we will use is the AFINN lexicon. To load this lexicon into R, use the following:
## # A tibble: 6 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
The AFINN lexicon assigns words a score between -5 and 5. Negative scores indicate a negative sentiment and positive scores indicate a positive sentiment. The closer to 0 that a word is scored, the more neutral the word.
Use a join to combine the afinn lexicon with your data frame containing all the song title words.
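As a rough sketch of the join, the example below uses two small toy data frames. Both afinn_toy and its scores are illustrative stand-ins (assumptions), not the real lexicon; in the activity you would join against the afinn data frame loaded above.

```r
library(dplyr)
library(tibble)

# Toy stand-ins so the sketch runs on its own
song_words <- tibble(
  Song = c("All You Need Is Love", "All You Need Is Love", "Act Naturally"),
  word = c("need", "love", "act")
)
afinn_toy <- tibble(
  word = c("love", "abandon"),
  value = c(3, -2)  # illustrative scores only
)

# inner_join() keeps only the title words that also appear in the lexicon
song_sentiments <- song_words |>
  inner_join(afinn_toy, by = join_by(word))
song_sentiments
```

An inner join drops words with no lexicon entry. A left join would keep them with a missing value instead, which matters if you want songs with no scored words to remain in the data.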
On average, are Beatles song titles positive or negative?
Calculate a sentiment score for each song by summing the sentiment scores for the words in the song title. Then plot the distribution of these per-song sentiment scores.
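One way to sketch the per-song scores and their distribution, assuming a joined data frame with Song, word, and value columns. The toy rows, song names, and scores below are made up for illustration, and a histogram is only one reasonable choice of plot.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Toy stand-in for the joined words-plus-lexicon data frame
song_sentiments <- tibble(
  Song = c("Song A", "Song A", "Song B", "Song C"),
  word = c("love", "good", "bad", "love"),
  value = c(3, 3, -3, 3)  # illustrative scores only
)

# Sum the word scores within each song title
song_scores <- song_sentiments |>
  group_by(Song) |>
  summarise(sentiment = sum(value))
song_scores

# Histogram of the per-song scores
sentiment_plot <- ggplot(song_scores, aes(x = sentiment)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Title sentiment score", y = "Number of songs")
sentiment_plot
```

Sorting song_scores with arrange(desc(sentiment)) or arrange(sentiment) would then surface the most positive and most negative titles.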
Which songs have the most positive titles? The most negative?
If you happen to be familiar with the Beatles discography: do songs with positive titles also have positive lyrics?