Homework 7
Due: Friday, October 27, 11:00am on Canvas
Instructions:
- Download the HW 7 template, and open the template (a Quarto document) in RStudio.
- Put your name in the file header
- Click
Render
- Type all code and answers in the document (using
###
for section headings and####
for question headings) - Render early and often to catch any errors!
- When you are finished, submit the final rendered HTML to Canvas
Code guidelines:
- If a question requires code, and code is not provided, you will not receive full credit
- You will be graded on the quality of your code. In addition to being
correct, your code should also be easy to read
- No magic numbers
- Use descriptive names for your variables
- Set seeds where needed
- Comment code
- If a block of code is being called multiple times, put it in a function
Resources: In addition to the class notes and activities, I recommend reading the following resources:
- Chapter 4 and Chapter 6 in Modern Data Science with R
- Chapter 4 and Chapter 6 in R for Data Science (2nd edition)
Reshaping data in Python
The R code below creates a data frame called df1
:
df1 <- data.frame(
grp = c("A", "A", "B", "B"),
sex = c("F", "M", "F", "M"),
meanL = c(0.225, 0.47, 0.325, 0.547),
sdL = c(0.106, 0.325, 0.106, 0.308),
meanR = c(0.34, 0.57, 0.4, 0.647),
sdR = c(0.0849, 0.325, 0.0707, 0.274)
)
df1
## grp sex meanL sdL meanR sdR
## 1 A F 0.225 0.106 0.340 0.0849
## 2 A M 0.470 0.325 0.570 0.3250
## 3 B F 0.325 0.106 0.400 0.0707
## 4 B M 0.547 0.308 0.647 0.2740
Load the data into Python:
Question 1
Using melt
and pivot
, reshape
df1
to produce the following output:
## sex F M F M F M F M
## stat meanL meanL sdL sdL meanR meanR sdR sdR
## grp
## A 0.225 0.470 0.106 0.325 0.34 0.570 0.0849 0.325
## B 0.325 0.547 0.106 0.308 0.40 0.647 0.0707 0.274
Practice pivoting
The R code below creates a data frame called df2
:
df2 <- data.frame(id = rep(c(1, 2, 3), 2),
group = rep(c("T", "C"), each=3),
vals = c(4, 6, 8, 5, 6, 10))
df2
## id group vals
## 1 1 T 4
## 2 2 T 6
## 3 3 T 8
## 4 1 C 5
## 5 2 C 6
## 6 3 C 10
Import the data into Python:
Question 2
Use pivot
and assign
to reshape the data
into the following (you may need to create an intermediate data frame,
rather than doing all steps in a single chain):
## group C T diff
## id
## 1.0 5.0 4.0 -1.0
## 2.0 6.0 6.0 0.0
## 3.0 10.0 8.0 -2.0
Analyzing simulation results
Your friend ran simulations to compare three different methods for
calculating 95% confidence intervals for \(\beta_1\) in a linear regression model.
Their data is store in the file sim_results.csv
, which can
be imported into Python as follows:
The sim_results
dataset is organized as follows:
- The first column,
n
, records the sample size - For each method and sample size, your friend simulated 100 datasets
and calculated a confidence interval. The results for method 1 are
stored in columns
method1_1
,method1_2
, …,method1_100
(with similar columns for methods 2 and 3). A 1 in a column indicates that the confidence interval captured \(\beta_1\), while a 2 indicates the confidence interval did not capture \(\beta_1\).
Just like in the previous assignment, we want to plot coverage for each method as a function of sample size. The following questions will help guide you through one possible way of doing this in Python.
Question 3
Use the melt
function to reshape the
sim_results
data into the following form:
## n method result
## 0 5 method1_1 1
## 1 10 method1_1 1
## 2 20 method1_1 1
## 3 50 method1_1 1
## 4 100 method1_1 1
## ... ... ... ...
## 1495 5 method3_100 2
## 1496 10 method3_100 1
## 1497 20 method3_100 1
## 1498 50 method3_100 1
## 1499 100 method3_100 1
##
## [1500 rows x 3 columns]
Question 4
Next, use str.split
to split the method
column from Question 3 into two columns (see the slides and class
activities for examples):
## n method result iteration
## 0 5 method1 1 1
## 1 10 method1 1 1
## 2 20 method1 1 1
## 3 50 method1 1 1
## 4 100 method1 1 1
## ... ... ... ... ...
## 1495 5 method3 2 100
## 1496 10 method3 1 100
## 1497 20 method3 1 100
## 1498 50 method3 1 100
## 1499 100 method3 1 100
##
## [1500 rows x 4 columns]
Question 5
Next, use groupby
and agg
to calculate
coverage for each method, for each sample size:
## coverage
## n method
## 5 method1 0.44
## method2 0.45
## method3 0.25
## 10 method1 0.72
## method2 0.70
## method3 0.82
## 20 method1 0.84
## method2 0.80
## method3 0.93
## 50 method1 0.95
## method2 0.93
## method3 0.94
## 100 method1 0.96
## method2 0.92
## method3 0.96
Hint: In agg
, you can pass user-defined
functions. Consider a function that converts the results into 0s and 1s,
and then takes the mean.
Plotting the results
Finally, we want to plot the results! But before we do, we need to do just a bit more reformatting.
Currently, your results should look like this:
## coverage
## n method
## 5 method1 0.44
## method2 0.45
## method3 0.25
## 10 method1 0.72
## method2 0.70
## method3 0.82
## 20 method1 0.84
## method2 0.80
## method3 0.93
## 50 method1 0.95
## method2 0.93
## method3 0.94
## 100 method1 0.96
## method2 0.92
## method3 0.96
Let’s suppose this new data frame is called sim_res_new
,
and we want to extract the sample sizes. Normally we would do
sim_res_new['n']
. But see what happens when we try
that:
We get an error! Why? Because when we grouped and aggregated above, the columns that we grouped by became the index for our data frame. All pandas data frames have an index, though we haven’t used it so far.
For now, let’s get rid of n
and method
as
the index. One way is to add as_index = False
to your
groupby
call. Another way is to add
.reset_index()
at the very end of your chain.
Question 6
Modify your code so the output now looks like this:
## n method coverage
## 0 5 method1 0.44
## 1 5 method2 0.45
## 2 5 method3 0.25
## 3 10 method1 0.72
## 4 10 method2 0.70
## 5 10 method3 0.82
## 6 20 method1 0.84
## 7 20 method2 0.80
## 8 20 method3 0.93
## 9 50 method1 0.95
## 10 50 method2 0.93
## 11 50 method3 0.94
## 12 100 method1 0.96
## 13 100 method2 0.92
## 14 100 method3 0.96
Finally, we can create a plot. We will use the seaborn
and matplotlib
libraries to create a line plot like in the
previous homework.
A (very brief) intro to web scraping
So far in this course, we have worked with data that we have generated ourselves (e.g. through simulations), or that other people have collected. But sometimes, that data isn’t in a nice format for us to work with.
Sadly, a lot of data in the world doesn’t exist in an easily-downloaded CSV file. For example, suppose we want data on the outcome of basketball games in the NBA during the covid pandemic. We might use a site like BasketballReference.com:
https://www.basketball-reference.com/leagues/NBA_2021_games-december.html
There’s a “Share and Export” button on the page. But if we want to download the data for multiple months, then it would be a pain to manually download each table and import it into R.
Instead, we can download the data directly from the website! Websites
are built with Hyper Text Markup Language (HTML), and we can extract
information from the HTML directly. To do this, we will use the
rvest
package in R (which also loads the xml2
package).
Reading the html file
Run the following code in R to read the html file from the url:
library(tidyverse)
library(rvest)
december <- read_html("https://www.basketball-reference.com/leagues/NBA_2021_games-december.html")
december
## {html_document}
## <html data-version="klecko-" data-root="/home/bbr/build" lang="en" class="no-js">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="bbr">\n<div id="wrap">\n \n <div id="header" role="banner" ...
Looking at the output, it is hard to find what we’re looking for.
Fortunately, we know what we want: a table! We can use the
html_element
function to extract any tables from the file,
and then convert it to a data frame with the html_table
function:
## # A tibble: 67 × 11
## Date `Start (ET)` `Visitor/Neutral` PTS `Home/Neutral` PTS `` ``
## <chr> <chr> <chr> <int> <chr> <int> <chr> <chr>
## 1 Tue, D… 7:00p Golden State War… 99 Brooklyn Nets 125 Box … ""
## 2 Tue, D… 10:00p Los Angeles Clip… 116 Los Angeles L… 109 Box … ""
## 3 Wed, D… 7:00p Charlotte Hornets 114 Cleveland Cav… 121 Box … ""
## 4 Wed, D… 7:00p New York Knicks 107 Indiana Pacers 121 Box … ""
## 5 Wed, D… 7:00p Miami Heat 107 Orlando Magic 113 Box … ""
## 6 Wed, D… 7:00p Washington Wizar… 107 Philadelphia … 113 Box … ""
## 7 Wed, D… 7:30p New Orleans Peli… 113 Toronto Rapto… 99 Box … ""
## 8 Wed, D… 7:30p Milwaukee Bucks 121 Boston Celtics 122 Box … ""
## 9 Wed, D… 8:00p Atlanta Hawks 124 Chicago Bulls 104 Box … ""
## 10 Wed, D… 8:00p San Antonio Spurs 131 Memphis Grizz… 119 Box … ""
## # ℹ 57 more rows
## # ℹ 3 more variables: Attend. <chr>, Arena <chr>, Notes <lgl>
Now, what if we wanted to extract a different month? No problem! Just change the url:
read_html("https://www.basketball-reference.com/leagues/NBA_2021_games-january.html") |>
html_element("table") |>
html_table(header = TRUE, fill = TRUE)
## # A tibble: 222 × 11
## Date `Start (ET)` `Visitor/Neutral` PTS `Home/Neutral` PTS `` ``
## <chr> <chr> <chr> <int> <chr> <int> <chr> <chr>
## 1 Fri, J… 7:00p Memphis Grizzlies 108 Charlotte Hor… 93 Box … ""
## 2 Fri, J… 7:00p Miami Heat 83 Dallas Maveri… 93 Box … ""
## 3 Fri, J… 7:00p Boston Celtics 93 Detroit Pisto… 96 Box … ""
## 4 Fri, J… 7:30p Atlanta Hawks 114 Brooklyn Nets 96 Box … ""
## 5 Fri, J… 8:00p Chicago Bulls 96 Milwaukee Buc… 126 Box … ""
## 6 Fri, J… 8:00p Washington Wizar… 130 Minnesota Tim… 109 Box … ""
## 7 Fri, J… 8:00p Los Angeles Lake… 109 San Antonio S… 103 Box … ""
## 8 Fri, J… 9:00p Phoenix Suns 106 Denver Nuggets 103 Box … ""
## 9 Fri, J… 9:00p Los Angeles Clip… 100 Utah Jazz 106 Box … ""
## 10 Fri, J… 10:30p Portland Trail B… 123 Golden State … 98 Box … ""
## # ℹ 212 more rows
## # ℹ 3 more variables: Attend. <chr>, Arena <chr>, Notes <lgl>
Iterating over multiple months
Instead of doing each month by hand, let’s use a for
loop to iterate over multiple months.
Question 8
Create a vector, months
, containing the months from
December to July, in lower case:
## [1] "december" "january" "february" "march" "april" "may" "june"
## [8] "july"
You can do this manually, but the month
function in the
lubridate
package might be easier.
Question 9
Next, we want to make a vector containing all of the urls. This means
we need to insert each month into the url in turn. We will use the
paste0
function in R:
Run the code to create the vector of urls.
Now that we have our urls, let’s iterate.
Question 10
Use a for
loop to extract the results for each month
from BasketballReference.com, and store each resulting dataframe in a
list.
For example, your output might look like this:
## # A tibble: 67 × 11
## Date `Start (ET)` `Visitor/Neutral` PTS `Home/Neutral` PTS `` ``
## <chr> <chr> <chr> <int> <chr> <int> <chr> <chr>
## 1 Tue, D… 7:00p Golden State War… 99 Brooklyn Nets 125 Box … ""
## 2 Tue, D… 10:00p Los Angeles Clip… 116 Los Angeles L… 109 Box … ""
## 3 Wed, D… 7:00p Charlotte Hornets 114 Cleveland Cav… 121 Box … ""
## 4 Wed, D… 7:00p New York Knicks 107 Indiana Pacers 121 Box … ""
## 5 Wed, D… 7:00p Miami Heat 107 Orlando Magic 113 Box … ""
## 6 Wed, D… 7:00p Washington Wizar… 107 Philadelphia … 113 Box … ""
## 7 Wed, D… 7:30p New Orleans Peli… 113 Toronto Rapto… 99 Box … ""
## 8 Wed, D… 7:30p Milwaukee Bucks 121 Boston Celtics 122 Box … ""
## 9 Wed, D… 8:00p Atlanta Hawks 124 Chicago Bulls 104 Box … ""
## 10 Wed, D… 8:00p San Antonio Spurs 131 Memphis Grizz… 119 Box … ""
## # ℹ 57 more rows
## # ℹ 3 more variables: Attend. <chr>, Arena <chr>, Notes <lgl>
## # A tibble: 222 × 11
## Date `Start (ET)` `Visitor/Neutral` PTS `Home/Neutral` PTS `` ``
## <chr> <chr> <chr> <int> <chr> <int> <chr> <chr>
## 1 Fri, J… 7:00p Memphis Grizzlies 108 Charlotte Hor… 93 Box … ""
## 2 Fri, J… 7:00p Miami Heat 83 Dallas Maveri… 93 Box … ""
## 3 Fri, J… 7:00p Boston Celtics 93 Detroit Pisto… 96 Box … ""
## 4 Fri, J… 7:30p Atlanta Hawks 114 Brooklyn Nets 96 Box … ""
## 5 Fri, J… 8:00p Chicago Bulls 96 Milwaukee Buc… 126 Box … ""
## 6 Fri, J… 8:00p Washington Wizar… 130 Minnesota Tim… 109 Box … ""
## 7 Fri, J… 8:00p Los Angeles Lake… 109 San Antonio S… 103 Box … ""
## 8 Fri, J… 9:00p Phoenix Suns 106 Denver Nuggets 103 Box … ""
## 9 Fri, J… 9:00p Los Angeles Clip… 100 Utah Jazz 106 Box … ""
## 10 Fri, J… 10:30p Portland Trail B… 123 Golden State … 98 Box … ""
## # ℹ 212 more rows
## # ℹ 3 more variables: Attend. <chr>, Arena <chr>, Notes <lgl>
Question 11
Finally, let’s combine all the data frames in our list into a single data frame:
Now, looking at our output, we can see that there are two
columns with the name PTS
:
## [1] "Date" "Start (ET)" "Visitor/Neutral" "PTS"
## [5] "Home/Neutral" "PTS" "V1" "V2"
## [9] "Attend." "Arena" "Notes"
We will need to rename the columns to prevent issues in R:
colnames(nba) <- c("date", "start", "visitor", "visitor_pts", "home", "home_pts",
"v1", "v2", "attend", "arena", "notes")
I have kept the names of a couple of the columns as v1
and v2
for now, since we will drop those columns
anyway.
Now that we have our data, let’s start to analyze it. We would like to know whether having more fans in attendance gives the home team an advantage. However, there is another issue to fix:
## [1] "character"
Currently, attendance is a character! We need to make it a number before we can do statistics. This requires two steps: removing the commas from the attendances (R won’t understand them), then converting the attendances to a numeric vector:
Finally, let’s calculate some summary statistics and make some plots.
Question 12
Create a plot showing the relationship between attendance and the difference in scores between the home team and the visiting team (we may hypothesize that when the crowd is bigger, the home team will perform better).
Question 13
Create a plot showing how the fraction of games with 0 attendees changed over the course of the season.