Homework 7

Due: Friday, October 27, 11:00am on Canvas

Instructions:

Download the HW 7 template, and open the template (a Quarto document) in RStudio.
Put your name in the file header
Click Render
Type all code and answers in the document (using ### for section headings and #### for question headings)
Render early and often to catch any errors!
When you are finished, submit the final rendered HTML to Canvas

Code guidelines:

If a question requires code, and code is not provided, you will not receive full credit
You will be graded on the quality of your code. In addition to being correct, your code should also be easy to read
- No magic numbers
- Use descriptive names for your variables
- Set seeds where needed
- Comment code
- If a block of code is being called multiple times, put it in a function

Resources: In addition to the class notes and activities, I recommend reading the following resources:

Chapter 4 and Chapter 6 in Modern Data Science with R
Chapter 4 and Chapter 6 in R for Data Science (2nd edition)

Reshaping data in Python

The R code below creates a data frame called df1:

df1 <- data.frame(
  grp = c("A", "A", "B", "B"),
  sex = c("F", "M", "F", "M"),
  meanL = c(0.225, 0.47, 0.325, 0.547),
  sdL = c(0.106, 0.325, 0.106, 0.308),
  meanR = c(0.34, 0.57, 0.4, 0.647),
  sdR = c(0.0849, 0.325, 0.0707, 0.274)
)

df1

##   grp sex meanL   sdL meanR    sdR
## 1   A   F 0.225 0.106 0.340 0.0849
## 2   A   M 0.470 0.325 0.570 0.3250
## 3   B   F 0.325 0.106 0.400 0.0707
## 4   B   M 0.547 0.308 0.647 0.2740

Load the data into Python:

import pandas as pd

df1 = r.df1

Question 1

Using melt and pivot, reshape df1 to produce the following output:

## sex       F      M      F      M     F      M       F      M
## stat  meanL  meanL    sdL    sdL meanR  meanR     sdR    sdR
## grp                                                         
## A     0.225  0.470  0.106  0.325  0.34  0.570  0.0849  0.325
## B     0.325  0.547  0.106  0.308  0.40  0.647  0.0707  0.274

Practice pivoting

The R code below creates a data frame called df2:

df2 <- data.frame(id = rep(c(1, 2, 3), 2),
                  group = rep(c("T", "C"), each=3),
                  vals = c(4, 6, 8, 5, 6, 10))

df2

##   id group vals
## 1  1     T    4
## 2  2     T    6
## 3  3     T    8
## 4  1     C    5
## 5  2     C    6
## 6  3     C   10

Import the data into Python:

df2 = r.df2

Question 2

Use pivot and assign to reshape the data into the following (you may need to create an intermediate data frame, rather than doing all steps in a single chain):

## group     C    T  diff
## id                    
## 1.0     5.0  4.0  -1.0
## 2.0     6.0  6.0   0.0
## 3.0    10.0  8.0  -2.0

Analyzing simulation results

Your friend ran simulations to compare three different methods for calculating 95% confidence intervals for $\beta_1$ in a linear regression model. Their data is store in the file sim_results.csv, which can be imported into Python as follows:

sim_results = pd.read_csv("https://sta279-f23.github.io/homework/sim_results.csv")

The sim_results dataset is organized as follows:

The first column, n, records the sample size
For each method and sample size, your friend simulated 100 datasets and calculated a confidence interval. The results for method 1 are stored in columns method1_1, method1_2, …, method1_100 (with similar columns for methods 2 and 3). A 1 in a column indicates that the confidence interval captured $\beta_1$ , while a 2 indicates the confidence interval did not capture $\beta_1$ .

Just like in the previous assignment, we want to plot coverage for each method as a function of sample size. The following questions will help guide you through one possible way of doing this in Python.

Question 3

Use the melt function to reshape the sim_results data into the following form:

##         n       method  result
## 0       5    method1_1       1
## 1      10    method1_1       1
## 2      20    method1_1       1
## 3      50    method1_1       1
## 4     100    method1_1       1
## ...   ...          ...     ...
## 1495    5  method3_100       2
## 1496   10  method3_100       1
## 1497   20  method3_100       1
## 1498   50  method3_100       1
## 1499  100  method3_100       1
## 
## [1500 rows x 3 columns]

Question 4

Next, use str.split to split the method column from Question 3 into two columns (see the slides and class activities for examples):

##         n   method  result iteration
## 0       5  method1       1         1
## 1      10  method1       1         1
## 2      20  method1       1         1
## 3      50  method1       1         1
## 4     100  method1       1         1
## ...   ...      ...     ...       ...
## 1495    5  method3       2       100
## 1496   10  method3       1       100
## 1497   20  method3       1       100
## 1498   50  method3       1       100
## 1499  100  method3       1       100
## 
## [1500 rows x 4 columns]

Question 5

Next, use groupby and agg to calculate coverage for each method, for each sample size:

##              coverage
## n   method           
## 5   method1      0.44
##     method2      0.45
##     method3      0.25
## 10  method1      0.72
##     method2      0.70
##     method3      0.82
## 20  method1      0.84
##     method2      0.80
##     method3      0.93
## 50  method1      0.95
##     method2      0.93
##     method3      0.94
## 100 method1      0.96
##     method2      0.92
##     method3      0.96

Hint: In agg, you can pass user-defined functions. Consider a function that converts the results into 0s and 1s, and then takes the mean.

Plotting the results

Finally, we want to plot the results! But before we do, we need to do just a bit more reformatting.

Currently, your results should look like this:

##              coverage
## n   method           
## 5   method1      0.44
##     method2      0.45
##     method3      0.25
## 10  method1      0.72
##     method2      0.70
##     method3      0.82
## 20  method1      0.84
##     method2      0.80
##     method3      0.93
## 50  method1      0.95
##     method2      0.93
##     method3      0.94
## 100 method1      0.96
##     method2      0.92
##     method3      0.96

Let’s suppose this new data frame is called sim_res_new, and we want to extract the sample sizes. Normally we would do sim_res_new['n']. But see what happens when we try that:

sim_res_new['n']

We get an error! Why? Because when we grouped and aggregated above, the columns that we grouped by became the index for our data frame. All pandas data frames have an index, though we haven’t used it so far.

For now, let’s get rid of n and method as the index. One way is to add as_index = False to your groupby call. Another way is to add .reset_index() at the very end of your chain.

Question 6

Modify your code so the output now looks like this:

##       n   method  coverage
## 0     5  method1      0.44
## 1     5  method2      0.45
## 2     5  method3      0.25
## 3    10  method1      0.72
## 4    10  method2      0.70
## 5    10  method3      0.82
## 6    20  method1      0.84
## 7    20  method2      0.80
## 8    20  method3      0.93
## 9    50  method1      0.95
## 10   50  method2      0.93
## 11   50  method3      0.94
## 12  100  method1      0.96
## 13  100  method2      0.92
## 14  100  method3      0.96

Finally, we can create a plot. We will use the seaborn and matplotlib libraries to create a line plot like in the previous homework.

Question 7

Use the following code to create the plot:

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure()
sns.lineplot(x="n", y="coverage", data=sim_res_new, hue="method", marker ='o')
plt.show()

A (very brief) intro to web scraping

So far in this course, we have worked with data that we have generated ourselves (e.g. through simulations), or that other people have collected. But sometimes, that data isn’t in a nice format for us to work with.

Sadly, a lot of data in the world doesn’t exist in an easily-downloaded CSV file. For example, suppose we want data on the outcome of basketball games in the NBA during the covid pandemic. We might use a site like BasketballReference.com:

https://www.basketball-reference.com/leagues/NBA_2021_games-december.html

There’s a “Share and Export” button on the page. But if we want to download the data for multiple months, then it would be a pain to manually download each table and import it into R.

Instead, we can download the data directly from the website! Websites are built with Hyper Text Markup Language (HTML), and we can extract information from the HTML directly. To do this, we will use the rvest package in R (which also loads the xml2 package).

Reading the html file

Run the following code in R to read the html file from the url:

library(tidyverse)
library(rvest)

december <- read_html("https://www.basketball-reference.com/leagues/NBA_2021_games-december.html")
december

## {html_document}
## <html data-version="klecko-" data-root="/home/bbr/build" lang="en" class="no-js">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="bbr">\n<div id="wrap">\n  \n  <div id="header" role="banner" ...

Looking at the output, it is hard to find what we’re looking for. Fortunately, we know what we want: a table! We can use the html_element function to extract any tables from the file, and then convert it to a data frame with the html_table function:

december |>
  html_element("table") |>
  html_table(header = TRUE, fill = TRUE)

## # A tibble: 67 × 11
##    Date    `Start (ET)` `Visitor/Neutral`   PTS `Home/Neutral`   PTS ``    ``   
##    <chr>   <chr>        <chr>             <int> <chr>          <int> <chr> <chr>
##  1 Tue, D… 7:00p        Golden State War…    99 Brooklyn Nets    125 Box … ""   
##  2 Tue, D… 10:00p       Los Angeles Clip…   116 Los Angeles L…   109 Box … ""   
##  3 Wed, D… 7:00p        Charlotte Hornets   114 Cleveland Cav…   121 Box … ""   
##  4 Wed, D… 7:00p        New York Knicks     107 Indiana Pacers   121 Box … ""   
##  5 Wed, D… 7:00p        Miami Heat          107 Orlando Magic    113 Box … ""   
##  6 Wed, D… 7:00p        Washington Wizar…   107 Philadelphia …   113 Box … ""   
##  7 Wed, D… 7:30p        New Orleans Peli…   113 Toronto Rapto…    99 Box … ""   
##  8 Wed, D… 7:30p        Milwaukee Bucks     121 Boston Celtics   122 Box … ""   
##  9 Wed, D… 8:00p        Atlanta Hawks       124 Chicago Bulls    104 Box … ""   
## 10 Wed, D… 8:00p        San Antonio Spurs   131 Memphis Grizz…   119 Box … ""   
## # ℹ 57 more rows
## # ℹ 3 more variables: Attend. <chr>, Arena <chr>, Notes <lgl>

Now, what if we wanted to extract a different month? No problem! Just change the url:

read_html("https://www.basketball-reference.com/leagues/NBA_2021_games-january.html") |>
  html_element("table") |>
  html_table(header = TRUE, fill = TRUE)

## # A tibble: 222 × 11
##    Date    `Start (ET)` `Visitor/Neutral`   PTS `Home/Neutral`   PTS ``    ``   
##    <chr>   <chr>        <chr>             <int> <chr>          <int> <chr> <chr>
##  1 Fri, J… 7:00p        Memphis Grizzlies   108 Charlotte Hor…    93 Box … ""   
##  2 Fri, J… 7:00p        Miami Heat           83 Dallas Maveri…    93 Box … ""   
##  3 Fri, J… 7:00p        Boston Celtics       93 Detroit Pisto…    96 Box … ""   
##  4 Fri, J… 7:30p        Atlanta Hawks       114 Brooklyn Nets     96 Box … ""   
##  5 Fri, J… 8:00p        Chicago Bulls        96 Milwaukee Buc…   126 Box … ""   
##  6 Fri, J… 8:00p        Washington Wizar…   130 Minnesota Tim…   109 Box … ""   
##  7 Fri, J… 8:00p        Los Angeles Lake…   109 San Antonio S…   103 Box … ""   
##  8 Fri, J… 9:00p        Phoenix Suns        106 Denver Nuggets   103 Box … ""   
##  9 Fri, J… 9:00p        Los Angeles Clip…   100 Utah Jazz        106 Box … ""   
## 10 Fri, J… 10:30p       Portland Trail B…   123 Golden State …    98 Box … ""   
## # ℹ 212 more rows
## # ℹ 3 more variables: Attend. <chr>, Arena <chr>, Notes <lgl>

Iterating over multiple months

Instead of doing each month by hand, let’s use a for loop to iterate over multiple months.

Question 8

Create a vector, months, containing the months from December to July, in lower case:

months

## [1] "december" "january"  "february" "march"    "april"    "may"      "june"    
## [8] "july"

You can do this manually, but the month function in the lubridate package might be easier.

Question 9

Next, we want to make a vector containing all of the urls. This means we need to insert each month into the url in turn. We will use the paste0 function in R:

urls <- paste0("https://www.basketball-reference.com/leagues/NBA_2021_games-",
               months, ".html")

Run the code to create the vector of urls.

Now that we have our urls, let’s iterate.

Question 10

Use a for loop to extract the results for each month from BasketballReference.com, and store each resulting dataframe in a list.

For example, your output might look like this:

df_list[[1]]

## # A tibble: 67 × 11
##    Date    `Start (ET)` `Visitor/Neutral`   PTS `Home/Neutral`   PTS ``    ``   
##    <chr>   <chr>        <chr>             <int> <chr>          <int> <chr> <chr>
##  1 Tue, D… 7:00p        Golden State War…    99 Brooklyn Nets    125 Box … ""   
##  2 Tue, D… 10:00p       Los Angeles Clip…   116 Los Angeles L…   109 Box … ""   
##  3 Wed, D… 7:00p        Charlotte Hornets   114 Cleveland Cav…   121 Box … ""   
##  4 Wed, D… 7:00p        New York Knicks     107 Indiana Pacers   121 Box … ""   
##  5 Wed, D… 7:00p        Miami Heat          107 Orlando Magic    113 Box … ""   
##  6 Wed, D… 7:00p        Washington Wizar…   107 Philadelphia …   113 Box … ""   
##  7 Wed, D… 7:30p        New Orleans Peli…   113 Toronto Rapto…    99 Box … ""   
##  8 Wed, D… 7:30p        Milwaukee Bucks     121 Boston Celtics   122 Box … ""   
##  9 Wed, D… 8:00p        Atlanta Hawks       124 Chicago Bulls    104 Box … ""   
## 10 Wed, D… 8:00p        San Antonio Spurs   131 Memphis Grizz…   119 Box … ""   
## # ℹ 57 more rows
## # ℹ 3 more variables: Attend. <chr>, Arena <chr>, Notes <lgl>

df_list[[2]]

## # A tibble: 222 × 11
##    Date    `Start (ET)` `Visitor/Neutral`   PTS `Home/Neutral`   PTS ``    ``   
##    <chr>   <chr>        <chr>             <int> <chr>          <int> <chr> <chr>
##  1 Fri, J… 7:00p        Memphis Grizzlies   108 Charlotte Hor…    93 Box … ""   
##  2 Fri, J… 7:00p        Miami Heat           83 Dallas Maveri…    93 Box … ""   
##  3 Fri, J… 7:00p        Boston Celtics       93 Detroit Pisto…    96 Box … ""   
##  4 Fri, J… 7:30p        Atlanta Hawks       114 Brooklyn Nets     96 Box … ""   
##  5 Fri, J… 8:00p        Chicago Bulls        96 Milwaukee Buc…   126 Box … ""   
##  6 Fri, J… 8:00p        Washington Wizar…   130 Minnesota Tim…   109 Box … ""   
##  7 Fri, J… 8:00p        Los Angeles Lake…   109 San Antonio S…   103 Box … ""   
##  8 Fri, J… 9:00p        Phoenix Suns        106 Denver Nuggets   103 Box … ""   
##  9 Fri, J… 9:00p        Los Angeles Clip…   100 Utah Jazz        106 Box … ""   
## 10 Fri, J… 10:30p       Portland Trail B…   123 Golden State …    98 Box … ""   
## # ℹ 212 more rows
## # ℹ 3 more variables: Attend. <chr>, Arena <chr>, Notes <lgl>

Question 11

Finally, let’s combine all the data frames in our list into a single data frame:

library(data.table)

nba <- rbindlist(df_list)

Now, looking at our output, we can see that there are two columns with the name PTS:

colnames(nba)

##  [1] "Date"            "Start (ET)"      "Visitor/Neutral" "PTS"            
##  [5] "Home/Neutral"    "PTS"             "V1"              "V2"             
##  [9] "Attend."         "Arena"           "Notes"

We will need to rename the columns to prevent issues in R:

colnames(nba) <- c("date", "start", "visitor", "visitor_pts", "home", "home_pts",
                   "v1", "v2", "attend", "arena", "notes")

I have kept the names of a couple of the columns as v1 and v2 for now, since we will drop those columns anyway.

Now that we have our data, let’s start to analyze it. We would like to know whether having more fans in attendance gives the home team an advantage. However, there is another issue to fix:

class(nba$attend)

## [1] "character"

Currently, attendance is a character! We need to make it a number before we can do statistics. This requires two steps: removing the commas from the attendances (R won’t understand them), then converting the attendances to a numeric vector:

nba <- nba |>
  mutate(attend = as.numeric(gsub(",", "", attend)))

Finally, let’s calculate some summary statistics and make some plots.

Question 12

Create a plot showing the relationship between attendance and the difference in scores between the home team and the visiting team (we may hypothesize that when the crowd is bigger, the home team will perform better).

Question 13

Create a plot showing how the fraction of games with 0 attendees changed over the course of the season.