Homework 6

Due: Friday, October 20, 11:00am on Canvas

Instructions:

Download the HW 6 template, and open the template (a Quarto document) in RStudio.
Put your name in the file header
Click Render
Type all code and answers in the document (using ### for section headings and #### for question headings)
Render early and often to catch any errors!
When you are finished, submit the final rendered HTML to Canvas

Code guidelines:

If a question requires code, and code is not provided, you will not receive full credit
You will be graded on the quality of your code. In addition to being correct, your code should also be easy to read
- No magic numbers
- Use descriptive names for your variables
- Set seeds where needed
- Comment code
- If a block of code is being called multiple times, put it in a function

Resources: In addition to the class notes and activities, I recommend reading the following resources:

Chapter 4 and Chapter 6 in Modern Data Science with R
Chapter 4 and Chapter 6 in R for Data Science (2nd edition)

Baseball data

The Teams data in the Lahman package contains information on professional baseball teams since 1871.

Question 1

Using the Teams data, create a plot of the number of home runs scored (HR) and allowed (HRA) by the Chicago Cubs (teamID CHN) in each season. Your plot should look like close to this:

You may use whichever R functions you like to create the plot, but the axes and legend should be labeled as in the plot above.

Practice pivoting

Question 2

The code below creates a data frame called df1:

df1 <- data.frame(
  grp = c("A", "A", "B", "B"),
  sex = c("F", "M", "F", "M"),
  meanL = c(0.225, 0.47, 0.325, 0.547),
  sdL = c(0.106, 0.325, 0.106, 0.308),
  meanR = c(0.34, 0.57, 0.4, 0.647),
  sdR = c(0.0849, 0.325, 0.0707, 0.274)
)

df1

##   grp sex meanL   sdL meanR    sdR
## 1   A   F 0.225 0.106 0.340 0.0849
## 2   A   M 0.470 0.325 0.570 0.3250
## 3   B   F 0.325 0.106 0.400 0.0707
## 4   B   M 0.547 0.308 0.647 0.2740

Using pivot_longer and pivot_wider, reshape df1 to produce the following output:

## # A tibble: 2 × 9
##   grp   F.meanL F.sdL F.meanR  F.sdR M.meanL M.sdL M.meanR M.sdR
##   <chr>   <dbl> <dbl>   <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 A       0.225 0.106    0.34 0.0849   0.47  0.325   0.57  0.325
## 2 B       0.325 0.106    0.4  0.0707   0.547 0.308   0.647 0.274

Remember that the ? provides documentation in R. For example, running ?pivot_wider in the console gives helpful information about the pivot_wider function.

Practice pivoting

The code below creates a data frame called df2:

df2 <- data.frame(id = rep(c(1, 2, 3), 2),
                  group = rep(c("T", "C"), each=3),
                  vals = c(4, 6, 8, 5, 6, 10))

df2

##   id group vals
## 1  1     T    4
## 2  2     T    6
## 3  3     T    8
## 4  1     C    5
## 5  2     C    6
## 6  3     C   10

An analyst wants to calculate the pairwise differences between the Treatment (T) and Control (C) values for each individual in this dataset. They use the following code:

Treat <- filter(df2, group == "T")
Control <- filter(df2, group == "C")
all <- mutate(Treat, diff = Treat$vals - Control$vals)
all

##   id group vals diff
## 1  1     T    4   -1
## 2  2     T    6    0
## 3  3     T    8   -2

Question 3

Verify that this code works for this example and generates the correct values of -1, 0, and -2. Describe two problems that might arise if the data set is not sorted in a particular order or if one of the observations is missing for one of the subjects.

Question 4

Provide an alternative approach to generate the diff variable, using group_by and summarize to produce the following output:

## # A tibble: 3 × 2
##      id  diff
##   <dbl> <dbl>
## 1     1    -1
## 2     2     0
## 3     3    -2

Question 5

Provide an alternative approach to generate the diff variable that uses pivot_wider and mutate to produce the following output:

## # A tibble: 3 × 4
##      id     T     C  diff
##   <dbl> <dbl> <dbl> <dbl>
## 1     1     4     5    -1
## 2     2     6     6     0
## 3     3     8    10    -2

Analyzing simulation results

Your friend ran simulations to compare three different methods for calculating 95% confidence intervals for \(\beta_1\) in a linear regression model. Their data is store in the file sim_results.csv, which can be imported into R as follows:

sim_results <- read.csv("https://sta279-f23.github.io/homework/sim_results.csv")

The sim_results dataset is organized as follows:

The first column, n, records the sample size
For each method and sample size, your friend simulated 100 datasets and calculated a confidence interval. The results for method 1 are stored in columns method1_1, method1_2, …, method1_100 (with similar columns for methods 2 and 3). A 1 in a column indicates that the confidence interval captured \(\beta_1\), while a 2 indicates the confidence interval did not capture \(\beta_1\).

Question 6

Plot the observed coverage for each method (the fraction of times the confidence interval contained \(\beta_1\)) as a function of sample size. Your final plot should look as close as possible to this:

You will likely need to use some combination of the following functions: pivot_longer, mutate, select, across, c_across, rowwise

Fitting linear regression models

The following code generates data \(X_i \sim N(0, 1)\), \(\varepsilon_i \sim N(0, 1)\), and

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

with \(\beta_0 = \beta_1 = 1\).

n <- 100
x <- rnorm(n)
noise <- rnorm(n)
y <- 1 + x + noise

To estimate \(\beta_0\) and \(\beta_1\), we can use the lm function:

lm(y ~ x)

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      1.0694       0.9578

How are these estimated coefficients calculated? Mathematically, the vector of estimated coefficients \(\widehat{\beta}\) is given by

\[\widehat{\beta} = (X^T X)^{-1} X^T Y\] where \(X\) is the design matrix and \(Y\) is the vector of observed responses. For the model above,

\[\widehat{\beta} = \begin{bmatrix} \widehat{\beta}_0 \\ \widehat{\beta}_1 \end{bmatrix} \hspace{0.5cm} X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \hspace{0.5cm} Y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}\]

How would we do this math in R? Since \(X\) is a matrix, and \(\widehat{\beta}\) and \(Y\) are both vectors, we want to work with matrices and vectors in R!

The vector y generated above already represents \(Y\) (by default, vectors in R are column vectors)
We need to create the design matrix \(X\), by adding a column of 1s to the x simulated above. The following code creates \(X\):

X_mat <- cbind(1, x)

Matrix multiplication in R is done with %*%. For example, if A and B are conformal matrices/vectors, then \(AB\) would be calculated with A %*% B
Transposes are calculated with t(). E.g., \(A^T\) is given by t(A)
Matrix inverses are calculated with solve(). E.g., \(A^{-1}\) is given by solve(A)

Question 7

Using this information, write a function called my_lm which will calculate the estimated coefficients \(\widehat{\beta}\) given a vector of responses y and a design matrix X_mat.