Homework 6

Due: Friday, October 20, 11:00am on Canvas

Instructions:

  • Download the HW 6 template, and open the template (a Quarto document) in RStudio.
  • Put your name in the file header
  • Click Render
  • Type all code and answers in the document (using ### for section headings and #### for question headings)
  • Render early and often to catch any errors!
  • When you are finished, submit the final rendered HTML to Canvas

Code guidelines:

  • If a question requires code, and code is not provided, you will not receive full credit
  • You will be graded on the quality of your code. In addition to being correct, your code should also be easy to read
    • No magic numbers
    • Use descriptive names for your variables
    • Set seeds where needed
    • Comment code
    • If a block of code is being called multiple times, put it in a function

Resources: In addition to the class notes and activities, I recommend reading the following resources:

Baseball data

The Teams data in the Lahman package contains information on professional baseball teams since 1871.

Question 1

Using the Teams data, create a plot of the number of home runs scored (HR) and allowed (HRA) by the Chicago Cubs (teamID CHN) in each season. Your plot should look like close to this:

You may use whichever R functions you like to create the plot, but the axes and legend should be labeled as in the plot above.

Practice pivoting

Question 2

The code below creates a data frame called df1:

df1 <- data.frame(
  grp = c("A", "A", "B", "B"),
  sex = c("F", "M", "F", "M"),
  meanL = c(0.225, 0.47, 0.325, 0.547),
  sdL = c(0.106, 0.325, 0.106, 0.308),
  meanR = c(0.34, 0.57, 0.4, 0.647),
  sdR = c(0.0849, 0.325, 0.0707, 0.274)
)

df1
##   grp sex meanL   sdL meanR    sdR
## 1   A   F 0.225 0.106 0.340 0.0849
## 2   A   M 0.470 0.325 0.570 0.3250
## 3   B   F 0.325 0.106 0.400 0.0707
## 4   B   M 0.547 0.308 0.647 0.2740

Using pivot_longer and pivot_wider, reshape df1 to produce the following output:

## # A tibble: 2 × 9
##   grp   F.meanL F.sdL F.meanR  F.sdR M.meanL M.sdL M.meanR M.sdR
##   <chr>   <dbl> <dbl>   <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 A       0.225 0.106    0.34 0.0849   0.47  0.325   0.57  0.325
## 2 B       0.325 0.106    0.4  0.0707   0.547 0.308   0.647 0.274

Remember that the ? provides documentation in R. For example, running ?pivot_wider in the console gives helpful information about the pivot_wider function.

Practice pivoting

The code below creates a data frame called df2:

df2 <- data.frame(id = rep(c(1, 2, 3), 2),
                  group = rep(c("T", "C"), each=3),
                  vals = c(4, 6, 8, 5, 6, 10))

df2
##   id group vals
## 1  1     T    4
## 2  2     T    6
## 3  3     T    8
## 4  1     C    5
## 5  2     C    6
## 6  3     C   10

An analyst wants to calculate the pairwise differences between the Treatment (T) and Control (C) values for each individual in this dataset. They use the following code:

Treat <- filter(df2, group == "T")
Control <- filter(df2, group == "C")
all <- mutate(Treat, diff = Treat$vals - Control$vals)
all
##   id group vals diff
## 1  1     T    4   -1
## 2  2     T    6    0
## 3  3     T    8   -2

Question 3

Verify that this code works for this example and generates the correct values of -1, 0, and -2. Describe two problems that might arise if the data set is not sorted in a particular order or if one of the observations is missing for one of the subjects.

Question 4

Provide an alternative approach to generate the diff variable, using group_by and summarize to produce the following output:

## # A tibble: 3 × 2
##      id  diff
##   <dbl> <dbl>
## 1     1    -1
## 2     2     0
## 3     3    -2

Question 5

Provide an alternative approach to generate the diff variable that uses pivot_wider and mutate to produce the following output:

## # A tibble: 3 × 4
##      id     T     C  diff
##   <dbl> <dbl> <dbl> <dbl>
## 1     1     4     5    -1
## 2     2     6     6     0
## 3     3     8    10    -2

Analyzing simulation results

Your friend ran simulations to compare three different methods for calculating 95% confidence intervals for \(\beta_1\) in a linear regression model. Their data is store in the file sim_results.csv, which can be imported into R as follows:

sim_results <- read.csv("https://sta279-f23.github.io/homework/sim_results.csv")

The sim_results dataset is organized as follows:

  • The first column, n, records the sample size
  • For each method and sample size, your friend simulated 100 datasets and calculated a confidence interval. The results for method 1 are stored in columns method1_1, method1_2, …, method1_100 (with similar columns for methods 2 and 3). A 1 in a column indicates that the confidence interval captured \(\beta_1\), while a 2 indicates the confidence interval did not capture \(\beta_1\).

Question 6

Plot the observed coverage for each method (the fraction of times the confidence interval contained \(\beta_1\)) as a function of sample size. Your final plot should look as close as possible to this:

You will likely need to use some combination of the following functions: pivot_longer, mutate, select, across, c_across, rowwise

Fitting linear regression models

The following code generates data \(X_i \sim N(0, 1)\), \(\varepsilon_i \sim N(0, 1)\), and

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

with \(\beta_0 = \beta_1 = 1\).

n <- 100
x <- rnorm(n)
noise <- rnorm(n)
y <- 1 + x + noise

To estimate \(\beta_0\) and \(\beta_1\), we can use the lm function:

lm(y ~ x)
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      1.0694       0.9578

How are these estimated coefficients calculated? Mathematically, the vector of estimated coefficients \(\widehat{\beta}\) is given by

\[\widehat{\beta} = (X^T X)^{-1} X^T Y\] where \(X\) is the design matrix and \(Y\) is the vector of observed responses. For the model above,

\[\widehat{\beta} = \begin{bmatrix} \widehat{\beta}_0 \\ \widehat{\beta}_1 \end{bmatrix} \hspace{0.5cm} X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \hspace{0.5cm} Y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}\]

How would we do this math in R? Since \(X\) is a matrix, and \(\widehat{\beta}\) and \(Y\) are both vectors, we want to work with matrices and vectors in R!

  • The vector y generated above already represents \(Y\) (by default, vectors in R are column vectors)
  • We need to create the design matrix \(X\), by adding a column of 1s to the x simulated above. The following code creates \(X\):
X_mat <- cbind(1, x)
  • Matrix multiplication in R is done with %*%. For example, if A and B are conformal matrices/vectors, then \(AB\) would be calculated with A %*% B
  • Transposes are calculated with t(). E.g., \(A^T\) is given by t(A)
  • Matrix inverses are calculated with solve(). E.g., \(A^{-1}\) is given by solve(A)

Question 7

Using this information, write a function called my_lm which will calculate the estimated coefficients \(\widehat{\beta}\) given a vector of responses y and a design matrix X_mat.