Homework 3

Due: Friday, September 22, 11:00am on Canvas

Instructions:

  • Download the HW 3 template, and open the template (a Quarto document) in RStudio.
  • Put your name in the file header
  • Click Render
  • Type all code and answers in the document (using ### for section headings and #### for question headings)
  • Render early and often to catch any errors!
  • When you are finished, submit the final rendered HTML to Canvas

Code guidelines:

  • If a question requires code, and code is not provided, you will not receive full credit
  • You will be graded on the quality of your code. In addition to being correct, your code should also be easy to read
    • No magic numbers
    • Use descriptive names for your variables
    • Set seeds where needed
    • Comment code
    • If a block of code is being called multiple times, put it in a function

Resources: In addition to the class notes and activities, I recommend reading the following resources:

More practice with functions

Neural networks are a way to learn complex prediction models. Fundamentally, a neural network works by passing input data through a series of nodes; the output of one layer of nodes is the input for the next layer. Each time the data goes through a node, an activation function is applied to transform the output (this allows the network to model nonlinear relationships).

Common activation functions include the ReLU (rectified linear unit):

\[f(x) = \begin{cases} x & x > 0 \\ 0 & x \leq 0 \end{cases}\]

and the leaky ReLU, with parameter \(a\):

\[f_a(x) = \begin{cases} x & x > 0 \\ a \cdot x & x \leq 0 \end{cases}\]

Indeed, the ReLU could be considered a special case of the leaky ReLU with \(a = 0\).

Here is an implementation of the ReLU function in R:

relu <- function(x){
  if(x > 0){
    return(x)  # positive inputs are returned unchanged
  } else {
    return(0)  # non-positive inputs are mapped to 0
  }
}

This relu function works for single inputs:

relu(1)
## [1] 1
relu(-1)
## [1] 0

However, it does not work for vectors of length greater than 1:

relu(c(1, -1))
## Error in if (x > 0) {: the condition has length > 1

The issue here is that the condition if(x > 0) in the if...else... statement is not vectorized. That is, R expects a single TRUE or FALSE, not a vector of logical values. As in HW 2, to vectorize this function we can use the ifelse function (which IS vectorized).
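As a quick illustration of the syntax on a toy vector (this is just a demonstration, not the solution to Question 1), ifelse applies the test to each element and returns one result per element:

# the test c(5, -2, 0) > 0 is evaluated elementwise
ifelse(c(5, -2, 0) > 0, "positive", "not positive")
## [1] "positive"     "not positive" "not positive"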

Question 1

Re-write the relu function above, using the ifelse function, so that relu can be applied to vectors.

Question 2

Adapt your relu function from Question 1 to create a new function, leaky_relu, which takes TWO inputs, \(x\) and \(a\), and returns \(f_a(x)\) as defined above. Make \(a = 0\) the default value.
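If you have not used default argument values before, here is a small unrelated example (the power function below is purely illustrative):

# the second argument p defaults to 2 when the caller does not supply it
power <- function(x, p = 2){
  return(x^p)
}

power(3)         # uses the default, p = 2
## [1] 9
power(3, p = 3)  # overrides the default
## [1] 27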

Practice with lists

In class, we learned about lists as a hierarchical structure for storing data. In these questions, we will practice more with lists and list indexing.
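As a refresher, here is a minimal toy example of building and indexing a nested list (the values are arbitrary):

# a nested list: the second element is itself a list
my_list <- list("a", list(42, "b"))

my_list[[1]]       # first element of the outer list
## [1] "a"
my_list[[2]][[1]]  # first element of the inner list
## [1] 42

Remember that lists can hold any R objects, including functions, which is exactly what Question 3 asks you to store.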

Question 3

Create a list, x, such that:

  • x[[2]][[1]] stores the function rnorm. So, x[[2]][[1]](3) should sample 3 numbers from a \(N(0, 1)\) distribution.
  • x[[1]] stores the function rexp. So, x[[1]](3) should sample 3 numbers from an Exponential distribution with rate 1.
  • x[[3]][[1]] stores an anonymous function which samples from a \(\chi^2_2\) distribution. So, x[[3]][[1]](3) should sample 3 numbers from a \(\chi^2_2\) distribution.

Logistic regression

As you learned in STA 112, linear regression is used with quantitative responses, whereas logistic regression is used for binary responses (\(Y_i = 0\) or \(1\)). In particular, if \(\pi_i = P(Y_i = 1)\), then a simple logistic regression model (with explanatory variable \(X_i\)) is \[\log \left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1 X_i\]
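Equivalently, solving the model equation for \(\pi_i\) expresses the probability of success directly in terms of \(X_i\): \[\pi_i = \frac{e^{\beta_0 + \beta_1 X_i}}{1 + e^{\beta_0 + \beta_1 X_i}}\] These predicted probabilities are what R reports as the fitted values of a logistic regression model.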

As an example, consider a dataset of 3402 pitches thrown by MLB pitcher Clayton Kershaw in the 2013 season. The data is contained in the Kershaw data set, in the Stat2Data R package. We will focus on two specific variables for each pitch:

  • Result: a negative result (a ball or a hit), or a positive result (a strike or an out)
  • EndSpeed: the speed at which the ball crossed home plate (in mph)

Our goal is to investigate the relationship between pitch speed and result. We can fit a logistic regression model, with EndSpeed as the explanatory variable and Result as the response:

library(tidyverse)
library(Stat2Data)
data("Kershaw")

log_reg <- glm(Result ~ EndSpeed, family = binomial, data = Kershaw)
summary(log_reg)
## 
## Call:
## glm(formula = Result ~ EndSpeed, family = binomial, data = Kershaw)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.432279   0.488616  -2.931 0.003375 ** 
## EndSpeed     0.023360   0.006019   3.881 0.000104 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4540.4  on 3401  degrees of freedom
## Residual deviance: 4525.4  on 3400  degrees of freedom
## AIC: 4529.4
## 
## Number of Fisher Scoring iterations: 4

Here, log_reg is the resulting logistic regression model in R. What type of object is log_reg?

Question 4

Use the typeof function to determine what type of object log_reg is.

Question 5

Use the length function to determine how many components are in the top level of log_reg. (There’s a lot!)

Question 6

Let’s look at the first component: use log_reg[[1]] to see what it contains. What is stored here?

How were we meant to know that the first component of log_reg contained the coefficients? We probably wouldn’t! Fortunately, R offers the option to name the components of a list, which makes it easier to access them without remembering all their indices. To get the coefficients, for example, we can run the following:

log_reg[["coefficients"]]
## (Intercept)    EndSpeed 
## -1.43227907  0.02336011

Another way of accessing named entries in a list is with the $:

log_reg$coefficients
## (Intercept)    EndSpeed 
## -1.43227907  0.02336011

We can see all the names for a list with the names function:

names(log_reg)
##  [1] "coefficients"      "residuals"         "fitted.values"    
##  [4] "effects"           "R"                 "rank"             
##  [7] "qr"                "family"            "linear.predictors"
## [10] "deviance"          "aic"               "null.deviance"    
## [13] "iter"              "weights"           "prior.weights"    
## [16] "df.residual"       "df.null"           "y"                
## [19] "converged"         "boundary"          "model"            
## [22] "call"              "formula"           "terms"            
## [25] "data"              "offset"            "control"          
## [28] "method"            "contrasts"         "xlevels"

This makes it easy to see what information the list contains!

Question 7

How would you extract the AIC for your logistic regression model, without using the summary function? Write the code here.

Logistic regression diagnostics

So far, so good! But how do we assess the assumptions for this logistic regression model? The usual residual plots (like we would make for linear regression) are not helpful:

# raw residuals: observed response (0 or 1) minus fitted probability
data.frame(fitted = log_reg$fitted.values,
           residuals = log_reg$y - log_reg$fitted.values) |>
  ggplot(aes(x = fitted, y = residuals)) +
  geom_abline(slope = 0, intercept = 0, color = "blue") +
  geom_point() +
  labs(x = "Fitted values", y = "Residuals") +
  theme_bw()

Instead of the raw residuals \(Y_i - \widehat{\pi}_i\), it is common in logistic regression to use randomized quantile residuals, defined as follows:

\[r_{Q}(Y_i, \widehat{\pi}_i) = \Phi^{-1}(u) \hspace{1cm} u \sim \begin{cases} Uniform(1 - \widehat{\pi}_i, 1) & Y_i = 1 \\ Uniform(0, 1 - \widehat{\pi}_i) & Y_i = 0, \end{cases}\]

In other words:

  • If \(Y_i = 1\), draw a random number \(u\) from a \(Uniform(1 - \widehat{\pi}_i, 1)\) distribution
  • If \(Y_i = 0\), draw a random number \(u\) from a \(Uniform(0, 1 - \widehat{\pi}_i)\) distribution
  • Then return \(\Phi^{-1}(u)\), where \(\Phi^{-1}\) is the quantile function for a standard normal distribution

How do we do this? Some hints:

  • log_reg$fitted.values will return a vector of \(\widehat{\pi}_i\)’s for the observed data
  • log_reg$y will return a vector of \(Y_i\)’s for the observed data
  • runif(...) will sample from a uniform distribution (see the example after this list)
  • ifelse can be used to handle conditional statements
  • In R, \(\Phi^{-1}(u)\) would be calculated with qnorm(u)
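One detail that may help with a vectorized approach (the snippet below is only an illustration, not part of the solution): the min and max arguments of runif are recycled to the number of draws, so each draw can come from its own interval in a single call:

set.seed(33)  # seed chosen arbitrarily, for reproducibility
# three draws, from the intervals (0, 0.1), (0.5, 0.6), and (0.9, 1)
runif(3, min = c(0, 0.5, 0.9), max = c(0.1, 0.6, 1))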

Question 8

Using these hints, write a function quant_resid that calculates randomized quantile residuals for a logistic regression model. Your function should take as input a logistic regression model (output of a call to glm in R), and return a vector of randomized quantile residuals (one residual for each observation in the data).

You may use a for loop, or use vectorized functions in R. You may not use any existing quantile residual functions from R packages.

Quantile residuals, continued: plotting

Above, you wrote a function quant_resid to generate randomized quantile residuals for a logistic regression model. Next, you will use your quant_resid function to write a new function, which produces a quantile residual plot for a logistic regression model.

Question 9

Write a function, quant_resid_plot, which takes as input a fitted logistic regression model (the output from glm in R), and produces a quantile residual plot.

  • The quantile residuals should be plotted on the y-axis
  • The fitted values (predicted probabilities) from the logistic regression model should be plotted on the x-axis
  • You should add a horizontal line at \(y = 0\)
  • The axes should be labeled appropriately
  • You may use either base R or ggplot for making the plot

Here is sample output for the Kershaw data (note: these are randomized quantile residuals, so it is ok if your plot looks slightly different):

library(tidyverse)
library(Stat2Data)
data("Kershaw")

log_reg <- glm(Result ~ EndSpeed, family = binomial, data = Kershaw)

quant_resid_plot(log_reg)

Interpreting the quantile residual plot is similar to interpreting residual plots for linear regression. If the logistic regression assumptions are satisfied, the randomized quantile residuals behave like draws from a standard normal distribution, so they should be randomly scattered around the horizontal line at 0, with no clear pattern.