Homework 3
Due: Friday, September 22, 11:00am on Canvas
Instructions:
- Download the HW 3 template, and open the template (a Quarto document) in RStudio.
- Put your name in the file header
- Click Render
- Type all code and answers in the document (using ### for section headings and #### for question headings)
- Render early and often to catch any errors!
- When you are finished, submit the final rendered HTML to Canvas
Code guidelines:
- If a question requires code, and code is not provided, you will not receive full credit
- You will be graded on the quality of your code. In addition to being
correct, your code should also be easy to read
- No magic numbers
- Use descriptive names for your variables
- Set seeds where needed
- Comment code
- If a block of code is being called multiple times, put it in a function
Resources: In addition to the class notes and activities, I recommend reading the following resources:
- Chapter 5 (loops and choices) in Advanced R
- Appendix C.2 in Modern Data Science with R
- Chapter 19.1 – 19.5.2 (functions and conditions) in R for Data Science (1st edition)
- Chapter 20 (vectors and lists) in R for Data Science (1st edition)
- Vectors and lists in the purrr tutorial
More practice with functions
Neural networks are a way to learn complex prediction models. Fundamentally, a neural network works by passing input data through a series of nodes; the output of one layer of nodes is the input for the next layer. Each time the data goes through a node, an activation function is applied to transform the output (this allows the network to model nonlinear relationships).
Common activation functions include the ReLU (rectified linear unit):
\[f(x) = \begin{cases} x & x > 0 \\ 0 & x \leq 0 \end{cases}\] and the leaky ReLU, with parameter \(a\):
\[f_a(x) = \begin{cases} x & x > 0 \\ a \cdot x & x \leq 0 \end{cases}\] Note that the ReLU is the special case of the leaky ReLU with \(a = 0\).
Here is an implementation of the ReLU function in R:
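One way to write it is with a scalar if...else (the form suggested by the error message further below):

relu <- function(x) {
  # return x for positive inputs, and 0 otherwise
  if (x > 0) {
    x
  } else {
    0
  }
}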
This relu function works for single inputs:
## [1] 1
## [1] 0
However, it does not work for vectors of length greater than 1:
## Error in if (x > 0) {: the condition has length > 1
The issue here is that the condition if (x > 0) in the if...else... statement is not vectorized. That is, R is expecting a single TRUE or FALSE, not a vector. As in HW 2, to vectorize this function we can use the ifelse function (which IS vectorized).
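As a quick illustration (the vector below is made up), ifelse checks the condition for each element and returns a result of the same length:

x <- c(-2, 0, 3)
ifelse(x > 0, "positive", "not positive")  # returns "not positive" "not positive" "positive"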
Question 1
Re-write the relu function above, using the ifelse function, so that relu can be applied to vectors.
Question 2
Adapt your relu function from Question 1 to create a new function, leaky_relu, which takes TWO inputs, \(x\) and \(a\), and returns \(f_a(x)\) as defined above. Make \(a = 0\) the default value.
Practice with lists
In class, we learned about lists as a hierarchical structure for storing data. In these questions, we will practice more with lists and list indexing.
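For example (the list below is purely illustrative), double brackets extract a single element, and the indexing can be chained for nested lists:

my_list <- list("a", list(1:3, "b"))  # the second element is itself a list
my_list[[2]][[1]]                     # extracts the vector 1:3 from the nested list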
Question 3
Create a list, x, such that:
- x[[2]][[1]] stores the function rnorm. So, x[[2]][[1]](3) should sample 3 numbers from a \(N(0, 1)\) distribution.
- x[[1]] stores the function rexp. So, x[[1]](3) should sample 3 numbers from an Exponential distribution with rate 1.
- x[[3]][[1]] stores an anonymous function which samples from a \(\chi^2_2\) distribution. So, x[[3]][[1]](3) should sample 3 numbers from a \(\chi^2_2\) distribution.
Logistic regression
As you learned in STA 112, linear regression is used with quantitative responses, whereas logistic regression is used for binary responses (\(Y_i = 0\) or \(1\)). In particular, if \(\pi_i = P(Y_i = 1)\), then a simple logistic regression model (with explanatory variable \(X_i\)) is \[\log \left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1 X_i\]
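Solving for \(\pi_i\) shows how the linear predictor maps back to a probability:
\[\pi_i = \frac{e^{\beta_0 + \beta_1 X_i}}{1 + e^{\beta_0 + \beta_1 X_i}}\]
The fitted values from a logistic regression model are estimates \(\widehat{\pi}_i\) of these probabilities, which is how we will use them in the residual questions below.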
As an example, consider a dataset of 3402 pitches thrown by MLB pitcher Clayton Kershaw in the 2013 season. The data is contained in the Kershaw data set, in the Stat2Data R package.
We will focus on two specific variables for each pitch:
- Result: a negative result (a ball or a hit), or a positive result (a strike or an out)
- EndSpeed: the speed at which the ball crossed home plate (in mph)
Our goal is to investigate the relationship between pitch speed and result. We can fit a logistic regression model, with EndSpeed as the explanatory variable and Result as the response:
library(tidyverse)
library(Stat2Data)
data("Kershaw")
log_reg <- glm(Result ~ EndSpeed, family = binomial, data = Kershaw)
summary(log_reg)
##
## Call:
## glm(formula = Result ~ EndSpeed, family = binomial, data = Kershaw)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.432279 0.488616 -2.931 0.003375 **
## EndSpeed 0.023360 0.006019 3.881 0.000104 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4540.4 on 3401 degrees of freedom
## Residual deviance: 4525.4 on 3400 degrees of freedom
## AIC: 4529.4
##
## Number of Fisher Scoring iterations: 4
Here, log_reg is the resulting logistic regression model in R. What type of object is log_reg?
Question 4
Use the typeof function to determine what type of object log_reg is.
Question 5
Use the length function to determine how many components are in the top level of log_reg. (There are a lot!)
Question 6
Let’s look at the first component: use log_reg[[1]] to see what it contains. What is stored here?
How were we meant to know that the first component of log_reg contained the coefficients? We probably wouldn’t! Fortunately, R offers the option to name the components of a list, which makes it easier to access them without remembering all their indices. To get the coefficients, for example, we can run the following:
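For instance, one way is to index the list by name with double brackets:

log_reg[["coefficients"]]  # extract the named component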
## (Intercept) EndSpeed
## -1.43227907 0.02336011
Another way of accessing named entries in a list is with the $ operator:
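For example:

log_reg$coefficients  # same component, accessed with the $ operator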
## (Intercept) EndSpeed
## -1.43227907 0.02336011
We can see all the names for a list with the names function:
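In this case:

names(log_reg)  # list every named component of the model object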
## [1] "coefficients" "residuals" "fitted.values"
## [4] "effects" "R" "rank"
## [7] "qr" "family" "linear.predictors"
## [10] "deviance" "aic" "null.deviance"
## [13] "iter" "weights" "prior.weights"
## [16] "df.residual" "df.null" "y"
## [19] "converged" "boundary" "model"
## [22] "call" "formula" "terms"
## [25] "data" "offset" "control"
## [28] "method" "contrasts" "xlevels"
This makes it easy to see what information the list contains!
Question 7
How would you extract the AIC for your logistic regression model, without using the summary function? Write the code here.
Logistic regression diagnostics
So far, so good! But, how do we assess the assumptions for this logistic regression model? The usual residual plots (like we would make for linear regression) are not helpful:
data.frame(fitted = log_reg$fitted.values,
           residuals = log_reg$y - log_reg$fitted.values) |>
  ggplot(aes(x = fitted, y = residuals)) +
  geom_abline(slope = 0, intercept = 0, color = "blue") +
  geom_point() +
  labs(x = "Fitted values", y = "Residuals") +
  theme_bw()
Instead of the raw residuals \(Y_i - \widehat{\pi}_i\), it is common in logistic regression to use a type of residual called randomized quantile residuals. For logistic regression, a randomized quantile residual is defined as follows:
\[r_{Q}(Y_i, \widehat{\pi}_i) = \Phi^{-1}(u) \hspace{1cm} u \sim \begin{cases} Uniform(1 - \widehat{\pi}_i, 1) & Y_i = 1 \\ Uniform(0, 1 - \widehat{\pi}_i) & Y_i = 0, \end{cases}\]
In other words:
- If \(Y_i = 1\), draw a random number \(u\) from a \(Uniform(1 - \widehat{\pi}_i, 1)\) distribution
- Else (\(Y_i = 0\)), draw a random number \(u\) from a \(Uniform(0, 1 - \widehat{\pi}_i)\) distribution
- Then return \(\Phi^{-1}(u)\), where \(\Phi^{-1}\) is the quantile function for a standard normal distribution
How do we do this? Some hints:
- log_reg$fitted.values will return a vector of \(\widehat{\pi}_i\)’s for the observed data
- log_reg$y will return a vector of \(Y_i\)’s for the observed data
- runif(...) will sample from a uniform distribution (a brief illustration follows this list)
- ifelse can be used to handle conditional statements
- In R, \(\Phi^{-1}(u)\) would be calculated with qnorm(u)
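As a quick illustration (the bounds below are made up), runif accepts vectors for its min and max arguments, so each draw can come from its own interval:

lower <- c(0, 0.5, 0)              # hypothetical lower bounds, one per observation
upper <- c(0.5, 1, 1)              # hypothetical upper bounds, one per observation
runif(3, min = lower, max = upper) # one draw from each Uniform(lower[i], upper[i])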
Question 8
Using these hints, write a function quant_resid that calculates randomized quantile residuals for a logistic regression model. Your function should take as input a logistic regression model (the output of a call to glm in R), and return a vector of randomized quantile residuals (one residual for each observation in the data).

You may use a for loop, or use vectorized functions in R. You may not use any existing quantile residual functions from R packages.
Quantile residuals, continued: plotting
Above, you wrote a function quant_resid to generate randomized quantile residuals for a logistic regression model. Next, you will use your quant_resid function to write a new function, which produces a quantile residual plot for a logistic regression model.
Question 9
Write a function, quant_resid_plot, which takes as input a fitted logistic regression model (the output from glm in R), and produces a quantile residual plot.
- The quantile residuals should be plotted on the y-axis
- The fitted values (predicted probabilities) from the logistic regression model should be plotted on the x-axis
- You should add a horizontal line at \(y = 0\)
- The axes should be labeled appropriately
- You may use either base R or ggplot for making the plot
Here is sample output for the Kershaw data (note: these are randomized quantile residuals, so it is ok if your plot looks slightly different):
Interpreting the quantile residual plot is similar to interpreting residual plots for linear regression. If the logistic regression assumptions are satisfied, then the quantile residuals should be randomly scattered around the horizontal line at 0, with no clear pattern.