Class Project
Due: Thursday, December 7, 11:00am on Canvas
Overview
Statistics often involves fitting models, and statistical software is created to fit these models. In this project, you will write your own functions to fit linear and generalized linear models.
I am not expecting you to be familiar with all of the methods described in this project, but you will need to implement them all. If you use computing in your future career – whether academic or in industry – you will very likely need to learn and implement a method you are unfamiliar with.
Requirements:
- You will write a function, my_glm, and any other helper functions needed, to fit linear, logistic, and Poisson regression models
- You will submit your code, and examples of your function in action
Advice:
- Start small. For example, start with just the estimated coefficients for linear regression. Then add functionality for logistic and Poisson regression. Then add p-value and deviance calculations.
- Check that your code works as you go. Do not wait until the end to test your code!
Rules:
- This project must be completed individually. You may not work with other students in any way, or share any code or help.
- You are welcome to use the class resources and textbooks. You may also use the internet (Stack Exchange, Stack Overflow, etc.) for guidance on small pieces of your code. For example, you may look up R documentation on parsing formulas.
- However, you may not search for existing implementations of any functions needed for this project. For example, you may not use any existing implementations (or their source code) of logistic regression, Fisher scoring, generalized linear models, etc.
- You may not use ChatGPT or other AI tools for any part of this project.
Any violations of these rules will result in a 0 on the project.
Background: Linear and generalized linear models
Linear models
In STA 112, you learned about linear models. For example,
\[Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \cdots + \beta_k X_{i,k} + \varepsilon_i,\]
where \(\varepsilon_i \sim N(0, \sigma^2)\). Alternatively, we can write this model in two steps:
\[Y_i \sim N(\mu_i, \sigma^2)\] \[\mu_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \cdots + \beta_k X_{i,k}\]
The first step (the random component of the model) describes the distribution of \(Y_i\). Here \(Y_i\) is Normal, with mean \(\mu_i\) and variance \(\sigma^2\). The mean \(\mu_i\) depends on the explanatory variables (the \(X\)s), and the second step (the systematic component) describes the relationship between \(\mu_i\) and the explanatory variables. Here, \(\mu_i\) is a linear function of the explanatory variables – hence the term linear model!
Generalized linear models
The linear model above assumes that each \(Y_i\) comes from a Normal distribution. But this isn’t always a good assumption! For example, if our outcome is binary (\(Y_i = 0\) or \(Y_i = 1\)), then the Normal distribution is a terrible choice.
Generalized linear models allow for different distributions of the response variable. There are many generalized linear models, for different types of response variable. In this project, we will consider the usual linear regression model, plus two generalized linear models: logistic regression (for binary response variables) and Poisson regression (for count variables, i.e. taking values 0, 1, 2, 3, etc.).
Logistic regression
You have seen logistic regression already in STA 112. As a reminder, binary logistic regression is used when the response \(Y_i\) is either 0 or 1. The full logistic regression model is:
\[Y_i \sim Bernoulli(\mu_i)\]
\[\log \left( \frac{\mu_i}{1 - \mu_i} \right) = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \cdots + \beta_k X_{i,k}\]
Note that the mean of \(Y_i\), \(\mu_i\) (the probability that \(Y_i = 1\)), is no longer a linear function of the explanatory variables! Rather, we have used a logit (log-odds) link function to relate \(\mu_i\) to the \(X\)s. This gives logistic regression its name.
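To make the link concrete, the logit and its inverse are one-liners in R (a quick sketch; the names logit and inv_logit are our own, not base R functions):

# the logit (log-odds) link: maps a probability in (0, 1) to the whole real line
logit <- function(p) log(p / (1 - p))
# the inverse logit: maps a linear predictor back to a probability
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))

For example, inv_logit(0) returns 0.5: a linear predictor of 0 corresponds to a probability of one half.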
Poisson regression
If you haven’t taken STA 214, you probably haven’t seen Poisson regression. That’s ok! No prior knowledge of Poisson regression is needed to complete this project.
The Poisson regression model is often used when our response \(Y_i\) is a count variable. That is, \(Y_i\) takes non-negative integer values 0, 1, 2, 3, etc. One possible distribution for count variables is the Poisson distribution, which is used as the random component of Poisson regression models. The full Poisson regression model is:
\[Y_i \sim Poisson(\mu_i)\]
\[\log(\mu_i) = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \cdots + \beta_k X_{i,k}\]
Fitting linear and generalized linear models
To explain how linear models and generalized linear models are fit, we need to introduce some new ideas and notation. Let
\[Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}\] be the vector of observed responses, let
\[\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}\]
be the vector containing the regression coefficients, and let
\[X = \begin{bmatrix} 1 & X_{1,1} & X_{1,2} & \cdots & X_{1,k} \\ 1 & X_{2,1} & X_{2,2} & \cdots & X_{2,k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & X_{n,2} & \cdots & X_{n,k} \end{bmatrix} \] be the design matrix. Each row of \(X\) represents one observation, and each column represents one variable (the first column, of all 1s, corresponds to the intercept \(\beta_0\)).
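In R, a design matrix of this form can be built with cbind (a one-line sketch, assuming x1 and x2 are vectors of length \(n\) holding the explanatory variables):

# a column of 1s for the intercept, then one column per explanatory variable
X <- cbind(1, x1, x2)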
Fitting linear regression
The vector \(\widehat{\beta}\) of estimated coefficients is given by
\[\widehat{\beta} = (X^TX)^{-1} X^T Y\]
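This formula translates directly into R's matrix operations (a minimal sketch, assuming the design matrix X and response vector y are already constructed; see the matrix algebra hints below):

# closed-form least squares estimate: (X^T X)^{-1} X^T Y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y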
Fitting a generalized linear model
Linear regression is nice because the estimated coefficients have a closed form. Unfortunately, there is no closed form solution for other generalized linear models. Instead, we use an iterative process called Fisher scoring:
- Step 1: Initialize the coefficients. Let \(\beta^{(0)}\) denote the vector of initial coefficients.
- Step 2: Calculate the score \(U(\beta^{(0)})\) and the information \(I(\beta^{(0)})\) for the initial coefficients (see below).
- Step 3: Update the estimated coefficients:
\[\beta^{(1)} = \beta^{(0)} + (I(\beta^{(0)}))^{-1} U(\beta^{(0)})\]
- Step 4: Continue the process. If \(\beta^{(r)}\) is the estimated coefficients on iteration \(r\), then the estimated coefficients on iteration \(r + 1\) are
\[\beta^{(r+1)} = \beta^{(r)} + (I(\beta^{(r)}))^{-1} U(\beta^{(r)})\]
- Step 5: Stop when the algorithm converges (see below), or when a maximum number of iterations is reached.
Score and information
The score and information for a generalized linear model are functions of the parameter vector \(\beta\):
- \(U(\beta^{(r)}) = X^T(Y - \mu^{(r)})\)
- \(I(\beta^{(r)}) = X^T W^{(r)} X\)
For logistic regression:
\[\mu^{(r)} = \begin{bmatrix} \mu_1^{(r)} \\ \vdots \\ \mu_n^{(r)} \end{bmatrix} = \begin{bmatrix} \frac{\exp\{\beta_0^{(r)} + \beta_1^{(r)} X_{1,1} + \cdots + \beta_k^{(r)} X_{1,k}\}}{1 + \exp\{\beta_0^{(r)} + \beta_1^{(r)} X_{1,1} + \cdots + \beta_k^{(r)} X_{1,k}\}} \\ \vdots \\ \frac{\exp\{\beta_0^{(r)} + \beta_1^{(r)} X_{n,1} + \cdots + \beta_k^{(r)} X_{n,k}\}}{1 + \exp\{\beta_0^{(r)} + \beta_1^{(r)} X_{n,1} + \cdots + \beta_k^{(r)} X_{n,k}\}} \end{bmatrix} = \begin{bmatrix} \frac{\exp\{X_1^T \beta^{(r)}\}}{1 + \exp\{X_1^T \beta^{(r)}\}} \\ \vdots \\ \frac{\exp\{X_n^T \beta^{(r)}\}}{1 + \exp\{X_n^T \beta^{(r)}\}} \end{bmatrix} = \frac{\exp\{ X \beta^{(r)}\}}{1 + \exp\{ X \beta^{(r)}\}}\]
That is, \(\mu^{(r)}\) is the vector of estimated means at iteration \(r\).
\(X_i^T\) is the \(i\)th row of the design matrix \(X\).
On the right-hand side of the equation, \(X \beta^{(r)}\) is a vector, \(\exp\{X \beta^{(r)}\}\) represents exponentiating each entry in that vector, and the division is likewise performed entry by entry.
\[W^{(r)} = \begin{bmatrix} \mu_1^{(r)}(1 - \mu_1^{(r)}) & 0 & \cdots & 0 \\ 0 & \mu_2^{(r)}(1 - \mu_2^{(r)}) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mu_n^{(r)}(1 - \mu_n^{(r)}) \end{bmatrix}\]
That is, \(W^{(r)}\) is the diagonal matrix with entries \(\mu_i^{(r)}(1 - \mu_i^{(r)})\).
For Poisson regression:
\[\mu^{(r)} = \begin{bmatrix} \mu_1^{(r)} \\ \vdots \\ \mu_n^{(r)} \end{bmatrix} = \begin{bmatrix} \exp\{\beta_0^{(r)} + \beta_1^{(r)} X_{1,1} + \cdots + \beta_k^{(r)} X_{1,k}\} \\ \vdots \\ \exp\{\beta_0^{(r)} + \beta_1^{(r)} X_{n,1} + \cdots + \beta_k^{(r)} X_{n,k}\} \end{bmatrix} = \begin{bmatrix} \exp\{X_1^T \beta^{(r)}\} \\ \vdots \\ \exp\{X_n^T \beta^{(r)}\} \end{bmatrix} = \exp\{ X \beta^{(r)} \}\]
That is, \(\mu^{(r)}\) is the vector of estimated means at iteration \(r\).
\[W^{(r)} = \begin{bmatrix} \mu_1^{(r)} & 0 & \cdots & 0 \\ 0 & \mu_2^{(r)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mu_n^{(r)} \end{bmatrix}\]
That is, \(W^{(r)}\) is the diagonal matrix with entries \(\mu_i^{(r)}\).
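To illustrate, one scoring update can be computed in a few lines of R (a sketch under our own object names, assuming the design matrix X, response vector y, and current coefficient vector beta already exist):

eta <- X %*% beta                           # linear predictor at the current coefficients
mu <- exp(eta) / (1 + exp(eta))             # estimated means for logistic regression
w <- mu * (1 - mu)                          # diagonal entries of W for logistic regression
# (for Poisson regression, use mu <- exp(eta) and w <- mu instead)
U <- t(X) %*% (y - mu)                      # score vector
I_mat <- t(X) %*% diag(as.vector(w)) %*% X  # information matrix
beta <- beta + solve(I_mat) %*% U           # one Fisher scoring update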
Initialization
Note that for linear regression, no initialization is required (because we don’t need to use Fisher scoring, because there is a closed-form solution for \(\widehat{\beta}\)).
For logistic regression
Let \(\bar{Y} = \frac{1}{n} \sum \limits_{i=1}^n Y_i\). One way to initialize logistic regression is to set
\[\beta^{(0)} = \left( \log \left( \frac{\bar{Y}}{1 - \bar{Y}} \right), 0, 0, ..., 0 \right)^T\] That is, the initial value of the intercept is \(\log \left( \frac{\bar{Y}}{1 - \bar{Y}} \right)\), and the initial value of all other coefficients is 0.
For Poisson regression
Let \(\bar{Y} = \frac{1}{n} \sum \limits_{i=1}^n Y_i\). One way to initialize Poisson regression is to set
\[\beta^{(0)} = \left( \log (\bar{Y}), 0, 0, ..., 0 \right)^T\] That is, the initial value of the intercept is \(\log (\bar{Y})\), and the initial value of all other coefficients is 0.
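Both starting vectors are easy to build in R (a sketch; here ncol(X) - 1 equals the number of explanatory variables \(k\)):

y_bar <- mean(y)
# logistic regression: intercept starts at the sample log-odds, other coefficients at 0
beta_init_logistic <- c(log(y_bar / (1 - y_bar)), rep(0, ncol(X) - 1))
# Poisson regression: intercept starts at the log of the sample mean
beta_init_poisson <- c(log(y_bar), rep(0, ncol(X) - 1))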
Convergence
There is no convergence to worry about for linear regression (because there is a closed form solution for the coefficients). When Fisher scoring is required, however, we need to define a convergence criterion that specifies when to stop the scoring algorithm.
Convergence is usually defined in terms of the residual deviance of the fitted model. Think of residual deviance as the equivalent of the residual sum of squares from linear regression. Formally, deviance is twice the difference in the log-likelihood between a saturated model and the fitted model.
It is ok if you don’t know what that means! The deviances are defined below for you. You can also find advice on calculating the deviance in R in the “Hints on implementation in R” section below.
Convergence: For the purposes of this project, say that the Fisher scoring algorithm has converged when the change in deviance between iterations is \(< 0.001\).
For linear regression
For linear regression (with a Normal response), the residual deviance is given by
\[\sum \limits_{i=1}^n (Y_i - \widehat{\mu}_i)^2\]
where \(\widehat{\mu}_i = \widehat{\beta}_0 + \widehat{\beta}_1 X_{i,1} + \cdots + \widehat{\beta}_k X_{i, k}\).
For logistic regression
For logistic regression with a binary response, the residual deviance is given by
\[-2 \sum \limits_{i=1}^n \left(Y_i \log(\widehat{\mu}_i) + (1 - Y_i) \log(1 - \widehat{\mu}_i) \right)\]
where \(\widehat{\mu}_i = \dfrac{\exp\{\widehat{\beta}_0 + \widehat{\beta}_1 X_{i,1} + \cdots + \widehat{\beta}_k X_{i, k}\}}{1 + \exp\{\widehat{\beta}_0 + \widehat{\beta}_1 X_{i,1} + \cdots + \widehat{\beta}_k X_{i, k}\}}\).
For Poisson regression
For Poisson regression, the residual deviance is given by
\[2 \sum \limits_{i=1}^n \left(Y_i \log\left( \frac{Y_i}{\widehat{\mu}_i}\right) - (Y_i - \widehat{\mu}_i) \right)\]
where we define \(0 \log(0) = 0\), and \(\widehat{\mu}_i = \exp\{\widehat{\beta}_0 + \widehat{\beta}_1 X_{i,1} + \cdots + \widehat{\beta}_k X_{i, k}\}\).
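Putting the initialization, scoring updates, and convergence check together, the algorithm might be organized like this (a sketch only; initialize_beta, compute_mu, compute_info, and compute_deviance are hypothetical helper names standing in for the formulas above):

beta <- initialize_beta(X, y, family)        # starting values from the Initialization section
dev_old <- Inf
for (iter in 1:max_iter) {
  mu <- compute_mu(X, beta, family)          # estimated means at the current coefficients
  U <- t(X) %*% (y - mu)                     # score vector
  I_mat <- compute_info(X, mu, family)       # information matrix, t(X) %*% W %*% X
  beta <- beta + solve(I_mat) %*% U          # Fisher scoring update
  dev_new <- compute_deviance(y, compute_mu(X, beta, family), family)
  if (abs(dev_old - dev_new) < 0.001) break  # converged: change in deviance is small
  dev_old <- dev_new
}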
Hypothesis testing
For each coefficient \(\beta_j\), we can test the hypotheses \(H_0: \beta_j = 0\) vs. \(H_A: \beta_j \neq 0\).
Linear regression
For a linear regression model, the test statistic is
\[T = \frac{\widehat{\beta}_j}{SE(\widehat{\beta}_j)}\]
where \(SE(\widehat{\beta}_j) = \sqrt{\widehat{\sigma}^2 (X^T X)^{-1}_{jj}}\) (the square root of the \(j\)th diagonal entry of \(\widehat{\sigma}^2 (X^T X)^{-1}\)). Here
\[\widehat{\sigma}^2 = \frac{1}{n - (k + 1)} \sum \limits_{i=1}^n (Y_i - \widehat{Y}_i)^2\]
The p-value is then calculated using a \(t_{n-(k+1)}\) distribution.
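In R, if t_stat holds this test statistic (our name for it), the two-sided p-value is:

# two-sided p-value from a t distribution with n - (k + 1) degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - (k + 1))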
Logistic and Poisson regression
For a logistic or Poisson regression model, the test statistic is
\[Z = \frac{\widehat{\beta}_j}{SE(\widehat{\beta}_j)}\]
where \(SE(\widehat{\beta}_j) = \sqrt{(X^T W X)^{-1}_{jj}}\) (the square root of the \(j\)th diagonal entry of the inverse Fisher information matrix).
The p-value is then calculated using a standard normal distribution.
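Similarly, if z_stat holds the test statistic, the two-sided p-value in R is:

# two-sided p-value from a standard normal distribution
p_value <- 2 * pnorm(-abs(z_stat))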
Hints on implementation in R
Matrices and matrix algebra
- \(X^T\) is the transpose of \(X\). In R: t(X)
- Matrix multiplication in R is done with %*%. For example, \(X^T y\) would be t(X) %*% y
- Matrix inverses are found with solve. For example, \((X^T X)^{-1}\) would be solve(t(X) %*% X)
- Diagonal matrices can be created with the diag function. For example, diag(c(1, 2, 3)) creates a diagonal matrix with the entries 1, 2, 3 on the diagonal:
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 2 0
## [3,] 0 0 3
- Make sure to follow the ordering specified in the definitions above. Matrix multiplication is not commutative in general! E.g. t(X) %*% X is not the same as X %*% t(X)
- If you get an error that says non-conformable arguments, you have made a mistake with your matrix algebra, such as creating matrices of the wrong size or multiplying in the incorrect order. Checking each matrix's dimensions with dim() is a quick way to find the mistake.
Calculating deviances
Suppose you have a vector y in R, which contains the observed responses \(Y_1,...,Y_n\), and suppose the vector mu_hat in R contains the estimated means \(\widehat{\mu}_1,...,\widehat{\mu}_n\). Then the deviance for each model can be calculated in R as follows:
- Linear regression: sum((y - mu_hat)^2)
- Logistic regression: -2*sum(dbinom(y, 1, mu_hat, log=T))
- Poisson regression: 2*(sum(dpois(y, y, log=T)) - sum(dpois(y, mu_hat, log=T)))
(In the Poisson case, dpois(y, y, log=T) gives the log-likelihood of the saturated model, and dpois handles the \(0 \log(0) = 0\) convention automatically, since dpois(0, 0) = 1 in R.)
Calculating p-values
The pnorm and pt functions will be useful.
Requirements and examples
Code requirements
For this project, you must write at least one function in R:
- my_glm: a function for fitting linear, logistic, and Poisson GLMs

You may write other helper functions too, but the choice of helper functions is up to you. You may find it helpful to write separate functions for the score, the Fisher information, and the Fisher scoring algorithm.

my_glm must satisfy the following requirements:
- Inputs: there are two options (Option 2 is more advanced, and will earn you a higher grade)
  - Option 1:
    - X: design matrix
    - y: vector of responses
    - family: a string, either "gaussian" for linear regression, "binomial" for logistic regression, or "poisson" for Poisson regression
    - max_iter: the maximum number of iterations for Fisher scoring; the default should be 50
  - Option 2:
    - formula: a formula that describes the model, like you would use in lm. E.g. y ~ x, or y ~ x1 + x2, etc. (If you choose the formula option, you do not need to support categorical explanatory variables, transformations, or interactions; you may receive bonus points if you do)
    - family: a string, either "gaussian" for linear regression, "binomial" for logistic regression, or "poisson" for Poisson regression
    - data: a data frame containing the variables in the model, with the same names as used in the formula
    - max_iter: the maximum number of iterations for Fisher scoring; the default should be 50
- Outputs: a list containing the following named components:
  - coefficients: a matrix or data frame containing four columns: the estimated coefficients, standard errors for the estimated coefficients, test statistics (\(Z\) or \(T\) statistics), and p-values for the hypothesis test \(H_0: \beta_j = 0\) vs. \(H_A: \beta_j \neq 0\). See below for examples.
  - iterations: the number of iterations of the Fisher scoring algorithm used to estimate the coefficients (if linear regression, iterations = 1)
  - family: the type of GLM ("gaussian" for linear regression, "binomial" for logistic regression, or "poisson" for Poisson regression)
  - deviance: the residual deviance for the fitted model
- You may not use any existing implementations of GLM functions
- Your function cannot be specific to one data set, and cannot be specific to only a fixed number of explanatory variables
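If you attempt Option 2, one possible route from a formula and data frame to the inputs of Option 1 is R's built-in model-frame machinery (a sketch, not the only acceptable approach; here y ~ x1 + x2 and df stand in for the function's formula and data arguments):

# build the design matrix and response vector from a formula and a data frame
mf <- model.frame(y ~ x1 + x2, data = df)   # the variables used in the model
y <- model.response(mf)                     # response vector
X <- model.matrix(y ~ x1 + x2, data = mf)   # design matrix, including the intercept column

Remember that the rules permit looking up R documentation on parsing formulas, but not existing GLM implementations.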
Examples
Below are examples of the functions in action. You should implement the same functionality in your code. You can use the same simulated data to verify that your code works.
Linear regression, Option 1
# Simulate some data
set.seed(279)
n <- 100
x1 <- runif(n)
x2 <- rnorm(n)
y <- rnorm(n, 1 + x1 + x2, 2)
# create the design matrix
X <- cbind(1, x1, x2)
# fit the model
m1 <- my_glm_1(X, y, "gaussian")
# model output
m1$coefficients
## beta se t_score p_value
## (Intercept) 1.1182111 0.4321904 2.587311 1.115660e-02
## x1 1.4126586 0.7134387 1.980070 5.052892e-02
## x2 0.9552916 0.1878524 5.085330 1.788912e-06
m1$iterations
## [1] 1
m1$family
## [1] "gaussian"
m1$deviance
## [1] 397.8476
Linear regression, Option 2
# Simulate some data
set.seed(279)
n <- 100
x1 <- runif(n)
x2 <- rnorm(n)
y <- rnorm(n, 1 + x1 + x2, 2)
# create the data
df <- data.frame(y, x1, x2)
# fit the model
m1 <- my_glm_2(y ~ x1 + x2, family = "gaussian", data = df)
# model output
m1$coefficients
## beta se t_score p_value
## (Intercept) 1.1182111 0.4321904 2.587311 1.115660e-02
## x1 1.4126586 0.7134387 1.980070 5.052892e-02
## x2 0.9552916 0.1878524 5.085330 1.788912e-06
m1$iterations
## [1] 1
m1$family
## [1] "gaussian"
m1$deviance
## [1] 397.8476
Linear regression, Option 2 with real data
library(palmerpenguins)
library(tidyverse)
df <- penguins |>
drop_na()
m1 <- my_glm_2(body_mass_g ~ flipper_length_mm + bill_length_mm,
family = "gaussian", data = df)
m1$coefficients
## beta se t_score p_value
## (Intercept) -5836.298732 312.603503 -18.669972 1.341791e-53
## flipper_length_mm 48.889692 2.034204 24.033815 1.737931e-74
## bill_length_mm 4.958601 5.213505 0.951107 3.422461e-01
m1$iterations
## [1] 1
m1$family
## [1] "gaussian"
m1$deviance
## [1] 51071963
Logistic regression, Option 1
# simulate some data
set.seed(123)
n <- 200
x <- rnorm(n)
p <- exp(1 + x)/(1 + exp(1 + x))
y <- rbinom(n, 1, p)
# create the design matrix
X <- cbind(1, x)
# fit the model
m1 <- my_glm_1(X, y, "binomial")
# model output
m1$coefficients
## beta se z_score p_value
## (Intercept) 1.189801 0.1839126 6.469385 9.840288e-11
## x 0.922279 0.2143541 4.302595 1.688094e-05
m1$iterations
## [1] 4
m1$family
## [1] "binomial"
m1$deviance
## [1] 208.6589
Logistic regression, Option 2 with real data
df <- penguins |>
drop_na() |>
mutate(sex = ifelse(sex == "female", 1, 0))
m1 <- my_glm_2(sex ~ flipper_length_mm + bill_length_mm,
family = "binomial", data = df)
m1$coefficients
## beta se z_score p_value
## (Intercept) 7.005985900 1.72762038 4.0552809 5.007409e-05
## flipper_length_mm -0.007736914 0.01108136 -0.6981915 4.850574e-01
## bill_length_mm -0.124374714 0.02952819 -4.2120671 2.530444e-05
m1$iterations
## [1] 3
m1$family
## [1] "binomial"
m1$deviance
## [1] 419.9377
Poisson regression, Option 1
# simulate some data
set.seed(214)
n <- 300
x1 <- rbinom(n, 1, 0.5)
x2 <- runif(n)
x3 <- runif(n)
y <- rpois(n, exp(0.5 - x1 + x2 - 0.5*x3))
# create the design matrix
X <- cbind(1, x1, x2, x3)
# fit the model
m1 <- my_glm_1(X, y, "poisson")
# model output
m1$coefficients
## beta se z_score p_value
## (Intercept) 0.4096576 0.1288543 3.179232 1.476660e-03
## x1 -0.9375865 0.1000650 -9.369779 7.268426e-21
## x2 1.0644028 0.1686682 6.310630 2.779011e-10
## x3 -0.3281246 0.1686991 -1.945029 5.177144e-02
m1$iterations
## [1] 4
m1$family
## [1] "poisson"
m1$deviance
## [1] 358.0896
Submission requirements
You will submit two files:
- yourname_project1.R: an R script (not an R Markdown or Quarto document) containing your my_glm function, and any additional helper functions (e.g. Fisher scoring, etc.)
- yourname_project1_examples.html: a knitted Quarto or R Markdown document showing examples of your my_glm function in action with linear, logistic, and Poisson regression
Grading
There are minimum requirements for each grade range:
- To receive a grade in the C range (C- to C+):
- Your code must run (if your code does not run, you cannot pass the project)
- You must correctly implement linear, logistic, and Poisson regression with Option 1, and return the estimated coefficients (if your estimated coefficients are incorrect, you cannot pass the project)
- To receive a grade in the B range (B- to B+):
- Your code must run (if your code does not run, you cannot pass the project)
- You must correctly implement linear, logistic, and Poisson regression with Option 1
- You must return all required output, and it must be correct: the coefficients, the number of scoring iterations, the family, and the deviance
- You must demonstrate that your code works with examples of linear, logistic, and Poisson regression
- To receive a grade in the A range (A- or A):
- Your code must run (if your code does not run, you cannot pass the project)
- You must correctly implement linear, logistic, and Poisson regression with Option 2
- You must return all required output, and it must be correct: the coefficients, the number of scoring iterations, the family, and the deviance
- You must demonstrate that your code works with examples of linear, logistic, and Poisson regression
Within each grade range, scores will also depend on code readability, clarity, and commenting.
Summary: project checklist
yourname_project1.R:
- Needs to contain at least one function (my_glm) and any additional helper functions
- Check that my_glm takes the correct inputs and returns the correct outputs
- Needs to support linear, logistic, and Poisson regression
yourname_project1_examples.html:
- Needs to contain examples of fitting linear, logistic, and Poisson models