Instruction: Work with a neighbor to answer the following questions, then we will discuss the activity as a class. To get started, download the class activity template file.
In a previous class, we started a simulation to assess how important the normality assumption is in the simple linear regression model
\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]
That is, how important is the assumption that \(\varepsilon_i \sim N(0, \sigma^2)\)?
So far, we have written the following code to simulate data for which the normality assumption is satisfied:
nsim <- 1000 # number of simulated datasets
n <- 100 # sample size
beta0 <- 0.5 # intercept
beta1 <- 1 # slope
results <- rep(NA, nsim) # will record whether each CI covers the true slope
for(i in 1:nsim){
x <- runif(n, min=0, max=1)
noise <- rnorm(n, mean=0, sd=1)
y <- beta0 + beta1*x + noise
lm_mod <- lm(y ~ x)
ci <- confint(lm_mod, "x", level = 0.95)
results[i] <- ci[1] < beta1 & ci[2] > beta1 # does the CI cover the true slope?
}
mean(results)
In particular, the line noise <- rnorm(n, mean=0, sd=1)
ensures the errors come from a normal distribution. When we run this code, the empirical coverage of our confidence intervals is approximately 95% (as expected).
Now, we want to know how important it is that the errors \(\varepsilon_i\) be normal. To address that question, we need to see what happens when \(\varepsilon_i\) comes from a different distribution! We can simulate from many different distributions in R, including the following:
rt(...)
rexp(...)
rchisq(...)
rgamma(...)
runif(...)
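For example (one possible starting point, not the only choice), we could replace the line noise <- rnorm(n, mean=0, sd=1) with either of the following lines. The "- 1" centers the exponential draws, since an Exponential(1) random variable has mean 1, so the errors still have mean zero; the choice of 3 degrees of freedom for the t distribution is just illustrative.
noise <- rexp(n, rate=1) - 1 # skewed, mean-zero errors
noise <- rt(n, df=3) # heavy-tailed errors from a t distribution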
Experiment with different distributions for the noise term \(\varepsilon_i\) in the code above. How does the confidence interval coverage change?
Does confidence interval coverage depend on the sample size \(n\)?
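One way to start exploring this question is sketched below. It assumes we keep nsim, beta0, and beta1 from the code above, and it uses a few illustrative sample sizes and skewed exponential errors; it simply wraps the earlier simulation in an outer loop over n and records the coverage for each sample size.
sample_sizes <- c(10, 30, 100, 500) # illustrative choices
coverage <- rep(NA, length(sample_sizes))
for(j in seq_along(sample_sizes)){
  n <- sample_sizes[j]
  results <- rep(NA, nsim)
  for(i in 1:nsim){
    x <- runif(n, min=0, max=1)
    noise <- rexp(n, rate=1) - 1 # skewed, mean-zero errors
    y <- beta0 + beta1*x + noise
    lm_mod <- lm(y ~ x)
    ci <- confint(lm_mod, "x", level = 0.95)
    results[i] <- ci[1] < beta1 & ci[2] > beta1
  }
  coverage[j] <- mean(results) # estimated coverage at this sample size
}
coverage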