Homework 1

Due: Friday, September 8, 11:00am on Canvas

Instructions:

Download the HW 1 template, and open the template (a Quarto document) in RStudio.
Put your name in the file header
Click Render
Type all code and answers in the document (using ### for section headings and #### for question headings)
Render early and often to catch any errors!
When you are finished, submit the final rendered HTML to Canvas

Code guidelines:

If a question requires code, and code is not provided, you will not receive full credit
You will be graded on the quality of your code. In addition to being correct, your code should also be easy to read
- No magic numbers
- Use descriptive names for your variables
- Set seeds where needed
- Comment code
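
As an illustration of these guidelines, here is a minimal sketch on a hypothetical task that is not part of this assignment: estimating how often a fair die shows a six. The seed value and all variable names are arbitrary choices made for this example.

```r
# Hypothetical example: estimate the chance that a fair die shows a six.
set.seed(1)               # arbitrary seed, so the results are reproducible

num_rolls <- 1000         # named constants instead of bare numbers in the code
num_sides <- 6
target_face <- 6

# Roll the die num_rolls times
rolls <- sample(1:num_sides, size = num_rolls, replace = TRUE)

# Proportion of rolls that landed on the target face
prop_target <- mean(rolls == target_face)
prop_target
```

Because the constants have names, changing the number of rolls or the target face means editing one line rather than hunting for every place the number appears.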

Resources: Homework 1 will give you practice with loops, if statements, and simulation. In addition to the class notes and activities, I recommend reading the following resources:

Appendix B (overview of R) in Modern Data Science with R
Chapter 3.1 – 3.2 (an introduction to vectors) in Advanced R
Chapter 5 (loops and choices) in Advanced R

Practice with `for` loops

The purpose of this section is to give you some more practice working with for loops and sequences, which are useful tools for efficiently repeating a process many times. Here is an example for loop that calculates \(\sqrt{x}\) for a sequence of numbers \(x = 0, 0.1, 0.2, ..., 0.9, 1\):

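# Sequence of inputs: 0, 0.1, ..., 1
x <- seq(from=0, to=1, by=0.1)
# Pre-allocate a vector to store the results
sqrt_x <- rep(0, length(x))
# Fill in the square root of each element of x
for(i in 1:length(x)){
  sqrt_x[i] <- sqrt(x[i])
}
sqrt_x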
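
A side note on the loop index: `1:length(x)` works here because `x` is not empty, but `seq_along(x)` is often suggested as a safer habit. The sketch below is one possible variant of the same loop, followed by a small demonstration of the difference; the vector `empty` is introduced only for illustration.

```r
x <- seq(from = 0, to = 1, by = 0.1)
sqrt_x <- rep(0, length(x))

# Same loop as above, indexing with seq_along(x) instead of 1:length(x)
for (i in seq_along(x)) {
  sqrt_x[i] <- sqrt(x[i])
}

# With an empty vector, 1:length(...) counts down from 1 to 0,
# while seq_along(...) gives an empty index set
empty <- numeric(0)
1:length(empty)     # 1 0
seq_along(empty)    # integer(0)
```

A loop over `1:length(empty)` would therefore run twice on a vector with no elements, whereas a loop over `seq_along(empty)` would not run at all.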

Below are some short practice questions to help you get more comfortable creating for loops.

Question 1

Modify the loop above so that instead of calculating \(\sqrt{x}\), we calculate \(x^{1/3}\).

Question 2

Modify the loop from Question 1 so that instead of considering \(x = 0, 0.1, 0.2, ..., 0.9, 1\) (i.e. the numbers between 0 and 1, in increments of 0.1), we consider \(x = 0, 0.05, 0.10, 0.15, ..., 1.95, 2\) (the numbers between 0 and 2, in increments of 0.05).

Note: In Questions 1 and 2, you are applying a function to each element in a vector. Here you have used a for loop, because the purpose of these questions is to practice loops. However, for loops are not always the most efficient way to write code. Instead, many functions in R are vectorized: if you apply the function to a vector, it is applied to each element of the vector. For example,

x <- seq(from=0, to=1, by=0.1)
sqrt_x <- sqrt(x)
sqrt_x

produces the same output as the for loop above.
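
One way to check this claim yourself is to compute both versions and compare them with `all.equal()`. This is only a sketch; the names `sqrt_x_loop` and `sqrt_x_vec` are chosen here for illustration.

```r
x <- seq(from = 0, to = 1, by = 0.1)

# Loop version: fill in one element at a time
sqrt_x_loop <- rep(0, length(x))
for (i in 1:length(x)) {
  sqrt_x_loop[i] <- sqrt(x[i])
}

# Vectorized version: apply sqrt to the whole vector at once
sqrt_x_vec <- sqrt(x)

# TRUE if the two results agree
all.equal(sqrt_x_loop, sqrt_x_vec)
```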

Question 3

Re-write the code for Question 1, using vectorization instead of the for loop.

Probability simulation

Consider a tug-of-war competition for robots. In each matchup, two robots take turns tugging the rope until the marker indicates that one of the robots has won. The match starts with the marker at 0.

Robot A pulls the rope – use runif(n=1,min=0,max=0.50) to simulate the magnitude of the pull. Adding the simulated value to the marker position gives the new position of the marker.
Robot B pulls the rope in the opposite direction – use runif(n=1,min=0,max=0.50) to simulate the magnitude of the pull. Subtracting the simulated value from the marker position gives the new position of the marker.
The two robots continue taking turns until the marker moves past -0.50 or 0.50.
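
To make the update rule concrete, here is a sketch of a single pair of turns, not a full match. It assumes Robot A's pull moves the marker in the positive direction and Robot B's pull moves it in the negative direction, which is how the -0.50 and 0.50 thresholds are read here. The seed and the variable names are arbitrary choices for this example.

```r
set.seed(1)        # arbitrary seed, for reproducibility

max_pull <- 0.50   # largest possible pull, from the runif() calls above
marker <- 0        # the match starts with the marker at 0

# Robot A's turn: the marker moves in the positive direction (assumed)
pull_a <- runif(n = 1, min = 0, max = max_pull)
marker <- marker + pull_a

# Robot B's turn: the marker moves in the negative direction (assumed)
pull_b <- runif(n = 1, min = 0, max = max_pull)
marker <- marker - pull_b

marker
```

In a full match, the marker would need to be checked against the thresholds after each individual pull, and the turns would repeat until one robot wins.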

Question 4

Write code that simulates 1000 robot tug-of-war battles.

Question 5

Report the results of your 1000 simulated robot tug-of-war battles. Is the game fair? If not, what adjustments could be made to make it fairer?

Designing simulation studies

In class, we have started to discuss the use of simulation studies to address statistical questions, such as “How important is the normality assumption in a simple linear regression model?” A simulation study allows us to investigate these questions by simulating data under a variety of different conditions (e.g., different violations of the normality assumption), and seeing how the statistical methods behave under these different conditions.

The paper “Using simulation studies to evaluate statistical methods” (Morris et al. 2019) provides a good overview of the important steps in designing a simulation study. Read sections 1 (Introduction) and 3 (Planning simulation studies), and then answer the following questions.

Question 6

What are some reasons researchers use simulation studies?

Question 7

According to the paper, what are the five components (abbreviated ADEMP) involved in planning a simulation study? Summarize each of the five components.

Question 8

In class, we started designing a simulation study to investigate the importance of the normality assumption in simple linear regression. For this simulation, describe each of the ADEMP components.

A new simulation study

We return here to the simple linear regression model:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

Another assumption from STA 112 is that the noise term \(\varepsilon_i\) has constant variance; that is, \(Var(\varepsilon_i) = \sigma^2\) for all observations (for example, the variance does not depend on \(X_i\)).

Suppose we want to design a simulation study to assess how important the constant variance assumption is. In this section of the assignment, you will plan out your simulation study. In HW 2, you will carry out the simulations.

Question 9

Use the ADEMP framework to plan a simulation study to explore the constant variance assumption. That is, you should describe

The aims of your study
How you will generate the data
What quantity from the regression model you will estimate
How you will conduct the simulations (the software you will use, how you will calculate your estimates, etc.)
The performance measure you will use (Hint: use the simulation from class as a guideline!)

You do not need to implement any of the simulations to answer this question.

In class, we used the following code to simulate data from a simple linear regression model with normal errors:

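n <- 100       # sample size
beta0 <- 0.5   # true intercept
beta1 <- 1     # true slope
x <- runif(n, min=0, max=1)       # explanatory variable
noise <- rnorm(n, mean=0, sd=1)   # normal errors with constant SD
y <- beta0 + beta1*x + noise      # response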

Notice that the errors here also have constant variance: sd = 1 for all errors in the simulation. However, for our new simulation study, we will need to simulate data for which the standard deviation is different for different observations. For example, in the simulation above we could set \(SD(\varepsilon_i) = X_i\), or \(SD(\varepsilon_i) = X_i^2\).
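
A useful fact here is that the `sd` argument of `rnorm` does not have to be a single number: it can be a vector, in which case each draw uses the corresponding standard deviation. The sketch below illustrates this on made-up standard deviations that are unrelated to the regression example; `example_sds`, `draws_per_sd`, and the seed are arbitrary choices for this illustration.

```r
set.seed(1)                      # arbitrary seed, for reproducibility

draws_per_sd <- 1000             # number of draws for each standard deviation
example_sds <- c(0.1, 1, 10)     # made-up standard deviations

# One standard deviation per draw: 1000 copies of 0.1, then of 1, then of 10
sd_for_each_draw <- rep(example_sds, each = draws_per_sd)

# rnorm pairs each draw with the matching entry of sd_for_each_draw
draws <- rnorm(n = length(sd_for_each_draw), mean = 0, sd = sd_for_each_draw)

# The sample SD within each group should be close to 0.1, 1, and 10
estimated_sds <- tapply(draws, sd_for_each_draw, sd)
estimated_sds
```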

Question 10

Modify the code above so that the noise \(\varepsilon_i\) is simulated with \(SD(\varepsilon_i) = X_i\). Then plot y vs. x and confirm that the constant variance assumption has been violated.