Homework 1
Due: Friday, September 8, 11:00am on Canvas
Instructions:
- Download the HW 1 template, and open the template (a Quarto document) in RStudio.
- Put your name in the file header
- Click
Render
- Type all code and answers in the document (using
###
for section headings and####
for question headings) - Render early and often to catch any errors!
- When you are finished, submit the final rendered HTML to Canvas
Code guidelines:
- If a question requires code, and code is not provided, you will not receive full credit
- You will be graded on the quality of your code. In addition to being
correct, your code should also be easy to read
- No magic numbers
- Use descriptive names for your variables
- Set seeds where needed
- Comment code
Resources: Homework 1 will give you practice with
loops, if
statements, and simulation. In addition to the
class notes and activities, I recommend reading the following
resources:
- Appendix B (overview of R) in Modern Data Science with R
- Chapter 3.1 – 3.2 (an introduction to vectors) in Advanced R
- Chapter 5 (loops and choices) in Advanced R
Practice with for
loops
The purpose of this section is to give you some more practice working with for loops and sequences, which are useful tools for efficiently repeating a process many times. Here is an example for loop that calculates \(\sqrt{x}\) for a sequence of numbers \(x = 0, 0.1, 0.2, ..., 0.9, 1\):
x <- seq(from=0, to=1, by=0.1)
sqrt_x <- rep(0, length(x))
for(i in 1:length(x)){
sqrt_x[i] <- sqrt(x[i])
}
sqrt_x
Below are some short practice questions to help you get more
comfortable creating for
loops.
Question 1
Modify the loop above so that instead of calculating \(\sqrt{x}\), we calculate \(x^{1/3}\).
Question 2
Modify the loop from Question 1 so that instead of considering \(x = 0, 0.1, 0.2, ..., 0.9, 1\) (i.e. the numbers between 0 and 1, in increments of 0.1), we consider \(x = 0, 0.05, 0.10, 0.15, ..., 1.95, 2\) (the numbers between 0 and 2, in increments of 0.05).
Note: In Questions 1 and 2, you are applying a
function to each element in a vector. Here you have used a
for
loop, because the purpose of these questions is to
practice loops. However, for
loops are not always the most
efficient way to write code. Instead, many functions in R are
vectorized: if you apply the function to a vector, it is
applied to each element of the vector. For example,
produces the same output as the for
loop above.
Question 3
Re-write the code for Question 1, using vectorization instead of the
for
loop.
Probability simulation
Consider a tug-of-war competition for robots. In each match up, two robots take turns tugging the rope until the marker indicates that one of the robots won. The match starts with the marker at 0.
- Robot A pulls the rope – use
runif(n=1,min=0,max=0.50)
to simulate the magnitude of the pull. Adding the simulated value to the marker position gives the new position of the marker. - Robot B pulls the rope in the opposite direction – use
runif(n=1,min=0,max=0.50)
to simulate the magnitude of the pull. Adding the simulated value to the marker position gives the new position of the marker. - The two robots continue taking turns until the marker moves past -0.50 or 0.50.
Question 4
Write code that simulates 1000 robot tug of war battles.
Question 5
Report the results of 1000 simulated robot tug of war battles. Is the game fair? If not, what adjustments can be made to make it more fair?
Designing simulation studies
In class, we have started to discuss the use of simulation studies to address statistical questions, such as “How important is the normality assumption in a simple linear regression model?” A simulation study allows us investigate these questions by simulating data under a variety of different conditions (e.g., different violations of the normality assumption), and seeing how the statistical methods behave under these different conditions.
The paper “Using simulation studies to evaluate statistical methods” (Morris et al. 2019) provides a good overview of the important steps in designing a simulation study. Read sections 1 (Introduction) and 3 (Planning simulation studies), and then answer the following questions.
Question 6
What are some reasons researchers use simulation studies?
Question 7
According to the paper, what are the five components (abbreviated ADEMP) involved in planning a simulation study? Summarize each of the five components.
Question 8
In class, we started designing a simulation study to investigate the importance of the normality assumption in simple linear regression. For this simulation, describe each of the ADEMP components.
A new simulation study
We return here to the simple linear regression model:
\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\] Another assumption from STA 112 is that the noise term \(\varepsilon_i\) has constant variance; that is, \(Var(\varepsilon_i) = \sigma^2\) for all observations (the variance does not depend on \(X_i\), e.g.).
Suppose we want to design a simulation study to assess how important the constant variance assumption is. In this section of the assignment, you will plan out your simulation study. In HW 2, you will carry out the simulations.
Question 9
Use the ADEMP framework to plan a simulation study to explore the constant variance assumption. That is, you should describe
- The aims of your study
- How you will generate the data
- What quantity from the regression model you will estimate
- How you will conduct the simulations (the software you will use, how you will calculate your estimates, etc.)
- The performance measure you will use (Hint: use the simulation from class as a guideline!)
You do not need to implement any of the simulations to answer this question.
In class, we used the following code to simulate data from a simple linear regression model with normal errors:
n <- 100
beta0 <- 0.5
beta1 <- 1
x <- runif(n, min=0, max=1)
noise <- rnorm(n, mean=0, sd=1)
y <- beta0 + beta1*x + noise
Notice that the errors here also have constant variance:
sd = 1
for all errors in the simulation. However,
for our new simulation study, we will need to simulate data for which
the standard deviation is different for different observations.
For example, in the simulation above we could set \(SD(\varepsilon_i) = X_i\), or \(SD(\varepsilon_i) = X_i^2\).
Question 10
Modify the code above so that the noise \(\varepsilon_i\) is simulated with \(SD(\varepsilon_i) = X_i\). Then plot
y
vs. x
and confirm that the constant variance
assumption has been violated.