Homework 2

Due: Friday, September 15, 11:00am on Canvas

Instructions:

Download the HW 2 template, and open the template (a Quarto document) in RStudio.
Put your name in the file header
Click Render
Type all code and answers in the document (using ### for section headings and #### for question headings)
Render early and often to catch any errors!
When you are finished, submit the final rendered HTML to Canvas

Code guidelines:

If a question requires code, and code is not provided, you will not receive full credit
You will be graded on the quality of your code. In addition to being correct, your code should also be easy to read
- No magic numbers
- Use descriptive names for your variables
- Set seeds where needed
- Comment code
- If a block of code is being called multiple times, put it in a function

Resources: Homework 2 will give you practice with loops, if statements, simulation, and functions. In addition to the class notes and activities, I recommend reading the following resources:

Chapter 3.1 – 3.2 (an introduction to vectors) in Advanced R
Chapter 5 (loops and choices) in Advanced R
Appendix C.2 in Modern Data Science with R
Chapter 19.1 – 19.5.2 (functions and conditions) in R for Data Science (1st edition)

Probability simulation

Suppose there are exactly 100 seats in a movie theater. Further, suppose the movie theater has imposed a new ticketing system where each person’s has an assigned seat number. Consider the case where the movie is sold-out (all 100 seats will be taken) and the first person in the theater lost their ticket and takes a random seat. Each subsequent person attempts to go to their seat. If their seat is open, they take it. Otherwise, to avoid confrontation, they choose a random seat.

Question 1

Conduct a simulation to assess the probability that the last person into the theater will get their assigned seat.

Tips: Here are some tips for one approach to this simulation.

Create a vector seats, containing the numbers 1,…,100 in order, to represent the (numbered) seats in the theater
Create a vector, taken, to represent which seats have been taken. At the beginning of each run of the simulation, all entries of taken are 0.
When a seat gets taken, change the corresponding entry in taken to 1
Use sample to randomly choose a seat from the ones available
seats[taken == 0] will tell you which seats are available

Probability simulation

Consider the following \((a + 1)\) player game (Gal and Miltersen, 2007). There are \(a\) boxes with labels \(1, ..., a\) and slips of paper labeled \(1, 2, ..., a\) The lead player colors each slip of paper either red or blue and puts each slip of paper in a box so there is one and only one slip of paper per box, without the other \(a\) players observing.

Now, each player \(i ∈ {1, 2, ..., a}\) can look in at most \(a/2\) boxes and based on this make a guess about the color of the slip \(i\). This done by each player in isolation. The \(a\) players win if every player correctly announces the color of “their” slip.

Question 2

Conduct a simulation where the lead player colors the slips and adds them to the boxes at random. The other players randomly choose which boxes to open. What proportion of the time do all of the players see their own slip, guaranteeing they win? Does the proportion depend on \(a\)?

Tip: To efficiently check whether a vector contains a specific value, you can use %in%. For example,

3 %in% c(1, 2, 3)

## [1] TRUE

4 %in% c(1, 2, 3)

## [1] FALSE

Question 3

Conduct a simulation where the lead player colors the slips and adds them to the boxes at random. The other players randomly choose which boxes to open. Suppose a player that doesn’t see their own slip randomly guesses either red or blue. What proportion of the time do the \(a\) players win?

Practice with functions

The Huber loss function is defined as

\[\psi(x) = \begin{cases} x^2 & \text{if } |x| \leq 1 \\ 2|x| - 1 & \text{if } |x| > 1 \end{cases}\]

This function is quadratic on the interval \([-1,1]\), and linear outside of this interval. It transitions from quadratic to linear “smoothly”, and it is often used in place of the usual squared error loss for robust estimation.

Question 4

Write a function huber() that takes as an input a number \(x\), and returns the Huber value \(\psi(x)\). Here is a code skeleton to get you started:

huber <- function(x){

}

Hint: The ifelse function from class will be helpful!

Question 5

HW 1 introduced the idea of vectorization. The ifelse function is vectorized, so if you used the ifelse function in Question 4, your huber function should also be vectorized.

Check that the function is vectorized: e.g., huber(c(0.5, 1, 2)) should return 0.25 1.00 3.00

The Huber function can be modified so that the transition from quadratic to linear happens at an arbitrary cutoff value \(a\), as in:

\[\psi_a(x) = \begin{cases} x^2 & \text{if } |x| \leq a \\ 2|x| - 1 & \text{if } |x| > a \end{cases}\]

Question 6

Starting with your solution code to Question 4, update your huber() function so that it takes two arguments: \(x\), a number at which to evaluate the loss, and \(a\) a number representing the cutoff value. It should now return \(\psi_a(x)\), as defined above. Make \(a = 1\) the default in your huber function.

Continuing simulation study

In this part of the assignment, we will continue working on the simulation from HW 1, to assess the importance of the constant variance assumption in simple linear regression models.

For this assignment, you will simulate data from the following model:

\[Y_i = 0.5 + X_i + \varepsilon_i,\] where \(X_i \sim Uniform(0, 1)\) and \(\varepsilon_i \sim N(0, \sigma^2(X_i))\). You will use \(n_{sim} = 10000\) simulations, and you will assess performance using observed coverage of 95% confidence intervals for \(\beta_1 = 1\). You will simulate data under three different scenarios:

\(\sigma(X_i) = 1\) (constant variance satisfied)
\(\sigma(X_i) = X_i\)
\(\sigma(X_i) = X_i^2\)

Question 7

Adapting code from class, simulate \(n = 50\) observations \((X_1, Y_1),...,(X_{50}, Y_{50})\) from the model above, with \(\sigma(X_i) = 1\). Calculate a 95% confidence interval for \(\beta_1\), and check whether the confidence interval does indeed capture \(\beta_1 = 1\).

Question 8

Repeat Question 7 nsim = 10000 times. What fraction of your 95% confidence intervals actually contained \(\beta_1 = 1\)?

Question 9

Repeat Question 8, but this time use \(\sigma(X_i) = X_i\) (see code from HW 1). Now what fraction of the 95% confidence intervals actually contained \(\beta_1= 1\)?

Question 10

Repeat Question 8, but this time use \(\sigma(X_i) = X_i^2\). Now what fraction of the 95% confidence intervals actually contained \(\beta_1= 1\)?

Question 11

Repeat Questions 8-10, but this time use a sample size of \(n = 10\). Does your coverage change?

Question 12

Repeat Questions 8-10, but this time use a sample size of \(n = 1000\). Does your coverage change?

Question 13

Fill in the following table to display your simulation results:

n (sample size)	\(\sigma(X_i) = 1\)	\(\sigma(X_i) = X_i\)	\(\sigma(X_i) = X_i^2\)
10
50
1000

(For each entry in the table, report the observed confidence interval coverage in your simulations). See the Quarto table documentation for help on formatting tables.