Homework 2
Due: Friday, September 15, 11:00am on Canvas
Instructions:
- Download the HW 2 template, and open the template (a Quarto document) in RStudio.
- Put your name in the file header
- Click
Render
- Type all code and answers in the document (using
###
for section headings and####
for question headings) - Render early and often to catch any errors!
- When you are finished, submit the final rendered HTML to Canvas
Code guidelines:
- If a question requires code, and code is not provided, you will not receive full credit
- You will be graded on the quality of your code. In addition to being
correct, your code should also be easy to read
- No magic numbers
- Use descriptive names for your variables
- Set seeds where needed
- Comment code
- If a block of code is being called multiple times, put it in a function
Resources: Homework 2 will give you practice with
loops, if
statements, simulation, and functions. In
addition to the class notes and activities, I recommend reading the
following resources:
- Chapter 3.1 – 3.2 (an introduction to vectors) in Advanced R
- Chapter 5 (loops and choices) in Advanced R
- Appendix C.2 in Modern Data Science with R
- Chapter 19.1 – 19.5.2 (functions and conditions) in R for Data Science (1st edition)
Probability simulation
Suppose there are exactly 100 seats in a movie theater. Further, suppose the movie theater has imposed a new ticketing system where each person’s has an assigned seat number. Consider the case where the movie is sold-out (all 100 seats will be taken) and the first person in the theater lost their ticket and takes a random seat. Each subsequent person attempts to go to their seat. If their seat is open, they take it. Otherwise, to avoid confrontation, they choose a random seat.
Question 1
Conduct a simulation to assess the probability that the last person into the theater will get their assigned seat.
Tips: Here are some tips for one approach to this simulation.
- Create a vector
seats
, containing the numbers 1,…,100 in order, to represent the (numbered) seats in the theater - Create a vector,
taken
, to represent which seats have been taken. At the beginning of each run of the simulation, all entries oftaken
are 0. - When a seat gets taken, change the corresponding entry in
taken
to 1 - Use
sample
to randomly choose a seat from the ones available seats[taken == 0]
will tell you which seats are available
Probability simulation
Consider the following \((a + 1)\) player game (Gal and Miltersen, 2007). There are \(a\) boxes with labels \(1, ..., a\) and slips of paper labeled \(1, 2, ..., a\) The lead player colors each slip of paper either red or blue and puts each slip of paper in a box so there is one and only one slip of paper per box, without the other \(a\) players observing.
Now, each player \(i ∈ {1, 2, ..., a}\) can look in at most \(a/2\) boxes and based on this make a guess about the color of the slip \(i\). This done by each player in isolation. The \(a\) players win if every player correctly announces the color of “their” slip.
Question 2
Conduct a simulation where the lead player colors the slips and adds them to the boxes at random. The other players randomly choose which boxes to open. What proportion of the time do all of the players see their own slip, guaranteeing they win? Does the proportion depend on \(a\)?
Tip: To efficiently check whether a vector contains a
specific value, you can use %in%
. For example,
## [1] TRUE
## [1] FALSE
Question 3
Conduct a simulation where the lead player colors the slips and adds them to the boxes at random. The other players randomly choose which boxes to open. Suppose a player that doesn’t see their own slip randomly guesses either red or blue. What proportion of the time do the \(a\) players win?
Practice with functions
The Huber loss function is defined as
\[\psi(x) = \begin{cases} x^2 & \text{if } |x| \leq 1 \\ 2|x| - 1 & \text{if } |x| > 1 \end{cases}\]
This function is quadratic on the interval \([-1,1]\), and linear outside of this interval. It transitions from quadratic to linear “smoothly”, and it is often used in place of the usual squared error loss for robust estimation.
Question 4
Write a function huber()
that takes as an input a number
\(x\), and returns the Huber value
\(\psi(x)\). Here is a code skeleton to
get you started:
Hint: The ifelse
function from class will be
helpful!
Question 5
HW 1 introduced the idea of vectorization. The
ifelse
function is vectorized, so if you used the
ifelse
function in Question 4, your huber
function should also be vectorized.
Check that the function is vectorized: e.g.,
huber(c(0.5, 1, 2))
should return
0.25 1.00 3.00
The Huber function can be modified so that the transition from quadratic to linear happens at an arbitrary cutoff value \(a\), as in:
\[\psi_a(x) = \begin{cases} x^2 & \text{if } |x| \leq a \\ 2|x| - 1 & \text{if } |x| > a \end{cases}\]
Question 6
Starting with your solution code to Question 4, update your
huber()
function so that it takes two arguments: \(x\), a number at which to evaluate the
loss, and \(a\) a number representing
the cutoff value. It should now return \(\psi_a(x)\), as defined above. Make \(a = 1\) the default in your
huber
function.
Continuing simulation study
In this part of the assignment, we will continue working on the simulation from HW 1, to assess the importance of the constant variance assumption in simple linear regression models.
For this assignment, you will simulate data from the following model:
\[Y_i = 0.5 + X_i + \varepsilon_i,\] where \(X_i \sim Uniform(0, 1)\) and \(\varepsilon_i \sim N(0, \sigma^2(X_i))\). You will use \(n_{sim} = 10000\) simulations, and you will assess performance using observed coverage of 95% confidence intervals for \(\beta_1 = 1\). You will simulate data under three different scenarios:
- \(\sigma(X_i) = 1\) (constant variance satisfied)
- \(\sigma(X_i) = X_i\)
- \(\sigma(X_i) = X_i^2\)
Question 7
Adapting code from class, simulate \(n = 50\) observations \((X_1, Y_1),...,(X_{50}, Y_{50})\) from the model above, with \(\sigma(X_i) = 1\). Calculate a 95% confidence interval for \(\beta_1\), and check whether the confidence interval does indeed capture \(\beta_1 = 1\).
Question 8
Repeat Question 7 nsim = 10000
times. What fraction of
your 95% confidence intervals actually contained \(\beta_1 = 1\)?
Question 9
Repeat Question 8, but this time use \(\sigma(X_i) = X_i\) (see code from HW 1). Now what fraction of the 95% confidence intervals actually contained \(\beta_1= 1\)?
Question 10
Repeat Question 8, but this time use \(\sigma(X_i) = X_i^2\). Now what fraction of the 95% confidence intervals actually contained \(\beta_1= 1\)?
Question 11
Repeat Questions 8-10, but this time use a sample size of \(n = 10\). Does your coverage change?
Question 12
Repeat Questions 8-10, but this time use a sample size of \(n = 1000\). Does your coverage change?
Question 13
Fill in the following table to display your simulation results:
n (sample size) | \(\sigma(X_i) = 1\) | \(\sigma(X_i) = X_i\) | \(\sigma(X_i) = X_i^2\) |
---|---|---|---|
10 | |||
50 | |||
1000 |
(For each entry in the table, report the observed confidence interval coverage in your simulations). See the Quarto table documentation for help on formatting tables.