Homework 4

Due: Friday, September 29, 11:00am on Canvas

Instructions:

  • Download the HW 4 template, and open the template (a Quarto document) in RStudio.
  • Put your name in the file header
  • Click Render
  • Type all code and answers in the document (using ### for section headings and #### for question headings)
  • Render early and often to catch any errors!
  • When you are finished, submit the final rendered HTML to Canvas

Code guidelines:

  • If a question requires code, and code is not provided, you will not receive full credit
  • You will be graded on the quality of your code. In addition to being correct, your code should also be easy to read
    • No magic numbers
    • Use descriptive names for your variables
    • Set seeds where needed
    • Comment code
    • If a block of code is being called multiple times, put it in a function

Resources: In addition to the class notes and activities, I recommend reading the following resources:

  • Chapter 4 (NumPy basics) in Python for Data Analysis (3rd edition)
  • Appendix C.2 in Modern Data Science with R

Iteration in Python (3 ways)

In HW 1, you modified code that calculated \(\sqrt{x}\) for each \(x = 0, 0.1, 0.2,...,1\). The for loop version of the code was

x <- seq(from=0, to=1, by=0.1)
sqrt_x <- rep(0, length(x))
for(i in 1:length(x)){
  sqrt_x[i] <- sqrt(x[i])
}
sqrt_x

and the vectorized version of the code was

x <- seq(from=0, to=1, by=0.1)
sqrt_x <- sqrt(x)

To help us learn some Python, let’s see how to replicate this code in Python. Like R, Python is a modular language (we can get lots of extra functionality by loading extra libraries). Some of the most common modules (particularly for statistics, data science, and machine learning) are numpy, scipy, pandas, and scikit-learn.

Here we will reproduce the above R code in Python, with help from the numpy module. Run the following Python code (e.g.: in your Quarto document, create a Python chunk, copy the code, run the Python chunk):

import numpy as np

# create x
x = np.linspace(0, 1, 11)

# for loop version
# create sqrt_x for output
sqrt_x = np.zeros(11)
for i in range(11):
  sqrt_x[i] = np.sqrt(x[i])

sqrt_x

# "vectorized" version
sqrt_x = np.sqrt(x)
sqrt_x

# list comprehension version
sqrt_x = [np.sqrt(a) for a in x]
sqrt_x

Some notes:

  • We have loaded the numpy module, and called it np for short. Now when we want to use the numpy module in our code, we will reference np
  • linspace, zeros, and sqrt are functions in the numpy library, so we call them using np.linspace, np.zeros, and np.sqrt
  • np.linspace works pretty similarly to the seq function in R. We specify the range (0 to 1) and the number of steps (11); the result is 0, 0.1, …, 0.9, 1
  • np.zeros(11) is equivalent to rep(0, 11) in R
  • for i in range(11) is essentially equivalent to for(i in 1:11) in R. The only difference is that Python is 0-indexed, so range(11) is really 0, 1, 2, …, 10
  • In R, the basic object is a vector. The closest object to R’s vector is possibly a NumPy array. This allows operations like “vectorized” functions, matrix multiplication, etc.
  • A list comprehension is another way of iterating in Python (and in some sense is the most “Pythonic” approach). The code [np.sqrt(a) for a in x] means “create a list, where each entry of the list is the square root of the corresponding entry in x”.
  • You will notice that, for both the for loop version and the “vectorized” versions of the code, sqrt_x is a numpy array. However, the list comprehension version produces (as we would expect!) a list instead

Question 1

Modify the Python code above to calculate \(x^2\) instead of \(\sqrt{x}\). (In Python, x**2 is the equivalent of x^2 in R). Use the 3 iteration methods discussed above (for loop, vectorization, list comprehension).

Notes:

  • You may use ChatGPT to check your work, but I expect you to make an honest effort to write the code yourself.

Question 2

Modify the code from Question 2 so that instead of considering \(x = 0, 0.1, 0.2, ..., 0.9, 1\) (i.e. the numbers between 0 and 1, in increments of 0.1), we consider \(x = 0, 0.05, 0.10, 0.15, ..., 1.95, 2\) (the numbers between 0 and 2, in increments of 0.05).

Notes:

  • You may use ChatGPT to check your work, but I expect you to make an honest effort to write the code yourself.

Some simulations in Python

In the following questions, you will practice working in Python by re-implementing some of questions from previous assignments in Python. The following table shows you corresponding Python code for several R operations.

R code (approximate) Python equivalent
length len
dim(x) x.shape
seq np.linspace or np.arange
rep(0, n) np.zeros(n)
runif(n, a, b) np.random.uniform(a, b, n)
mean(x) np.mean(x)
for(i in 1:n) for i in range(n):
ifelse(...) np.where(...)
x[i] x[i] (but remember python is 0-indexed)
if(...){

} else {

}
if ... :

else:
rnorm(n, mean=0, sd=0.5) np.random.normal(loc=0, scale=0.5, n)
set.seed(...) np.random.seed(...)
sample(...) np.random.choice(...)

Question 3

Re-do Question 4 from HW 1 in Python. Confirm that you get a similar result.

Notes:

  • You may use ChatGPT to check your work, but I expect you to make an honest effort to write the code yourself.

Question 4

Re-do Question 1 from HW 2 in Python. Confirm that you get a similar result.

Hints:

  • R is pretty flexible with int (an integer) vs. float (a floating point number). E.g., c(1, 2, 3)[1.5] will return something in R. The same is not true in Python; your indices must be integers in Python
  • The np.arange function may be useful. For example, to generate an array of seats numbered 0 to 99, use seats = np.arange(0, 100, 1)
  • Remember that Python is 0-index; if seats has length 100, then seats[0] is the first entry, and seats[99] is the last entry

Notes:

  • You may use ChatGPT to check your work, but I expect you to make an honest effort to write the code yourself.

Functions in Python

Here’s a simple example of a function in R which takes two numbers and adds them together (this is literally what + does, so the function is pointless, but bear with me for the purposes of demonstration):

my_sum <- function(a, b){
  return(a + b)
}

Here is the corresponding function definition in Python:

def my_sum(a, b):
  return a + b

Let’s practice writing a short function in Python by re-writing the Huber loss function from HW 2.

Question 5

Re-do Question 6 from HW 2 in Python. Here is a code skeleton to get you started:

def huber(x, a):

Notes:

  • You may use ChatGPT to check your work, but I expect you to make an honest effort to write the code yourself.

Probability simulation: the Monty Hall problem

Suppose you are on a game show, with three closed doors. Behind two of the doors is a goat, and behind one of the doors is a car. You randomly select one of the three doors. The host then opens one of the remaining doors to reveal a goat, and gives you the option of sticking with your original choice, or switching to the other unopened door. For contestants who choose to switch doors, what is the probability of winning the car?

Question 6

Write a simulation, in Python, to estimate the probability of winning the car by switching doors.

Notes: * You may NOT use ChatGPT, or search for existing Monty Hall simulation code. * You may read about the Monty Hall problem and the mathematical solution (e.g. on Wikipedia). The correct answer is that switching gives a 2/3 chance of winning the car.

Reproducibility

When we do research, we want our work to be reproducible: other researchers should be able to reproduce our results, given the same data. Unfortunately, published research is not always reproducible. In this part of the assignment, you will read about why reproducibility is important, and some good practices for making your own work more reproducible.

To begin, read the following articles at the links provided:

Then answer the following questions.

Question 7

Briefly summarize why reproducibility is important in computational research.

Question 8

What steps can a researcher take to make their work more reproducible?

Question 9

The articles here focus mostly on reproducibility in academic research, particularly published academic research. Suppose you are performing statistical analysis for a private company instead. Why might reproducibility (perhaps by another statistician in your company) still be important?