Homework 4
Due: Friday, September 29, 11:00am on Canvas
Instructions:
- Download the HW 4 template, and open the template (a Quarto document) in RStudio.
- Put your name in the file header
- Click
Render
- Type all code and answers in the document (using
###
for section headings and####
for question headings) - Render early and often to catch any errors!
- When you are finished, submit the final rendered HTML to Canvas
Code guidelines:
- If a question requires code, and code is not provided, you will not receive full credit
- You will be graded on the quality of your code. In addition to being
correct, your code should also be easy to read
- No magic numbers
- Use descriptive names for your variables
- Set seeds where needed
- Comment code
- If a block of code is being called multiple times, put it in a function
Resources: In addition to the class notes and activities, I recommend reading the following resources:
- Chapter 4 (NumPy basics) in Python for Data Analysis (3rd edition)
- Appendix C.2 in Modern Data Science with R
Iteration in Python (3 ways)
In HW 1, you modified code that calculated \(\sqrt{x}\) for each \(x = 0, 0.1, 0.2,...,1\). The
for
loop version of the code was
x <- seq(from=0, to=1, by=0.1)
sqrt_x <- rep(0, length(x))
for(i in 1:length(x)){
sqrt_x[i] <- sqrt(x[i])
}
sqrt_x
and the vectorized version of the code was
To help us learn some Python, let’s see how to replicate this code in
Python. Like R, Python is a modular language (we can get lots of extra
functionality by loading extra libraries). Some of the most common
modules (particularly for statistics, data science, and machine
learning) are numpy
, scipy
,
pandas
, and scikit-learn
.
Here we will reproduce the above R code in Python, with help from the
numpy
module. Run the following Python code (e.g.: in your
Quarto document, create a Python chunk, copy the code, run the Python
chunk):
import numpy as np
# create x
x = np.linspace(0, 1, 11)
# for loop version
# create sqrt_x for output
sqrt_x = np.zeros(11)
for i in range(11):
sqrt_x[i] = np.sqrt(x[i])
sqrt_x
# "vectorized" version
sqrt_x = np.sqrt(x)
sqrt_x
# list comprehension version
sqrt_x = [np.sqrt(a) for a in x]
sqrt_x
Some notes:
- We have loaded the
numpy
module, and called itnp
for short. Now when we want to use thenumpy
module in our code, we will referencenp
linspace
,zeros
, andsqrt
are functions in thenumpy
library, so we call them usingnp.linspace
,np.zeros
, andnp.sqrt
np.linspace
works pretty similarly to theseq
function in R. We specify the range (0 to 1) and the number of steps (11); the result is 0, 0.1, …, 0.9, 1np.zeros(11)
is equivalent torep(0, 11)
in Rfor i in range(11)
is essentially equivalent tofor(i in 1:11)
in R. The only difference is that Python is 0-indexed, sorange(11)
is really 0, 1, 2, …, 10- In R, the basic object is a vector. The closest object to R’s vector is possibly a NumPy array. This allows operations like “vectorized” functions, matrix multiplication, etc.
- A list comprehension is another way of iterating in Python (and in
some sense is the most “Pythonic” approach). The code
[np.sqrt(a) for a in x]
means “create a list, where each entry of the list is the square root of the corresponding entry inx
”. - You will notice that, for both the
for
loop version and the “vectorized” versions of the code,sqrt_x
is a numpyarray
. However, the list comprehension version produces (as we would expect!) a list instead
Question 1
Modify the Python code above to calculate \(x^2\) instead of \(\sqrt{x}\). (In Python, x**2
is the equivalent of x^2
in R). Use the 3 iteration methods
discussed above (for loop, vectorization, list comprehension).
Notes:
- You may use ChatGPT to check your work, but I expect you to make an honest effort to write the code yourself.
Question 2
Modify the code from Question 2 so that instead of considering \(x = 0, 0.1, 0.2, ..., 0.9, 1\) (i.e. the numbers between 0 and 1, in increments of 0.1), we consider \(x = 0, 0.05, 0.10, 0.15, ..., 1.95, 2\) (the numbers between 0 and 2, in increments of 0.05).
Notes:
- You may use ChatGPT to check your work, but I expect you to make an honest effort to write the code yourself.
Some simulations in Python
In the following questions, you will practice working in Python by re-implementing some of questions from previous assignments in Python. The following table shows you corresponding Python code for several R operations.
R code | (approximate) Python equivalent |
---|---|
length |
len |
dim(x) |
x.shape |
seq |
np.linspace or np.arange |
rep(0, n) |
np.zeros(n) |
runif(n, a, b) |
np.random.uniform(a, b, n) |
mean(x) |
np.mean(x) |
for(i in 1:n) |
for i in range(n): |
ifelse(...) |
np.where(...) |
x[i] |
x[i] (but remember python is 0-indexed) |
if(...){ } else { } |
if ... : else: |
rnorm(n, mean=0, sd=0.5) |
np.random.normal(loc=0, scale=0.5, n) |
set.seed(...) |
np.random.seed(...) |
sample(...) |
np.random.choice(...) |
Question 3
Re-do Question 4 from HW 1 in Python. Confirm that you get a similar result.
Notes:
- You may use ChatGPT to check your work, but I expect you to make an honest effort to write the code yourself.
Question 4
Re-do Question 1 from HW 2 in Python. Confirm that you get a similar result.
Hints:
- R is pretty flexible with
int
(an integer) vs.float
(a floating point number). E.g.,c(1, 2, 3)[1.5]
will return something in R. The same is not true in Python; your indices must be integers in Python - The
np.arange
function may be useful. For example, to generate an array of seats numbered 0 to 99, useseats = np.arange(0, 100, 1)
- Remember that Python is 0-index; if
seats
has length 100, thenseats[0]
is the first entry, andseats[99]
is the last entry
Notes:
- You may use ChatGPT to check your work, but I expect you to make an honest effort to write the code yourself.
Functions in Python
Here’s a simple example of a function in R which takes two numbers
and adds them together (this is literally what +
does, so
the function is pointless, but bear with me for the purposes of
demonstration):
Here is the corresponding function definition in Python:
Let’s practice writing a short function in Python by re-writing the Huber loss function from HW 2.
Probability simulation: the Monty Hall problem
Suppose you are on a game show, with three closed doors. Behind two of the doors is a goat, and behind one of the doors is a car. You randomly select one of the three doors. The host then opens one of the remaining doors to reveal a goat, and gives you the option of sticking with your original choice, or switching to the other unopened door. For contestants who choose to switch doors, what is the probability of winning the car?
Question 6
Write a simulation, in Python, to estimate the probability of winning the car by switching doors.
Notes: * You may NOT use ChatGPT, or search for existing Monty Hall simulation code. * You may read about the Monty Hall problem and the mathematical solution (e.g. on Wikipedia). The correct answer is that switching gives a 2/3 chance of winning the car.
Reproducibility
When we do research, we want our work to be reproducible: other researchers should be able to reproduce our results, given the same data. Unfortunately, published research is not always reproducible. In this part of the assignment, you will read about why reproducibility is important, and some good practices for making your own work more reproducible.
To begin, read the following articles at the links provided:
- Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226-1227.
- Broman, K., Cetinkaya-Rundel, M., Nussbaum, A., Paciorek, C., Peng, R., Turek, D., & Wickham, H. (2017). Recommendations to funding agencies for supporting reproducible research. American Statistical Association.
Then answer the following questions.
Question 7
Briefly summarize why reproducibility is important in computational research.
Question 8
What steps can a researcher take to make their work more reproducible?
Question 9
The articles here focus mostly on reproducibility in academic research, particularly published academic research. Suppose you are performing statistical analysis for a private company instead. Why might reproducibility (perhaps by another statistician in your company) still be important?