Final exam review

Below are questions to help you study for the final exam. These are examples of the kinds of questions I might ask.

Writing code

Practice with data wrangling

In each of the questions below, write code (in R or Python) to produce the output from the original data.

  1. Original data:
## # A tibble: 10 × 4
##    species island    bill_length_mm bill_depth_mm
##    <fct>   <fct>              <dbl>         <dbl>
##  1 Gentoo  Biscoe              54.3          15.7
##  2 Gentoo  Biscoe              45.2          14.8
##  3 Adelie  Biscoe              37.7          18.7
##  4 Adelie  Torgersen           39.3          20.6
##  5 Adelie  Dream               36            18.5
##  6 Adelie  Dream               40.2          17.1
##  7 Adelie  Torgersen           38.5          17.9
##  8 Gentoo  Biscoe              43.8          13.9
##  9 Gentoo  Biscoe              51.5          16.3
## 10 Adelie  Dream               41.1          19

Output:

## # A tibble: 4 × 3
##   species island        n
##   <fct>   <fct>     <int>
## 1 Adelie  Biscoe        1
## 2 Adelie  Dream         3
## 3 Adelie  Torgersen     2
## 4 Gentoo  Biscoe        4
  1. Original data:
## # A tibble: 10 × 4
##    species island    bill_length_mm bill_depth_mm
##    <fct>   <fct>              <dbl>         <dbl>
##  1 Gentoo  Biscoe              54.3          15.7
##  2 Gentoo  Biscoe              45.2          14.8
##  3 Adelie  Biscoe              37.7          18.7
##  4 Adelie  Torgersen           39.3          20.6
##  5 Adelie  Dream               36            18.5
##  6 Adelie  Dream               40.2          17.1
##  7 Adelie  Torgersen           38.5          17.9
##  8 Gentoo  Biscoe              43.8          13.9
##  9 Gentoo  Biscoe              51.5          16.3
## 10 Adelie  Dream               41.1          19

Output:

## # A tibble: 4 × 3
## # Groups:   island [3]
##   island    species mean_length
##   <fct>     <fct>         <dbl>
## 1 Biscoe    Adelie         37.7
## 2 Biscoe    Gentoo         48.7
## 3 Dream     Adelie         39.1
## 4 Torgersen Adelie         38.9
  1. Original data:
## # A tibble: 10 × 4
##    species island    bill_length_mm bill_depth_mm
##    <fct>   <fct>              <dbl>         <dbl>
##  1 Gentoo  Biscoe              54.3          15.7
##  2 Gentoo  Biscoe              45.2          14.8
##  3 Adelie  Biscoe              37.7          18.7
##  4 Adelie  Torgersen           39.3          20.6
##  5 Adelie  Dream               36            18.5
##  6 Adelie  Dream               40.2          17.1
##  7 Adelie  Torgersen           38.5          17.9
##  8 Gentoo  Biscoe              43.8          13.9
##  9 Gentoo  Biscoe              51.5          16.3
## 10 Adelie  Dream               41.1          19

Output:

## # A tibble: 10 × 5
##    species island    bill_length_mm bill_depth_mm bill_ratio
##    <fct>   <fct>              <dbl>         <dbl>      <dbl>
##  1 Gentoo  Biscoe              54.3          15.7       3.46
##  2 Gentoo  Biscoe              45.2          14.8       3.05
##  3 Adelie  Biscoe              37.7          18.7       2.02
##  4 Adelie  Torgersen           39.3          20.6       1.91
##  5 Adelie  Dream               36            18.5       1.95
##  6 Adelie  Dream               40.2          17.1       2.35
##  7 Adelie  Torgersen           38.5          17.9       2.15
##  8 Gentoo  Biscoe              43.8          13.9       3.15
##  9 Gentoo  Biscoe              51.5          16.3       3.16
## 10 Adelie  Dream               41.1          19         2.16
  1. Original data:
## # A tibble: 10 × 4
##    species island    bill_length_mm bill_depth_mm
##    <fct>   <fct>              <dbl>         <dbl>
##  1 Gentoo  Biscoe              54.3          15.7
##  2 Gentoo  Biscoe              45.2          14.8
##  3 Adelie  Biscoe              37.7          18.7
##  4 Adelie  Torgersen           39.3          20.6
##  5 Adelie  Dream               36            18.5
##  6 Adelie  Dream               40.2          17.1
##  7 Adelie  Torgersen           38.5          17.9
##  8 Gentoo  Biscoe              43.8          13.9
##  9 Gentoo  Biscoe              51.5          16.3
## 10 Adelie  Dream               41.1          19

Output:

## # A tibble: 3 × 4
##   species island bill_length_mm bill_depth_mm
##   <fct>   <fct>           <dbl>         <dbl>
## 1 Adelie  Dream            36            18.5
## 2 Adelie  Dream            40.2          17.1
## 3 Adelie  Dream            41.1          19
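One possible approach to the four penguin questions above, sketched with dplyr. The name `penguins_sample` is an assumption (the questions do not name the data), and the tibble is rebuilt by hand from the printed rows so that the sketch runs on its own.

```r
library(dplyr)

# Hypothetical reconstruction of the 10-row sample printed above
penguins_sample <- tibble(
  species = factor(c("Gentoo", "Gentoo", "Adelie", "Adelie", "Adelie",
                     "Adelie", "Adelie", "Gentoo", "Gentoo", "Adelie")),
  island = factor(c("Biscoe", "Biscoe", "Biscoe", "Torgersen", "Dream",
                    "Dream", "Torgersen", "Biscoe", "Biscoe", "Dream")),
  bill_length_mm = c(54.3, 45.2, 37.7, 39.3, 36, 40.2, 38.5, 43.8, 51.5, 41.1),
  bill_depth_mm = c(15.7, 14.8, 18.7, 20.6, 18.5, 17.1, 17.9, 13.9, 16.3, 19)
)

# Number of rows for each species/island combination
counts <- penguins_sample %>%
  count(species, island)

# Mean bill length within each island/species group
means <- penguins_sample %>%
  group_by(island, species) %>%
  summarize(mean_length = mean(bill_length_mm))

# New column: ratio of bill length to bill depth
ratios <- penguins_sample %>%
  mutate(bill_ratio = bill_length_mm / bill_depth_mm)

# Keep only the rows from Dream island
dream <- penguins_sample %>%
  filter(island == "Dream")
```

Grouping by `island` before `species` is what puts `island` first in the second output and leaves the result grouped by island.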
  1. Do this question without explicitly listing the columns.

Original data:

##   x1 x2 x3 y1 y2 y3
## 1  1  a  3  d  2  7
## 2  2  b  1  e  7  1
## 3  3  c  4  f  9  2

Output:

##   x1 x3 y2 y3
## 1  1  3  2  7
## 2  2  1  7  1
## 3  3  4  9  2
  1. Do this question without explicitly listing the columns.

Original data:

##   x1 x2 x3 y1 y2 y3
## 1  1  a  3  d  2  7
## 2  2  b  1  e  7  1
## 3  3  c  4  f  9  2

Output:

##   x1 x2 x3
## 1  1  a  3
## 2  2  b  1
## 3  3  c  4
  1. Do this question without explicitly listing the columns.

Original data:

##   x1 x2 x3 y1 y2 y3
## 1  1  a  3  d  2  7
## 2  2  b  1  e  7  1
## 3  3  c  4  f  9  2

Output:

##   mean_x1  mean_x3
## 1       2 2.666667
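The three "without explicitly listing the columns" questions can be approached with tidyselect helpers. The sketch below is one option, not the only acceptable answer. The data frame name `df` is an assumption, and the data are retyped from the printed output.

```r
library(dplyr)

# Hypothetical reconstruction of the data printed above
df <- data.frame(
  x1 = c(1, 2, 3), x2 = c("a", "b", "c"), x3 = c(3, 1, 4),
  y1 = c("d", "e", "f"), y2 = c(2, 7, 9), y3 = c(7, 1, 2)
)

# Keep every numeric column (x1, x3, y2, y3)
numeric_cols <- df %>%
  select(where(is.numeric))

# Keep every column whose name starts with "x"
x_cols <- df %>%
  select(starts_with("x"))

# Mean of each numeric column whose name starts with "x"
x_means <- df %>%
  summarize(across(starts_with("x") & where(is.numeric), mean,
                   .names = "mean_{.col}"))
```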
  1. Original data:
##   id x_1 x_2 y_1 y_2
## 1  1   3   5   0   2
## 2  2   1   8   1   7
## 3  3   4   9   2   9

Output:

## # A tibble: 12 × 4
##       id group obs   value
##    <dbl> <chr> <chr> <dbl>
##  1     1 x     1         3
##  2     1 x     2         5
##  3     1 y     1         0
##  4     1 y     2         2
##  5     2 x     1         1
##  6     2 x     2         8
##  7     2 y     1         1
##  8     2 y     2         7
##  9     3 x     1         4
## 10     3 x     2         9
## 11     3 y     1         2
## 12     3 y     2         9
  1. Original data:
##   id group value
## 1  1     x     6
## 2  1     y     1
## 3  2     x     4
## 4  2     y     4
## 5  3     x     4
## 6  3     y     1

Output:

## # A tibble: 3 × 3
##      id     x     y
##   <dbl> <int> <int>
## 1     1     6     1
## 2     2     4     4
## 3     3     4     1
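The last two wrangling questions are reshaping problems. Here is a minimal sketch using tidyr, assuming the hypothetical names `wide` and `long2` for the two input data frames (neither is named in the questions).

```r
library(dplyr)
library(tidyr)

# Hypothetical reconstruction of the wide data (id, x_1, x_2, y_1, y_2)
wide <- data.frame(id = c(1, 2, 3), x_1 = c(3, 1, 4), x_2 = c(5, 8, 9),
                   y_1 = c(0, 1, 2), y_2 = c(2, 7, 9))

# Split each column name at "_" into a group ("x"/"y") and an observation number
long <- wide %>%
  pivot_longer(-id, names_to = c("group", "obs"), names_sep = "_",
               values_to = "value")

# Hypothetical reconstruction of the long data (id, group, value)
long2 <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                    group = rep(c("x", "y"), times = 3),
                    value = c(6L, 1L, 4L, 4L, 4L, 1L))

# One column per group, one row per id
wide2 <- long2 %>%
  pivot_wider(id_cols = id, names_from = group, values_from = value)
```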

Joins

In each of the following questions, write code to produce the desired output from the two input datasets. The code may involve additional wrangling steps, beyond a join.

df1
##   id  x
## 1  1  7
## 2  2  9
## 3  3 13
df2
##   id  y
## 1  1 10
## 2  2 12
## 3  4 14

Output:

##   id  x  y
## 1  1  7 10
## 2  2  9 12
## 3  3 13 NA

df1
##   id  x
## 1  1  7
## 2  2  9
## 3  3 13
df2
##   id  y
## 1  1 10
## 2  2 12
## 3  4 14

Output:

##   id x  y
## 1  1 7 10
## 2  2 9 12

df1
##   a_x a_y b_x b_y
## 1   1   2   2   3
df2
##   id z
## 1  a 4
## 2  b 5

Output:

## # A tibble: 2 × 4
##   id        x     y     z
##   <chr> <dbl> <dbl> <dbl>
## 1 a         1     2     4
## 2 b         2     3     5
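A sketch of how the three join questions above might be approached. The first two differ only in the type of join. The third needs a reshaping step before the join, which is the kind of "additional wrangling" the instructions mention. The names `df3` and `df4` are assumptions, used only so that the third pair of inputs does not overwrite the first.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(id = c(1, 2, 3), x = c(7, 9, 13))
df2 <- data.frame(id = c(1, 2, 4), y = c(10, 12, 14))

# Keep every row of df1; ids missing from df2 get NA for y
left <- left_join(df1, df2, by = "id")

# Keep only the ids that appear in both data frames
inner <- inner_join(df1, df2, by = "id")

# Third question: inputs renamed df3/df4 here (hypothetical names)
df3 <- data.frame(a_x = 1, a_y = 2, b_x = 2, b_y = 3)
df4 <- data.frame(id = c("a", "b"), z = c(4, 5))

# "a_x" becomes id = "a" and column "x"; then attach z by id
reshaped <- df3 %>%
  pivot_longer(everything(), names_to = c("id", ".value"), names_sep = "_") %>%
  left_join(df4, by = "id")
```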

Regular expressions

Consider the following strings:

## [1] "George Washington: February 22, 1732"
## [2] "Thomas Jefferson: April 13, 1743"    
## [3] "Abraham Lincoln: February 12, 1809"  
## [4] "Theodore Roosevelt: October 27, 1858"

For each question below, fill in the R code to produce the desired output.

str_extract(strings, ...)
## [1] "George Washington"  "Thomas Jefferson"   "Abraham Lincoln"   
## [4] "Theodore Roosevelt"

str_extract(strings, ...)
## [1] "February 22, 1732" "April 13, 1743"    "February 12, 1809"
## [4] "October 27, 1858"

str_extract(strings, ...)
## [1] "Washington" "Jefferson"  "Lincoln"    "Roosevelt"

str_extract(strings, ...)
## [1] "1732" "1743" "1809" "1858"
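The patterns below are one set of regular expressions that produce the four outputs above. Many other patterns would work equally well. The result names (`full_name`, `birthday`, and so on) are assumptions added for illustration.

```r
library(stringr)

strings <- c("George Washington: February 22, 1732",
             "Thomas Jefferson: April 13, 1743",
             "Abraham Lincoln: February 12, 1809",
             "Theodore Roosevelt: October 27, 1858")

# Everything from the start of the string up to (not including) the colon
full_name <- str_extract(strings, "^[^:]+")

# Everything after the colon and the space that follows it
birthday <- str_extract(strings, "(?<=: ).+")

# The run of word characters immediately before the colon
last_name <- str_extract(strings, "\\w+(?=:)")

# The first run of exactly four digits
birth_year <- str_extract(strings, "\\d{4}")
```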

More regular expressions

Consider the following strings:

strings
##  [1] "apple"           "banana"          "canteloupe"      "durian"         
##  [5] "eggplant"        "french fries"    "goat cheese"     "pizza"          
##  [9] "99 red balloons" "101 dalmatians"  "route 66"

For each question below, fill in the R code to produce the desired output.

str_subset(strings, ...)
## [1] "99 red balloons" "101 dalmatians"  "route 66"

str_subset(strings, ...)
## [1] "99 red balloons" "101 dalmatians"

str_subset(strings, ...)
## [1] "apple"           "banana"          "canteloupe"      "durian"         
## [5] "eggplant"        "goat cheese"     "pizza"           "99 red balloons"
## [9] "101 dalmatians"

str_subset(strings, ...)
## [1] "french fries"    "goat cheese"     "99 red balloons" "101 dalmatians" 
## [5] "route 66"

str_subset(strings, ...)
## [1] "apple"           "eggplant"        "goat cheese"     "pizza"          
## [5] "99 red balloons" "route 66"
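For the five `str_subset()` questions above, the first step is to spot what the matched strings have in common. One reading, offered as a sketch rather than the only answer: contains a digit, starts with a digit, contains the letter "a", contains a space, and contains a doubled character. The result names are assumptions added for illustration.

```r
library(stringr)

strings <- c("apple", "banana", "canteloupe", "durian", "eggplant",
             "french fries", "goat cheese", "pizza", "99 red balloons",
             "101 dalmatians", "route 66")

has_digit <- str_subset(strings, "\\d")       # a digit anywhere
starts_digit <- str_subset(strings, "^\\d")   # a digit at the start
has_a <- str_subset(strings, "a")             # the letter "a" anywhere
has_space <- str_subset(strings, " ")         # at least one space
has_double <- str_subset(strings, "(.)\\1")   # any character repeated twice in a row
```

The last pattern uses a backreference: `(.)` captures one character and `\\1` requires the same character to follow immediately.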

Reading code

For each of the following questions, either write the output of the code or explain why it gives an error. (Some of the code will run correctly; some will cause errors.)

x <- list()
for(i in 1:10){
  x[i] <- i
}
x[2] + 1

nsim <- 1000 # number of games
results <- rep(NA, nsim)

for(i in 1:nsim){
  # each game starts with the marker in the middle
  marker <- 0
  
  while(abs(marker) < 0){
    robotA <- runif(1, 0, 0.5)
    robotB <- runif(1, 0, 0.5)
    marker <- marker + robotA - robotB
  }
  
  # check whether robot A wins
  results[i] <- marker >= 0.5
}

# fraction of the time that robot A wins
mean(results)

mat <- matrix(0, nrow=5, ncol=3)
for(i in 1:5){
  for(j in 1:3){
    mat <- i + j
  }
}

mat

mat <- matrix(0, nrow=5, ncol=3)
for(i in 1:5){
  for(j in 1:3){
    mat[j, i] <- i + j
  }
}

mat

mat <- matrix(1, 3, 3)
for(i in 2:3){
  for(j in 2:3){
    mat[i,j] <- mat[i-1, j-1] + mat[i, j-1]
  }
}

mat

f1 <- function(x = 1){
  return(x + 1)
}
g1 <- function(x){
  return(f1() + x)
}

f1(g1(3))

f1 <- function(n, groups){
  x <- matrix(1, nrow=n, ncol=n)
  unique_groups = unique(groups)
  means <- matrix(nrow = length(unique_groups), ncol = n)
  for(i in 1:length(unique_groups)){
    means[i,] <- colMeans(x[groups == unique_groups[i],])
  }
  
  return(means)
}

f1(5, groups = c(1, 1, 2, 2, 2))

Improving code efficiency

  1. Re-write the following code to run as efficiently as you can.
x <- c()
for(i in 1:100){
  x <- c(x, runif(1))
}
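The loop above grows `x` by one element per iteration, which copies the vector each time. A minimal sketch of the vectorized alternative follows. The seed is an assumption, included only so the result is reproducible.

```r
set.seed(1)      # assumption: a seed, only for reproducibility
x <- runif(100)  # draw all 100 uniform values in a single call
```

With the same seed, this draws the same 100 values as the loop, because `runif()` takes its values from the random number stream in order either way.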

Probability simulation

You have 40 cards in 4 different colors, and the cards of each color are numbered 1–10. Two cards are picked at random (without replacement). What is the probability that the two cards chosen have different numbers and different colors?

  1. Write a simulation to estimate the probability.
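A minimal sketch of one such simulation. The number of replications (`nsim = 10000`), the seed, and the object names are all assumptions. As a check on the estimate, a direct count gives 27/39 ≈ 0.692: after the first card is drawn, 27 of the remaining 39 cards differ from it in both color and number.

```r
set.seed(1)     # assumption: a seed, only for reproducibility
nsim <- 10000   # assumption: number of simulated draws

# One row per card: every combination of number (1-10) and color (1-4)
deck <- expand.grid(number = 1:10, color = 1:4)

different <- replicate(nsim, {
  picked <- deck[sample(nrow(deck), 2), ]   # two cards, without replacement
  picked$number[1] != picked$number[2] && picked$color[1] != picked$color[2]
})

# Estimated probability that the cards differ in both number and color
estimate <- mean(different)
```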

Another probability simulation

An election is held between two candidates. Candidate A wins the election with \(p\) votes, while candidate B loses with \(q < p\) votes. Given these final vote counts, what is the probability that, when the votes are tallied one at a time in a random order, candidate A has more votes than candidate B throughout the count? (So candidate A has more votes after 1 vote has been counted, after 2 votes have been counted, etc.)

  1. Below is code that attempts to estimate this probability, with \(p = 20\) and \(q = 10\). It is very wrong. Explain why.
p <- 20
q <- 10
nsim <- 1000
votes <- rep(c(0, 1), times = c(q, p))
results <- rep(NA, nsim)

for(i in 1:nsim){
  shuffled_votes <- sample(votes, p+q, replace=F)
  results[i] <- sum(shuffled_votes) > sum(1 - shuffled_votes)
}

mean(results)
  1. Write code that correctly estimates the probability, given \(p\) and \(q\).
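One way to estimate this probability is to track candidate A's running lead after every counted vote, rather than comparing only the final totals. The sketch below assumes the hypothetical function name `ballot_sim` and a default of 10000 simulated counts. For \(p = 20\) and \(q = 10\), the known answer to this classic "ballot problem" is \((p - q)/(p + q) = 1/3\), which gives a check on the estimate.

```r
# Hypothetical helper: estimate P(A strictly ahead throughout the count)
ballot_sim <- function(p, q, nsim = 10000) {
  votes <- rep(c(1, -1), times = c(p, q))   # +1 is a vote for A, -1 for B
  ahead_throughout <- replicate(nsim, {
    lead <- cumsum(sample(votes))           # A's lead after each counted vote
    all(lead > 0)                           # strictly ahead at every step
  })
  mean(ahead_throughout)
}

set.seed(1)     # assumption: a seed, only for reproducibility
estimate <- ballot_sim(p = 20, q = 10)
```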

Statistical simulation

In homework, you learned about the k-means clustering algorithm. Given a dataset and a number of clusters \(k\), the k-means algorithm divides the data into \(k\) groups.

Suppose that you have data \(X_1, \ldots, X_n\). Each observation \(X_i\) comes from one of three groups, with means \(\mu_1, \mu_2, \mu_3\). Let \(G_i\) be the group for \(X_i\), and suppose that \(X_i \sim N(\mu_{G_i}, 1)\).

We observe \(X_1, \ldots, X_n\), but we don’t get to see the actual groups \(G_1, \ldots, G_n\). Instead, we are going to try to estimate the group assignments using the k-means algorithm, with \(k=3\). We want to estimate the probability that we correctly assign groups for all observations \(X_i\), and explore how that probability changes with \(\mu_1\), \(\mu_2\), and \(\mu_3\).

  1. Design a simulation study to answer this question. You do not need to write code, but you must describe each of the ADEMP steps (aims, data-generating mechanisms, estimands, methods, and performance measures) in enough detail that I could implement your simulation study.
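Code is not required for this question, but a sketch can help when checking that a design is specified in enough detail. The version below makes several assumptions that a full answer would need to state and justify: equal-sized groups, \(n = 30\), 200 simulated datasets per setting, and `nstart = 25` random starts for `kmeans()`. The function name `all_correct_prob` is hypothetical.

```r
# Hypothetical helper: estimate P(k-means recovers every group assignment)
all_correct_prob <- function(mu, n = 30, nsim = 200) {
  groups <- rep(1:3, each = n / 3)             # true labels G_i (equal sizes assumed)
  correct <- replicate(nsim, {
    x <- rnorm(n, mean = mu[groups], sd = 1)   # X_i ~ N(mu_{G_i}, 1)
    fit <- kmeans(x, centers = 3, nstart = 25)
    # Cluster labels are arbitrary, so compare partitions rather than labels:
    # each true group must map to exactly one cluster, and vice versa.
    tab <- table(groups, fit$cluster)
    all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
  })
  mean(correct)
}

set.seed(1)     # assumption: a seed, only for reproducibility
well_separated <- all_correct_prob(c(0, 20, 40))
overlapping <- all_correct_prob(c(0, 0.5, 1))
```

Comparing partitions instead of raw labels matters because k-means numbers its clusters arbitrarily. A clustering that recovers every group but calls group 1 "cluster 3" should still count as correct.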