Below are questions to help you study for the final exam. These are examples of the kinds of questions I might ask.
In each of the questions below, write code (in R or Python) to produce the output from the original data.
Original data:
## # A tibble: 10 × 4
##    species island    bill_length_mm bill_depth_mm
##    <fct>   <fct>              <dbl>         <dbl>
##  1 Gentoo  Biscoe              54.3          15.7
##  2 Gentoo  Biscoe              45.2          14.8
##  3 Adelie  Biscoe              37.7          18.7
##  4 Adelie  Torgersen           39.3          20.6
##  5 Adelie  Dream               36            18.5
##  6 Adelie  Dream               40.2          17.1
##  7 Adelie  Torgersen           38.5          17.9
##  8 Gentoo  Biscoe              43.8          13.9
##  9 Gentoo  Biscoe              51.5          16.3
## 10 Adelie  Dream               41.1          19
Output:
## # A tibble: 4 × 3
##   species island        n
##   <fct>   <fct>     <int>
## 1 Adelie  Biscoe        1
## 2 Adelie  Dream         3
## 3 Adelie  Torgersen     2
## 4 Gentoo  Biscoe        4
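One possible approach to the counting question above, sketched in plain Python rather than R (the guide allows either). The `rows` list is a hand-typed copy of the species and island columns, not a real data import; in R, `dplyr::count(species, island)` would be the usual route.

```python
from collections import Counter

# (species, island) pairs copied by hand from the ten rows of the original data
rows = [
    ("Gentoo", "Biscoe"), ("Gentoo", "Biscoe"), ("Adelie", "Biscoe"),
    ("Adelie", "Torgersen"), ("Adelie", "Dream"), ("Adelie", "Dream"),
    ("Adelie", "Torgersen"), ("Gentoo", "Biscoe"), ("Gentoo", "Biscoe"),
    ("Adelie", "Dream"),
]

# Count rows per (species, island) combination, then sort for display
counts = sorted(Counter(rows).items())

for (species, island), n in counts:
    print(species, island, n)
```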
Original data:
## # A tibble: 10 × 4
##    species island    bill_length_mm bill_depth_mm
##    <fct>   <fct>              <dbl>         <dbl>
##  1 Gentoo  Biscoe              54.3          15.7
##  2 Gentoo  Biscoe              45.2          14.8
##  3 Adelie  Biscoe              37.7          18.7
##  4 Adelie  Torgersen           39.3          20.6
##  5 Adelie  Dream               36            18.5
##  6 Adelie  Dream               40.2          17.1
##  7 Adelie  Torgersen           38.5          17.9
##  8 Gentoo  Biscoe              43.8          13.9
##  9 Gentoo  Biscoe              51.5          16.3
## 10 Adelie  Dream               41.1          19
Output:
## # A tibble: 4 × 3
## # Groups:   island [3]
##   island    species mean_length
##   <fct>     <fct>         <dbl>
## 1 Biscoe    Adelie         37.7
## 2 Biscoe    Gentoo         48.7
## 3 Dream     Adelie         39.1
## 4 Torgersen Adelie         38.9
Original data:
## # A tibble: 10 × 4
##    species island    bill_length_mm bill_depth_mm
##    <fct>   <fct>              <dbl>         <dbl>
##  1 Gentoo  Biscoe              54.3          15.7
##  2 Gentoo  Biscoe              45.2          14.8
##  3 Adelie  Biscoe              37.7          18.7
##  4 Adelie  Torgersen           39.3          20.6
##  5 Adelie  Dream               36            18.5
##  6 Adelie  Dream               40.2          17.1
##  7 Adelie  Torgersen           38.5          17.9
##  8 Gentoo  Biscoe              43.8          13.9
##  9 Gentoo  Biscoe              51.5          16.3
## 10 Adelie  Dream               41.1          19
Output:
## # A tibble: 10 × 5
##    species island    bill_length_mm bill_depth_mm bill_ratio
##    <fct>   <fct>              <dbl>         <dbl>      <dbl>
##  1 Gentoo  Biscoe              54.3          15.7       3.46
##  2 Gentoo  Biscoe              45.2          14.8       3.05
##  3 Adelie  Biscoe              37.7          18.7       2.02
##  4 Adelie  Torgersen           39.3          20.6       1.91
##  5 Adelie  Dream               36            18.5       1.95
##  6 Adelie  Dream               40.2          17.1       2.35
##  7 Adelie  Torgersen           38.5          17.9       2.15
##  8 Gentoo  Biscoe              43.8          13.9       3.15
##  9 Gentoo  Biscoe              51.5          16.3       3.16
## 10 Adelie  Dream               41.1          19         2.16
Original data:
## # A tibble: 10 × 4
##    species island    bill_length_mm bill_depth_mm
##    <fct>   <fct>              <dbl>         <dbl>
##  1 Gentoo  Biscoe              54.3          15.7
##  2 Gentoo  Biscoe              45.2          14.8
##  3 Adelie  Biscoe              37.7          18.7
##  4 Adelie  Torgersen           39.3          20.6
##  5 Adelie  Dream               36            18.5
##  6 Adelie  Dream               40.2          17.1
##  7 Adelie  Torgersen           38.5          17.9
##  8 Gentoo  Biscoe              43.8          13.9
##  9 Gentoo  Biscoe              51.5          16.3
## 10 Adelie  Dream               41.1          19
Output:
## # A tibble: 3 × 4
##   species island bill_length_mm bill_depth_mm
##   <fct>   <fct>           <dbl>         <dbl>
## 1 Adelie  Dream            36            18.5
## 2 Adelie  Dream            40.2          17.1
## 3 Adelie  Dream            41.1          19
Original data:
##   x1 x2 x3 y1 y2 y3
## 1  1  a  3  d  2  7
## 2  2  b  1  e  7  1
## 3  3  c  4  f  9  2
Output:
##   x1 x3 y2 y3
## 1  1  3  2  7
## 2  2  1  7  1
## 3  3  4  9  2
Original data:
##   x1 x2 x3 y1 y2 y3
## 1  1  a  3  d  2  7
## 2  2  b  1  e  7  1
## 3  3  c  4  f  9  2
Output:
##   x1 x2 x3
## 1  1  a  3
## 2  2  b  1
## 3  3  c  4
Original data:
##   x1 x2 x3 y1 y2 y3
## 1  1  a  3  d  2  7
## 2  2  b  1  e  7  1
## 3  3  c  4  f  9  2
Output:
##   mean_x1  mean_x3
## 1       2 2.666667
Original data:
##   id x_1 x_2 y_1 y_2
## 1  1   3   5   0   2
## 2  2   1   8   1   7
## 3  3   4   9   2   9
Output:
## # A tibble: 12 × 4
##       id group obs   value
##    <dbl> <chr> <chr> <dbl>
##  1     1 x     1         3
##  2     1 x     2         5
##  3     1 y     1         0
##  4     1 y     2         2
##  5     2 x     1         1
##  6     2 x     2         8
##  7     2 y     1         1
##  8     2 y     2         7
##  9     3 x     1         4
## 10     3 x     2         9
## 11     3 y     1         2
## 12     3 y     2         9
Original data:
##   id group value
## 1  1     x     6
## 2  1     y     1
## 3  2     x     4
## 4  2     y     4
## 5  3     x     4
## 6  3     y     1
Output:
## # A tibble: 3 × 3
##      id     x     y
##   <dbl> <int> <int>
## 1     1     6     1
## 2     2     4     4
## 3     3     4     1
In each of the following questions, write code to produce the desired output from the two input datasets. The code may involve additional wrangling steps beyond a join.
Original data:
##   id  x
## 1  1  7
## 2  2  9
## 3  3 13
##   id  y
## 1  1 10
## 2  2 12
## 3  4 14
Output:
##   id  x  y
## 1  1  7 10
## 2  2  9 12
## 3  3 13 NA
Original data:
##   id  x
## 1  1  7
## 2  2  9
## 3  3 13
##   id  y
## 1  1 10
## 2  2 12
## 3  4 14
Output:
##   id  x  y
## 1  1  7 10
## 2  2  9 12
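The two join questions above differ only in which ids survive: the first keeps every id from the first table, the second keeps only ids found in both. A minimal sketch in plain Python, with the tables typed in by hand as dictionaries (an assumption made for brevity); in R these would correspond to `dplyr::left_join()` and `dplyr::inner_join()`.

```python
# Hand-typed copies of the two input tables, keyed by id
x_by_id = {1: 7, 2: 9, 3: 13}
y_by_id = {1: 10, 2: 12, 4: 14}

# Left join: keep every id from the first table; a missing y becomes None
left = [(i, x, y_by_id.get(i)) for i, x in x_by_id.items()]

# Inner join: keep only ids that appear in both tables
inner = [(i, x, y_by_id[i]) for i, x in x_by_id.items() if i in y_by_id]

print(left)
print(inner)
```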
Original data:
##   a_x a_y b_x b_y
## 1   1   2   2   3
##   id z
## 1  a 4
## 2  b 5
Output:
## # A tibble: 2 × 4
##   id        x     y     z
##   <chr> <dbl> <dbl> <dbl>
## 1 a         1     2     4
## 2 b         2     3     5
Consider the following strings:
## [1] "George Washington: February 22, 1732"
## [2] "Thomas Jefferson: April 13, 1743"
## [3] "Abraham Lincoln: February 12, 1809"
## [4] "Theodore Roosevelt: October 27, 1858"
For each question below, fill in the R code to produce the desired output.
## [1] "George Washington" "Thomas Jefferson" "Abraham Lincoln"
## [4] "Theodore Roosevelt"
## [1] "February 22, 1732" "April 13, 1743" "February 12, 1809"
## [4] "October 27, 1858"
## [1] "Washington" "Jefferson" "Lincoln" "Roosevelt"
## [1] "1732" "1743" "1809" "1858"
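As a hedged illustration of the four extractions above, here is one set of regular expressions that reproduces them, written with Python's `re` module. The helper `extract` is hypothetical, and other patterns would work equally well. The same patterns should carry over to `stringr::str_extract()`, with backslashes doubled inside R string literals.

```python
import re

presidents = [
    "George Washington: February 22, 1732",
    "Thomas Jefferson: April 13, 1743",
    "Abraham Lincoln: February 12, 1809",
    "Theodore Roosevelt: October 27, 1858",
]

def extract(pattern, strings):
    """Return the first match of `pattern` in each string (hypothetical helper)."""
    return [re.search(pattern, s).group() for s in strings]

names = extract(r"^[^:]+", presidents)         # everything before the colon
dates = extract(r"(?<=: ).+", presidents)      # everything after ": "
last_names = extract(r"\w+(?=:)", presidents)  # the word just before the colon
years = extract(r"\d{4}", presidents)          # the first run of four digits

print(names, dates, last_names, years, sep="\n")
```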
Consider the following strings:
## [1] "apple" "banana" "canteloupe" "durian"
## [5] "eggplant" "french fries" "goat cheese" "pizza"
## [9] "99 red balloons" "101 dalmatians" "route 66"
For each question below, fill in the R code to produce the desired output.
## [1] "99 red balloons" "101 dalmatians" "route 66"
## [1] "99 red balloons" "101 dalmatians"
## [1] "apple" "banana" "canteloupe" "durian"
## [5] "eggplant" "goat cheese" "pizza" "99 red balloons"
## [9] "101 dalmatians"
## [1] "french fries" "goat cheese" "99 red balloons" "101 dalmatians"
## [5] "route 66"
## [1] "apple" "eggplant" "goat cheese" "pizza"
## [5] "99 red balloons" "route 66"
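Each of the five outputs above is a subset of the original vector, so each can be read as "keep the strings that match some pattern". The patterns below are one guess that is consistent with the outputs shown (contains a digit, starts with a digit, contains the letter a, contains a space, contains a doubled character); `subset` is a hypothetical helper playing the role of `stringr::str_subset()`.

```python
import re

strings = [
    "apple", "banana", "canteloupe", "durian", "eggplant", "french fries",
    "goat cheese", "pizza", "99 red balloons", "101 dalmatians", "route 66",
]

def subset(pattern, items):
    """Keep the items in which `pattern` matches somewhere (hypothetical helper)."""
    return [s for s in items if re.search(pattern, s)]

has_digit = subset(r"\d", strings)           # any digit anywhere
starts_with_digit = subset(r"^\d", strings)  # digit in the first position
has_a = subset(r"a", strings)                # the letter a anywhere
has_space = subset(r" ", strings)            # more than one word
has_repeat = subset(r"(.)\1", strings)       # same character twice in a row

for result in (has_digit, starts_with_digit, has_a, has_space, has_repeat):
    print(result)
```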
For each of the following questions, either write the output of the code or explain why it gives an error. (Some will run correctly; others will cause errors.)
nsim <- 1000 # number of games
results <- rep(NA, nsim)
for(i in 1:nsim){
  # each game starts with the marker in the middle
  marker <- 0
  while(abs(marker) < 0){
    robotA <- runif(1, 0, 0.5)
    robotB <- runif(1, 0, 0.5)
    marker <- marker + robotA - robotB
  }
  # check whether robot A wins
  results[i] <- marker >= 0.5
}
# fraction of the time that robot A wins
mean(results)

mat <- matrix(1, 3, 3)
for(i in 2:3){
  for(j in 2:3){
    mat[i,j] <- mat[i-1, j-1] + mat[i, j-1]
  }
}
mat
You have 40 cards, with 4 different colors. Cards for each color are numbered 1–10. Two cards are picked at random (without replacement). What is the probability that the two cards chosen have different numbers and different colors?
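A quick way to check an answer to the card question is to enumerate every pair of cards. The counting argument is that, once the first card is drawn, 30 of the remaining 39 cards have a different color, and 3 of those 30 share its number, leaving 27 of 39. The sketch below confirms this by brute force; the color names are placeholders, since the question does not name them.

```python
from fractions import Fraction
from itertools import combinations

# Build the deck: 4 colors x numbers 1..10 (color names are placeholders)
colors = ["red", "blue", "green", "yellow"]
deck = [(color, number) for color in colors for number in range(1, 11)]

# Every unordered pair of distinct cards is equally likely
pairs = list(combinations(deck, 2))
favorable = sum(
    1 for (c1, n1), (c2, n2) in pairs if c1 != c2 and n1 != n2
)

probability = Fraction(favorable, len(pairs))
print(probability, float(probability))  # 9/13, roughly 0.692
```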
An election is held between two candidates. Candidate A wins the election with \(p\) votes, while candidate B loses with \(q < p\) votes. Given these final vote counts, what is the probability that, as the votes are tallied, candidate A has more votes than candidate B throughout the count? (That is, candidate A has more votes after 1 vote has been counted, after 2 votes have been counted, and so on.)
p <- 20
q <- 10
nsim <- 1000
votes <- rep(c(0, 1), times = c(q, p))
results <- rep(NA, nsim)
for(i in 1:nsim){
  shuffled_votes <- sample(votes, p + q, replace = FALSE)
  results[i] <- sum(shuffled_votes) > sum(1 - shuffled_votes)
}
mean(results)
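The phrase "throughout the count" calls for checking the running tally after each vote, not only the final totals. The Python sketch below is one way to do that, assuming every counting order is equally likely; the function names `leads_throughout`, `simulate`, and `exact` are made up for this illustration. The exact count should agree with the classical ballot-theorem value \((p - q)/(p + q)\).

```python
import random
from fractions import Fraction
from math import comb

def leads_throughout(order):
    """True if A (coded 1) is strictly ahead of B (coded 0) after every vote."""
    a_votes = b_votes = 0
    for vote in order:
        if vote == 1:
            a_votes += 1
        else:
            b_votes += 1
        if a_votes <= b_votes:
            return False
    return True

def simulate(p, q, nsim, seed=0):
    """Monte Carlo estimate over random counting orders (the seed is arbitrary)."""
    rng = random.Random(seed)
    votes = [1] * p + [0] * q
    hits = 0
    for _ in range(nsim):
        rng.shuffle(votes)
        hits += leads_throughout(votes)
    return hits / nsim

def exact(p, q):
    """Exact probability, by counting the orderings in which A always leads."""
    # ways[a][b]: orderings of a A-votes and b B-votes with A ahead throughout
    ways = [[0] * (q + 1) for _ in range(p + 1)]
    ways[0][0] = 1  # the empty tally
    for a in range(1, p + 1):
        for b in range(min(a, q + 1)):  # only states with b < a are allowed
            ways[a][b] = ways[a - 1][b] + (ways[a][b - 1] if b > 0 else 0)
    return Fraction(ways[p][q], comb(p + q, p))

print(exact(20, 10), simulate(20, 10, nsim=4000))
```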
In homework, you learned about the k-means clustering algorithm. Given a dataset and a number of clusters \(k\), the k-means algorithm divides the data into \(k\) groups.
Suppose that you have data \(X_1,...,X_n\). Each observation \(X_i\) comes from one of three groups, with means \(\mu_1, \mu_2, \mu_3\). Let \(G_i\) be the group for \(X_i\), and suppose that \(X_i \sim N(\mu_{G_i}, 1)\).
We observe \(X_1,...,X_n\), but we don’t get to see the actual groups \(G_1,...,G_n\). Instead, we are going to try to estimate the group assignments using the k-means algorithm, with \(k=3\). We want to estimate the probability that we correctly assign groups for all observations \(X_i\), and explore how that probability changes with \(\mu_1, \mu_2\), and \(\mu_3\).
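A rough sketch of how such a simulation could be organized, in plain Python. Several choices here are assumptions, not part of the problem statement: equal group sizes, a deterministic initialization at evenly spaced order statistics (R's `kmeans()` uses random starts instead), and counting an assignment as correct when clusters, ordered by center, match groups ordered by mean. The function names `kmeans_1d` and `prob_all_correct` are hypothetical.

```python
import random

def kmeans_1d(xs, k, n_iter=100):
    """Lloyd's algorithm in one dimension (k >= 2); returns a cluster index per point.

    Centers start at evenly spaced order statistics (an assumed initialization).
    Returned labels are numbered by increasing cluster center.
    """
    ordered = sorted(xs)
    centers = [ordered[round(j * (len(xs) - 1) / (k - 1))] for j in range(k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = sum(members) / len(members)
    rank = {j: r for r, j in enumerate(sorted(range(k), key=lambda j: centers[j]))}
    return [rank[lab] for lab in labels]

def prob_all_correct(mus, n_per_group, nsim, seed=0):
    """Fraction of simulated datasets in which every point is grouped correctly."""
    rng = random.Random(seed)
    # True groups are numbered by increasing mean, matching kmeans_1d's labels
    order = sorted(range(len(mus)), key=lambda g: mus[g])
    hits = 0
    for _ in range(nsim):
        truth, xs = [], []
        for rank_g, g in enumerate(order):
            truth.extend([rank_g] * n_per_group)
            xs.extend(rng.gauss(mus[g], 1) for _ in range(n_per_group))
        hits += kmeans_1d(xs, len(mus)) == truth
    return hits / nsim

# Compare well-separated means with heavily overlapping ones
print(prob_all_correct([0, 20, 40], n_per_group=10, nsim=20))
print(prob_all_correct([0, 0.5, 1], n_per_group=10, nsim=20))
```

Calling `prob_all_correct` over a grid of mean spacings would show how the probability changes as the groups move apart.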