Class Activity

Instructions: Work with a neighbor to answer the following questions. To get started, download the class activity template file.

Data

For this class activity, we will create our own small toy dataset, so that we can explore data wrangling functions with a dataset that is easy to view and work with.

Run the following code in R to create the toy dataset:

library(tidyverse)

example_df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c("a", "b", "c"),
  x3 = c(5, 1, 2),
  y1 = c(0, 9, 2),
  y2 = c(2, 7, 9),
  z = c(0, 0, 0)
)

example_df

##   x1 x2 x3 y1 y2 z
## 1  1  a  5  0  2 0
## 2  2  b  1  9  7 0
## 3  3  c  2  2  9 0

Efficiently selecting multiple columns

Suppose we want to select only the columns y1 and y2 in this toy dataset. A simple way to code this is:

example_df |>
  select(y1, y2)

##   y1 y2
## 1  0  2
## 2  9  7
## 3  2  9

However, with many columns, it can be tedious and error-prone to list them all out by hand! Instead, we can select columns that meet certain criteria.

For example, we can use the starts_with function to select all columns whose names begin with y:

example_df |>
  select(starts_with("y"))

##   y1 y2
## 1  0  2
## 2  9  7
## 3  2  9
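starts_with is one of several name-matching helpers. As a small sketch on the same toy dataset, ends_with and contains match the end of a column name and a literal string anywhere in the name, respectively. The object names ends_with_1 and contains_2 are only for illustration.

```r
library(dplyr)

# Same toy dataset as above
example_df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c("a", "b", "c"),
  x3 = c(5, 1, 2),
  y1 = c(0, 9, 2),
  y2 = c(2, 7, 9),
  z = c(0, 0, 0)
)

# ends_with() matches on the end of the column name
ends_with_1 <- example_df |>
  select(ends_with("1"))

# contains() matches a literal string anywhere in the column name
contains_2 <- example_df |>
  select(contains("2"))

names(ends_with_1)  # "x1" "y1"
names(contains_2)   # "x2" "y2"
```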

Modify the code above to select the columns x1, x2, and x3, without listing them explicitly.

What if we want only the columns that contain character data (e.g. "a", "b")? The where function selects the columns for which a given condition is true:

example_df |>
  select(where(is.character))

##   x2
## 1  a
## 2  b
## 3  c
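The condition passed to where does not have to be a built-in function like is.character. As a minimal sketch, any function that takes a column and returns a single TRUE or FALSE should work. Here a hypothetical anonymous function keeps only the columns that hold a single distinct value (the name constant_cols is just for illustration):

```r
library(dplyr)

# Same toy dataset as above
example_df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c("a", "b", "c"),
  x3 = c(5, 1, 2),
  y1 = c(0, 9, 2),
  y2 = c(2, 7, 9),
  z = c(0, 0, 0)
)

# Keep the columns where every row has the same value.
# n_distinct() counts the unique values in a column.
constant_cols <- example_df |>
  select(where(\(col) n_distinct(col) == 1))

names(constant_cols)  # "z"
```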

Modify the code above to select only the numeric columns (hint: use is.numeric).
Now select only the numeric columns whose names start with x (hint: use &).

Applying functions to multiple columns

Suppose we want the mean of both y1 and y2. One option, of course, is to list them explicitly in summarize:

example_df |>
  summarize(y1_mean = mean(y1),
            y2_mean = mean(y2))

##    y1_mean y2_mean
## 1 3.666667       6

However, as before, this gets tedious with many columns! Instead, we can use across to apply a function to each of several columns at once:

example_df |>
  summarize(across(starts_with("y"), mean))

##         y1 y2
## 1 3.666667  6
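If the function needs extra arguments, one option is to wrap it in an anonymous function. The sketch below assumes a hypothetical copy of the toy dataset, example_na, with one missing value added to y1, and passes na.rm = TRUE to mean so the missing value is ignored:

```r
library(dplyr)

# Same toy dataset as above
example_df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c("a", "b", "c"),
  x3 = c(5, 1, 2),
  y1 = c(0, 9, 2),
  y2 = c(2, 7, 9),
  z = c(0, 0, 0)
)

# Hypothetical variant for illustration: make the second y1 value missing
example_na <- example_df
example_na$y1[2] <- NA

# The anonymous function \(col) is applied to each selected column in turn
y_means <- example_na |>
  summarize(across(starts_with("y"), \(col) mean(col, na.rm = TRUE)))

y_means
# y1 is the mean of 0 and 2 (the NA is dropped); y2 is unchanged at 6
```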

What if we want to apply multiple functions? We specify all the functions we want in a list:

example_df |>
  summarize(across(starts_with("y"), list(mean, sd)))

##       y1_1     y1_2 y2_1     y2_2
## 1 3.666667 4.725816    6 3.605551

However, the names here aren’t very useful: the suffixes _1 and _2 only give each function's position in the list! We can fix this by naming the elements of our list:

example_df |>
  summarize(across(starts_with("y"), list("mean" = mean, "sd" = sd)))

##    y1_mean    y1_sd y2_mean    y2_sd
## 1 3.666667 4.725816       6 3.605551
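For more control over the output names, across also accepts a .names argument, a template in which {.col} stands for the column name and {.fn} for the function name. As a small sketch (the object name renamed is just for illustration), this puts the function name first:

```r
library(dplyr)

# Same toy dataset as above
example_df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c("a", "b", "c"),
  x3 = c(5, 1, 2),
  y1 = c(0, 9, 2),
  y2 = c(2, 7, 9),
  z = c(0, 0, 0)
)

# .names controls how output columns are labelled:
# {.fn} is the name given in the list, {.col} is the original column name
renamed <- example_df |>
  summarize(across(starts_with("y"),
                   list(mean = mean, sd = sd),
                   .names = "{.fn}_{.col}"))

names(renamed)  # "mean_y1" "sd_y1" "mean_y2" "sd_y2"
```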

Modify the code above to calculate the median and IQR for the numeric columns which start with x. Do not list the columns explicitly.