Instructions: Work with a neighbor to answer the following questions. To get started, download the class activity template file.
For this class activity, we will create our own small toy dataset, so that we can explore data wrangling functions with a dataset that is easy to view and work with.
Run the following code in R to create the toy dataset:
library(tidyverse)
example_df <- data.frame(
x1 = c(1, 2, 3),
x2 = c("a", "b", "c"),
x3 = c(5, 1, 2),
y1 = c(0, 9, 2),
y2 = c(2, 7, 9),
z = c(0, 0, 0)
)
example_df
##   x1 x2 x3 y1 y2 z
## 1  1  a  5  0  2 0
## 2  2  b  1  9  7 0
## 3  3  c  2  2  9 0
Suppose we want to select only the columns y1 and y2 in this toy dataset. A simple way to code this is:
##   y1 y2
## 1  0  2
## 2  9  7
## 3  2  9
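A minimal sketch of a call that produces the output above, assuming the select function from dplyr and the %>% pipe loaded with tidyverse. The name y_cols is only illustrative.

```r
library(tidyverse)

# Recreate the toy dataset so this block runs on its own
example_df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c("a", "b", "c"),
  x3 = c(5, 1, 2),
  y1 = c(0, 9, 2),
  y2 = c(2, 7, 9),
  z = c(0, 0, 0)
)

# List the wanted columns by name inside select()
y_cols <- example_df %>%
  select(y1, y2)

y_cols
```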
However, with multiple columns, it can be difficult or tedious to list them all out by hand! Instead, we can select columns that meet certain criteria.
For example, we can use the starts_with function to select all columns which begin with y:
##   y1 y2
## 1  0  2
## 2  9  7
## 3  2  9
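One way to write this selection, sketched under the assumption that starts_with is used inside select and that the %>% pipe from tidyverse is available. The name y_cols is only illustrative.

```r
library(tidyverse)

# Recreate the toy dataset so this block runs on its own
example_df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c("a", "b", "c"),
  x3 = c(5, 1, 2),
  y1 = c(0, 9, 2),
  y2 = c(2, 7, 9),
  z = c(0, 0, 0)
)

# starts_with("y") matches every column whose name begins with "y"
y_cols <- example_df %>%
  select(starts_with("y"))

y_cols
```

The result matches the explicit select(y1, y2) version, but the call does not need to change if more columns beginning with y are added later.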
Now select the columns x1, x2, and x3, without listing them explicitly.
What if we want only the columns which contain characters (e.g. "a", "b", etc.)? The where function returns the columns where a given condition is true:
##   x2
## 1  a
## 2  b
## 3  c
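A sketch of a call that gives the output above, assuming where is combined with base R's is.character inside select. This relies on x2 being stored as a character vector, which is the data.frame default in R 4.0 and later. The name char_cols is only illustrative.

```r
library(tidyverse)

# Recreate the toy dataset so this block runs on its own
example_df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c("a", "b", "c"),
  x3 = c(5, 1, 2),
  y1 = c(0, 9, 2),
  y2 = c(2, 7, 9),
  z = c(0, 0, 0)
)

# where() applies is.character to each column and keeps those returning TRUE
char_cols <- example_df %>%
  select(where(is.character))

char_cols
```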
Modify the code above to select only the numeric columns (hint: use is.numeric).
Now select only the numeric columns which start with x (hint: use &).
Suppose I want the mean of both y1 and y2. One option, of course, is to list them explicitly in summarize:
##    y1_mean y2_mean
## 1 3.666667       6
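A minimal sketch of the explicit version, assuming the output column names y1_mean and y2_mean shown above are set directly in summarize. The name y_means is only illustrative.

```r
library(tidyverse)

# Recreate the toy dataset so this block runs on its own
example_df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c("a", "b", "c"),
  x3 = c(5, 1, 2),
  y1 = c(0, 9, 2),
  y2 = c(2, 7, 9),
  z = c(0, 0, 0)
)

# One name = expression pair per column being summarized
y_means <- example_df %>%
  summarize(y1_mean = mean(y1),
            y2_mean = mean(y2))

y_means
```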
However, as before, this gets tedious with multiple columns! Instead, we can use across to apply a function across multiple columns:
##         y1 y2
## 1 3.666667  6
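One way to get the output above with across, sketched under the assumption that the columns are given as c(y1, y2). A selection helper such as starts_with("y") would pick the same columns here. The name y_means is only illustrative.

```r
library(tidyverse)

# Recreate the toy dataset so this block runs on its own
example_df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c("a", "b", "c"),
  x3 = c(5, 1, 2),
  y1 = c(0, 9, 2),
  y2 = c(2, 7, 9),
  z = c(0, 0, 0)
)

# across(columns, function): apply mean to each selected column
y_means <- example_df %>%
  summarize(across(c(y1, y2), mean))

y_means
```

With a single function, across keeps the original column names, which is why the output columns are y1 and y2 rather than y1_mean and y2_mean.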
What if I want to apply multiple functions? I specify all the functions I want in a list:
##       y1_1     y1_2 y2_1     y2_2
## 1 3.666667 4.725816    6 3.605551
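A sketch of a call consistent with the output above, assuming the two functions are mean and sd passed to across as an unnamed list. The name y_stats is only illustrative.

```r
library(tidyverse)

# Recreate the toy dataset so this block runs on its own
example_df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c("a", "b", "c"),
  x3 = c(5, 1, 2),
  y1 = c(0, 9, 2),
  y2 = c(2, 7, 9),
  z = c(0, 0, 0)
)

# An unnamed list of functions: output columns are numbered by list position
y_stats <- example_df %>%
  summarize(across(c(y1, y2), list(mean, sd)))

y_stats
```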
However, the names here aren’t very useful! We can fix this by naming the elements of our list:
##    y1_mean    y1_sd y2_mean    y2_sd
## 1 3.666667 4.725816       6 3.605551
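A sketch of the named-list version, assuming the list elements are named mean and sd to match the column suffixes in the output above. The name y_stats is only illustrative.

```r
library(tidyverse)

# Recreate the toy dataset so this block runs on its own
example_df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c("a", "b", "c"),
  x3 = c(5, 1, 2),
  y1 = c(0, 9, 2),
  y2 = c(2, 7, 9),
  z = c(0, 0, 0)
)

# Naming the list elements gives output columns of the form column_name
y_stats <- example_df %>%
  summarize(across(c(y1, y2), list(mean = mean, sd = sd)))

y_stats
```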
Now compute the same summaries for the columns which start with x. Do not list the columns explicitly.