Instructions: Work with a neighbor to answer the following questions. To get started, download the class activity template file.
Run the following code to import data on GDP for each country (you may need to install the googlesheets4
package):
library(tidyverse)
library(googlesheets4)
gs4_deauth()
gdp_data <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")
Looking at the data, we can see that there is a column for the GDP in each year, and the first column contains country information:
## # A tibble: 6 × 53
## Income per person (f…¹ `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abkhazia NA NA NA NA NA NA NA NA
## 2 Afghanistan NA NA NA NA NA NA NA NA
## 3 Akrotiri and Dhekelia NA NA NA NA NA NA NA NA
## 4 Albania NA NA NA NA NA NA NA NA
## 5 Algeria 1280. 1085. 856. 1128. 1170. 1215. 1128. 1201.
## 6 American Samoa NA NA NA NA NA NA NA NA
## # ℹ abbreviated name: ¹`Income per person (fixed 2000 US$)`
## # ℹ 44 more variables: `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>,
## # `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>,
## # `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
## # `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
## # `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
## # `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, …
pivot_longer
function to reshape the GDP data so it looks like this:## # A tibble: 7,988 × 3
## country year gdp
## <chr> <chr> <dbl>
## 1 Albania 1980 1061.
## 2 Albania 1981 1100.
## 3 Albania 1982 1111.
## 4 Albania 1983 1101.
## 5 Albania 1984 1065.
## 6 Albania 1985 1060.
## 7 Albania 1986 1092.
## 8 Albania 1987 1054.
## 9 Albania 1988 1014.
## 10 Albania 1989 1092.
## # ℹ 7,978 more rows
country
year
and gdp
The tidyr
package contains a dataset called who2
which contains information on tuberculosis counts for each country in each year, broken down by method of diagnosis, sex, and age group:
## # A tibble: 7,240 × 58
## country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 1980 NA NA NA NA NA NA
## 2 Afghanistan 1981 NA NA NA NA NA NA
## 3 Afghanistan 1982 NA NA NA NA NA NA
## 4 Afghanistan 1983 NA NA NA NA NA NA
## 5 Afghanistan 1984 NA NA NA NA NA NA
## 6 Afghanistan 1985 NA NA NA NA NA NA
## 7 Afghanistan 1986 NA NA NA NA NA NA
## 8 Afghanistan 1987 NA NA NA NA NA NA
## 9 Afghanistan 1988 NA NA NA NA NA NA
## 10 Afghanistan 1989 NA NA NA NA NA NA
## # ℹ 7,230 more rows
## # ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
## # sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
## # sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
## # sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
## # sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
## # sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, …
Columns like sp_m_1524
contain the number of tuberculosis cases. The column names contain information about the method of diagnosis (rel
, sn
, sp
, or ep
), sex (f
or m
), and age group (014
= 0-14 years, 1524
= 15-24 years, etc.).
Ideally, we would like these variables to each have their own column.
who2
data so it looks like this:## # A tibble: 76,046 × 6
## country year diagnosis sex age_group count
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1997 sp m 1524 10
## 3 Afghanistan 1997 sp m 2534 6
## 4 Afghanistan 1997 sp m 3544 3
## 5 Afghanistan 1997 sp m 4554 5
## 6 Afghanistan 1997 sp m 5564 2
## 7 Afghanistan 1997 sp m 65 0
## 8 Afghanistan 1997 sp f 014 5
## 9 Afghanistan 1997 sp f 1524 38
## 10 Afghanistan 1997 sp f 2534 36
## # ℹ 76,036 more rows
diagnosis
, sex
, and age_group
count
column contains the number of casesThe household
data from the tidyr
package contains information about five different households and the children in each household:
## # A tibble: 5 × 5
## family dob_child1 dob_child2 name_child1 name_child2
## <int> <date> <date> <chr> <chr>
## 1 1 1998-11-26 2000-01-29 Susan Jose
## 2 2 1996-06-22 NA Mark <NA>
## 3 3 2002-07-11 2004-04-05 Sam Seth
## 4 4 2004-10-10 2009-08-27 Craig Khai
## 5 5 2000-12-05 2005-02-28 Parker Gracie
".value"
in the names_to
argument of the pivot_longer
function, reshape the household
data to look like this:## # A tibble: 9 × 4
## family child dob name
## <int> <chr> <date> <chr>
## 1 1 child1 1998-11-26 Susan
## 2 1 child2 2000-01-29 Jose
## 3 2 child1 1996-06-22 Mark
## 4 3 child1 2002-07-11 Sam
## 5 3 child2 2004-04-05 Seth
## 6 4 child1 2004-10-10 Craig
## 7 4 child2 2009-08-27 Khai
## 8 5 child1 2000-12-05 Parker
## 9 5 child2 2005-02-28 Gracie