Instructions: Work with a neighbor to answer the following questions. To get started, download the class activity template file.
Run the following R code to import data on GDP for each country (you may need to install the googlesheets4
package):
library(tidyverse)
library(googlesheets4)
gs4_deauth()
gdp_data <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")
Then import the data into Python:
melt
function in pandas to reshape the GDP data so it looks like this:## Country year gdp
## 4 Algeria 1960 1280.384828
## 10 Argentina 1960 5251.876754
## 13 Australia 1960 9407.685082
## 14 Austria 1960 7434.183717
## 16 Bahamas 1960 11926.460965
## ... ... ... ...
## 14271 Venezuela 2011 5671.912202
## 14275 Vietnam 2011 757.400929
## 14279 Yemen 2011 528.398673
## 14281 Zambia 2011 443.920631
## 14282 Zimbabwe 2011 347.745661
##
## [7988 rows x 3 columns]
country
year
and gdp
The tidyr
package contains a dataset called who2
which contains information on tuberculosis counts for each country in each year, broken down by method of diagnosis, sex, and age group:
## # A tibble: 7,240 × 58
## country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 1980 NA NA NA NA NA NA
## 2 Afghanistan 1981 NA NA NA NA NA NA
## 3 Afghanistan 1982 NA NA NA NA NA NA
## 4 Afghanistan 1983 NA NA NA NA NA NA
## 5 Afghanistan 1984 NA NA NA NA NA NA
## 6 Afghanistan 1985 NA NA NA NA NA NA
## 7 Afghanistan 1986 NA NA NA NA NA NA
## 8 Afghanistan 1987 NA NA NA NA NA NA
## 9 Afghanistan 1988 NA NA NA NA NA NA
## 10 Afghanistan 1989 NA NA NA NA NA NA
## # ℹ 7,230 more rows
## # ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
## # sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
## # sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
## # sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
## # sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
## # sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, …
Columns like sp_m_1524
contain the number of tuberculosis cases. The column names contain information about the method of diagnosis (rel
, sn
, sp
, or ep
), sex (f
or m
), and age group (014
= 0-14 years, 1524
= 15-24 years, etc.).
Ideally, we would like these variables to each have their own column.
Import the who2
data into Python:
who2
data in Python so it looks like this:## country year count diagnosis sex age_group
## 17 Afghanistan 1997.0 0.0 sp m 014
## 18 Afghanistan 1998.0 30.0 sp m 014
## 19 Afghanistan 1999.0 8.0 sp m 014
## 20 Afghanistan 2000.0 52.0 sp m 014
## 21 Afghanistan 2001.0 129.0 sp m 014
## ... ... ... ... ... .. ...
## 405269 Viet Nam 2013.0 3110.0 rel f 65
## 405303 Wallis and Futuna Islands 2013.0 2.0 rel f 65
## 405371 Yemen 2013.0 360.0 rel f 65
## 405405 Zambia 2013.0 669.0 rel f 65
## 405439 Zimbabwe 2013.0 725.0 rel f 65
##
## [76046 rows x 6 columns]
diagnosis
, sex
, and age_group
count
column contains the number of casesThe household
data below contains information about five different households and the children in each household:
household = pd.DataFrame({
'family': [1, 2, 3, 4, 5],
'dob_child1': ["1998-11-26", "1996-06-22", "2002-07-11", "2004-10-10", "2000-12-05"],
'dob_child2': ["2000-01-29", pd.NA, "2004-04-05", "2009-08-27", "2005-02-28"],
'name_child1': ["Susan", "Mark", "Sam", "Craig", "Parker"],
'name_child2': ["Jose", pd.NA, "Seth", "Khai", "Gracie"]
})
household
data in Python so it looks like this:## var dob name
## family child
## 1 child1 1998-11-26 Susan
## child2 2000-01-29 Jose
## 2 child1 1996-06-22 Mark
## 3 child1 2002-07-11 Sam
## child2 2004-04-05 Seth
## 4 child1 2004-10-10 Craig
## child2 2009-08-27 Khai
## 5 child1 2000-12-05 Parker
## child2 2005-02-28 Gracie