Instructions: Work with a neighbor to answer the following questions. To get started, download the class activity template file.

GDP data

Run the following R code to import data on GDP for each country (you may need to install the googlesheets4 package):

library(tidyverse)
library(googlesheets4)
gs4_deauth()
gdp_data <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")

Then import the data into Python:

import pandas as pd

gdp_data = r.gdp_data

Use the melt function in pandas to reshape the GDP data so it looks like this:

##          Country  year           gdp
## 4        Algeria  1960   1280.384828
## 10     Argentina  1960   5251.876754
## 13     Australia  1960   9407.685082
## 14       Austria  1960   7434.183717
## 16       Bahamas  1960  11926.460965
## ...          ...   ...           ...
## 14271  Venezuela  2011   5671.912202
## 14275    Vietnam  2011    757.400929
## 14279      Yemen  2011    528.398673
## 14281     Zambia  2011    443.920631
## 14282   Zimbabwe  2011    347.745661
## 
## [7988 rows x 3 columns]

There should be no NAs
The first column should be renamed country
There should be separate columns for year and gdp

WHO data

The tidyr package contains a dataset called who2 which contains information on tuberculosis counts for each country in each year, broken down by method of diagnosis, sex, and age group:

who2

## # A tibble: 7,240 × 58
##    country      year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
##    <chr>       <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
##  1 Afghanistan  1980       NA        NA        NA        NA        NA        NA
##  2 Afghanistan  1981       NA        NA        NA        NA        NA        NA
##  3 Afghanistan  1982       NA        NA        NA        NA        NA        NA
##  4 Afghanistan  1983       NA        NA        NA        NA        NA        NA
##  5 Afghanistan  1984       NA        NA        NA        NA        NA        NA
##  6 Afghanistan  1985       NA        NA        NA        NA        NA        NA
##  7 Afghanistan  1986       NA        NA        NA        NA        NA        NA
##  8 Afghanistan  1987       NA        NA        NA        NA        NA        NA
##  9 Afghanistan  1988       NA        NA        NA        NA        NA        NA
## 10 Afghanistan  1989       NA        NA        NA        NA        NA        NA
## # ℹ 7,230 more rows
## # ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
## #   sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
## #   sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
## #   sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
## #   sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
## #   sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, …

Columns like sp_m_1524 contain the number of tuberculosis cases. The column names contain information about the method of diagnosis (rel, sn, sp, or ep), sex (f or m), and age group (014 = 0-14 years, 1524 = 15-24 years, etc.).

Ideally, we would like these variables to each have their own column.

Import the who2 data into Python:

who2 = r.who2

Reshape the who2 data in Python so it looks like this:

##                           country    year   count diagnosis sex age_group
## 17                    Afghanistan  1997.0     0.0        sp   m       014
## 18                    Afghanistan  1998.0    30.0        sp   m       014
## 19                    Afghanistan  1999.0     8.0        sp   m       014
## 20                    Afghanistan  2000.0    52.0        sp   m       014
## 21                    Afghanistan  2001.0   129.0        sp   m       014
## ...                           ...     ...     ...       ...  ..       ...
## 405269                   Viet Nam  2013.0  3110.0       rel   f        65
## 405303  Wallis and Futuna Islands  2013.0     2.0       rel   f        65
## 405371                      Yemen  2013.0   360.0       rel   f        65
## 405405                     Zambia  2013.0   669.0       rel   f        65
## 405439                   Zimbabwe  2013.0   725.0       rel   f        65
## 
## [76046 rows x 6 columns]

There should be no NAs
There should be separate columns for diagnosis, sex, and age_group
The count column contains the number of cases

household data

The household data below contains information about five different households and the children in each household:

household = pd.DataFrame({
  'family': [1, 2, 3, 4, 5],
  'dob_child1': ["1998-11-26", "1996-06-22", "2002-07-11", "2004-10-10", "2000-12-05"],
  'dob_child2': ["2000-01-29", pd.NA, "2004-04-05", "2009-08-27", "2005-02-28"],
  'name_child1': ["Susan", "Mark", "Sam", "Craig", "Parker"],
  'name_child2': ["Jose", pd.NA, "Seth", "Khai", "Gracie"]
})

Reshape the household data in Python so it looks like this:

## var                   dob    name
## family child                     
## 1      child1  1998-11-26   Susan
##        child2  2000-01-29    Jose
## 2      child1  1996-06-22    Mark
## 3      child1  2002-07-11     Sam
##        child2  2004-04-05    Seth
## 4      child1  2004-10-10   Craig
##        child2  2009-08-27    Khai
## 5      child1  2000-12-05  Parker
##        child2  2005-02-28  Gracie

Class Activity

GDP data

WHO data

household data