Instructions: Work with a neighbor to answer the following questions. To get started, download the class activity template file.
Run the following R code to import data on literacy and GDP for each country:
library(tidyverse)
library(googlesheets4)
gs4_deauth()
gdp_data <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")
lit_data = read_sheet("https://docs.google.com/spreadsheets/d/1hDinTIRHQIaZg1RUn6Z_6mo12PtKwEPFIz_mJVF6P5I/pub?gid=0")
As in previous class activities, we can reshape the datasets:
gdp_data <- gdp_data |>
rename(country = starts_with("Income")) |>
pivot_longer(-country, names_to = "year", values_to = "gdp",
values_drop_na = TRUE)
lit_data <- lit_data |>
rename(country = starts_with("Adult")) |>
pivot_longer(-country, names_to = "year", values_to = "literacy",
values_drop_na = TRUE)
## # A tibble: 6 × 4
## country year literacy gdp
## <chr> <chr> <dbl> <dbl>
## 1 Albania 2001 98.3 1282.
## 2 Albania 2008 94.7 1804.
## 3 Albania 2011 95.7 1966.
## 4 Algeria 1987 35.8 1902.
## 5 Algeria 2002 60.1 1872.
## 6 Algeria 2006 63.9 2125.
In the next question, we will work with air quality data. There are two tables: the air_quality
table contains information on different air quality measurements for multiple locations, while the air_parameters
table contains information on the different air quality measurements.
The datasets can be loaded into R as follows:
air_quality <- read.csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_long.csv")
air_parameters <- read.csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_parameters.csv")
And loaded into Python as follows:
import pandas as pd
air_quality = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_long.csv")
air_parameters = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_parameters.csv")
air_quality
table. Write the code in both R and Python.In the next questions, we will return to data on NYC flights in 2013 from the nycflights13
package. We will use two tables from the package: the flights
table, which contains information on each flight, and the planes
table, which contains information on each plane (uniquely identified by tailnum
).
flights
table.Looking at the results from question 3, you will see that there are two year columns in the output of the join. This is because year
is a column in both flights
and planes
. However, we don’t want to join on year, because it represents different information (year of the flight, vs. year of manufacture).
Since we don’t join on year
, and R won’t allow multiple columns with the same name in one data frame, then the output has two year columns: year.x
(from flights
), and year.y
(from planes
). The year.y
column contains information on date of manufacture.
tailnum
) which flew from NYC airports in 2013?