38.8 Mb
The nycflights13
package contains information on flights from NYC airports in 2013. The data is stored across several data frames:
airlines
: information on each airlineairports
: information on each airportflights
: information on each flightplanes
: information on each planeweather
: hourly weather dataThe nycflights13
package contains a small subset of a database on 48 million flights. The airlines
database includes the following tables:
airports
carriers
flights
planes
This data is too big to store locally, but can be on servers which we can access remotely.
https://sta279-f23.github.io/class_activities/ca_lecture_18_warmup.html
What do you think each part of this query is doing?
name N pct_ontime
1 Delta Air Lines Inc. 2396 0.8689
2 Virgin America 347 0.8329
3 JetBlue Airways 3463 0.8169
4 American Airlines Inc. 1397 0.7817
carrier name
1 02Q Titan Airways
2 04Q Tradewind Aviation
3 05Q Comlux Aviation, AG
4 06Q Master Top Linhas Aereas Ltd.
5 07Q Flair Airlines Ltd.
6 09Q Swift Air, LLC
7 0BQ DCA
8 0CQ ACM AIR CHARTER GmbH
9 0GQ Inter Island Airways, d/b/a Inter Island Air
10 0HQ Polar Airlines de Mexico d/b/a Nova Air
SELECT
: the columns to be retrievedFROM
: the table containing the dataLIMIT
: limit the rows to returnWhat if I want the year
, origin
, dest
, dep_delay
, and arr_delay
columns from the flights
table?
What if I want the year
, origin
, dest
, dep_delay
, and arr_delay
columns from the flights
table?
year origin dest dep_delay arr_delay
1 2010 EWR OMA 181 159
2 2010 FLL SWF 281 256
3 2010 JFK SJU 8 5
4 2010 IAD BNA 125 112
5 2010 LAX FAT 82 77
What if I also want to calculate the difference between arrival delay and departure delay?
What if I also want to calculate the difference between arrival delay and departure delay?
year origin dest dep_delay arr_delay delay_diff
1 2010 EWR OMA 181 159 -22
2 2010 FLL SWF 281 256 -25
3 2010 JFK SJU 8 5 -3
What are the equivalent dplyr
functions?
Back to our original SQL query:
N pct_ontime
1 47932811 0.8222
SELECT can also be used to calculate summary statistics. For example, if we want the average departure delay:
mean_dep_delay
1 8.9586
Now suppose that I only want the mean departure delay for flights from EWR in 2013:
mean_dep_delay
1 14.703
What do you think should I do if I want the mean delay for each airport in November 2013?
mean_dep_delay
1 6.3220
2 2.2489
3 6.7138
4 -4.7167
5 1.6506
6 7.0526
7 2.3741
8 21.8136
9 -12.7778
10 -2.9286
Do you notice anything about this output?
origin mean_dep_delay
1 ABE 6.3220
2 ABI 2.2489
3 ABQ 6.7138
4 ABR -4.7167
5 ABY 1.6506
6 ACK 7.0526
7 ACT 2.3741
8 ACV 21.8136
9 ADK -12.7778
10 ADQ -2.9286
https://sta279-f23.github.io/class_activities/ca_lecture_18.html