Lecture 11: Rectangular data

Content so far (in R and Python!)

Simulation
Iteration
Vectors (R), 1-d arrays (Python), and lists
Functions

What’s missing: actual data sets!

Learning goals

Review/refresh data manipulation from STA 112
Explore different data objects in R and Python
Work with more challenging data, requiring more difficult manipulation
Combine information from multiple datasets
Learn tools for different data types (strings, factors, dates and times)

Rectangular data

library(dplyr)
starwars

# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Rectangular data in R

In R, there are two main ways of storing rectangular data:

matrices
data frames

Matrices

A matrix generalizes a vector to two dimensions:

x <- matrix(c(1, 2, 3, 4, 5, 6), nrow=2)
x

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Each row is a vector
Each column is a vector

Indexing matrices

x <- matrix(c(1, 2, 3, 4, 5, 6), nrow=2)
x

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

x[1,]

[1] 1 3 5

x[,1]

[1] 1 2

x[1,2]

[1] 3

Use single square brackets [ ] to index
The first coordinate is the row, the second coordinate is the column

Uses and limitations of matrices

Correspond to the matrices we know and love from linear algebra
Usually the right way to store 2-d data for doing math (like matrix multiplication)
Like vectors, contain only one type of data

x <- matrix(c(1, 2, 3, 'a', 5, 6), nrow=2)
x

     [,1] [,2] [,3]
[1,] "1"  "3"  "5" 
[2,] "2"  "a"  "6"

Data frames

example_df <- data.frame(x = c(1, 2, 3),
                         y = c('a', 'b', 'c'))
example_df

  x y
1 1 a
2 2 b
3 3 c

Aside: what are data frames?

example_df <- data.frame(x = c(1, 2, 3),
                         y = c('a', 'b', 'c'))
typeof(example_df)

[1] "list"

Matrices are like a 2-d vector
Data frames are a special type of list! With some requirements:
- Each component is a vector
- Each component has the same length

Indexing data frames

[ ] can work for indexing data frames, just like matrices:

example_df <- data.frame(x = c(1, 2, 3),
                         y = c('a', 'b', 'c'))
example_df

  x y
1 1 a
2 2 b
3 3 c

example_df[2, 1]

[1] 2

Indexing data frames

Like lists, [[ ]] and $ can also be used:

example_df <- data.frame(x = c(1, 2, 3),
                         y = c('a', 'b', 'c'))
example_df

  x y
1 1 a
2 2 b
3 3 c

example_df$x

[1] 1 2 3

example_df[["x"]]

[1] 1 2 3

What do you do with a data frame?

Data manipulation and cleaning
Visualization
Input for modeling

Data manipulation

glimpse(starwars)

Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

What manipulation might I want to do with the starwars data?

`dplyr`: Tools for data wrangling

part of the tidyverse
provides a “grammar of data manipulation”: useful verbs (functions) for manipulating data
we will cover the key dplyr functions

Verbs for data wrangling

select(): take a subset of the columns (i.e., features, variables)
filter(): take a subset of the rows (i.e., observations)
mutate(): add or modify existing columns
arrange(): sort the rows
summarize(): aggregate the data across rows (e.g., group it according to some criteria)

Creating a subset of the rows

Question: Suppose I only want the droids in the starwars data. How would I choose only those rows?

Creating a subset of the rows

Question: Suppose I only want the droids in the starwars data. How would I choose only those rows?

filter(starwars, species == "Droid")

# A tibble: 6 × 14
  name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
  <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
4 IG-88     200   140 none       metal       red               15 none  masculi…
5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
6 BB8        NA    NA none       none        black             NA none  masculi…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Creating a subset of the rows

starwars |>
  filter(species == "Droid")

# A tibble: 2 × 14
  name  height  mass hair_color skin_color  eye_color birth_year sex   gender   
  <chr>  <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>    
1 C-3PO    167    75 <NA>       gold        yellow           112 none  masculine
2 R2-D2     96    32 <NA>       white, blue red               33 none  masculine
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

|> is called the pipe. It means “take <this>, THEN do <that>”
filter keeps only the rows which satisfy a specific condition

Calculating summary statistics

Question: What is the average height for droids in the dataset?

Calculating summary statistics

Question: What is the average height for droids in the dataset?

starwars |>
  filter(species == "Droid") |>
  summarize(mean_height = mean(height))

# A tibble: 1 × 1
  mean_height
        <dbl>
1          NA

pipes (|>) can be chained together
summarize calculates summary statistics
Why am I getting NA?

Handling missing values

# A tibble: 6 × 14
  name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
  <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
4 IG-88     200   140 none       metal       red               15 none  masculi…
5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
6 BB8        NA    NA none       none        black             NA none  masculi…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

starwars |>
  filter(species == "Droid") |>
  summarize(mean_height = mean(height, na.rm=T))

# A tibble: 1 × 1
  mean_height
        <dbl>
1        131.

Calculating summary statistics

Question: What if I want the average height for humans?

starwars |>
  filter(species == "Droid") |>
  summarize(mean_height = mean(height, na.rm=T))

Calculating summary statistics

Question: What if I want the average height for humans?

starwars |>
  filter(species == "Human") |>
  summarize(mean_height = mean(height, na.rm=T))

# A tibble: 1 × 1
  mean_height
        <dbl>
1        177.

Calculating summary statistics

Question: What is the average body mass for each species?

Calculating summary statistics

Question: What is the average body mass for each species?

starwars |>
  group_by(species) |>
  summarize(mean_height = mean(height, na.rm=T))

# A tibble: 38 × 2
   species   mean_height
   <chr>           <dbl>
 1 Aleena            79 
 2 Besalisk         198 
 3 Cerean           198 
 4 Chagrian         196 
 5 Clawdite         168 
 6 Droid            131.
 7 Dug              112 
 8 Ewok              88 
 9 Geonosian        183 
10 Gungan           209.
# ℹ 28 more rows

Creating new variables

Question: What is the distribution of the ratio of body mass to height?

Creating new variables

Question: What is the distribution of the ratio of body mass to height?

starwars |>
  mutate(body_ratio = mass/height)

Creating new variables

starwars |>
  mutate(body_ratio = mass/height) |>
  group_by(species) |>
  summarize(mean_ratio = mean(body_ratio, na.rm=T),
            sd_ratio = sd(body_ratio, na.rm=T))

# A tibble: 38 × 3
   species   mean_ratio sd_ratio
   <chr>          <dbl>    <dbl>
 1 Aleena         0.190  NA     
 2 Besalisk       0.515  NA     
 3 Cerean         0.414  NA     
 4 Chagrian     NaN      NA     
 5 Clawdite       0.327  NA     
 6 Droid          0.453   0.174 
 7 Dug            0.357  NA     
 8 Ewok           0.227  NA     
 9 Geonosian      0.437  NA     
10 Gungan         0.351   0.0207
# ℹ 28 more rows

Creating new variables

starwars |>
  mutate(body_ratio = mass/height) |>
  group_by(species) |>
  summarize(mean_ratio = mean(body_ratio, na.rm=T),
            sd_ratio = sd(body_ratio, na.rm=T),
            N = n())

# A tibble: 38 × 4
   species   mean_ratio sd_ratio     N
   <chr>          <dbl>    <dbl> <int>
 1 Aleena         0.190  NA          1
 2 Besalisk       0.515  NA          1
 3 Cerean         0.414  NA          1
 4 Chagrian     NaN      NA          1
 5 Clawdite       0.327  NA          1
 6 Droid          0.453   0.174      6
 7 Dug            0.357  NA          1
 8 Ewok           0.227  NA          1
 9 Geonosian      0.437  NA          1
10 Gungan         0.351   0.0207     3
# ℹ 28 more rows

Summary so far

filter: choose certain rows
summarize: calculate summary statistics
group_by: group rows together
mutate: create new columns

Data frames and tibbles

example_df <- data.frame(x = c(1, 2, 3),
                         y = c('a', 'b', 'c'))
class(example_df)

[1] "data.frame"

class(starwars)

[1] "tbl_df"     "tbl"        "data.frame"

Tibbles are special types of data frames, often used in tidyverse packages

Class activity

https://sta279-f23.github.io/class_activities/ca_lecture_11.html

Lecture 11: Rectangular data

Content so far (in R and Python!)

Learning goals

Rectangular data

Rectangular data in R

Matrices

Indexing matrices

Uses and limitations of matrices

Data frames

Aside: what are data frames?

Indexing data frames

Indexing data frames

What do you do with a data frame?

Data manipulation

dplyr: Tools for data wrangling

Verbs for data wrangling

Creating a subset of the rows

Creating a subset of the rows

Creating a subset of the rows

Calculating summary statistics

Calculating summary statistics

Handling missing values

Calculating summary statistics

Calculating summary statistics

Calculating summary statistics

Calculating summary statistics

Creating new variables

Creating new variables

Creating new variables

Creating new variables

Summary so far

Data frames and tibbles

Class activity

`dplyr`: Tools for data wrangling