Lecture 22: Strings and regular expressions

Recap: regular expressions

A regular expression is a pattern used to find matches in text.

Example: suppose I want to extract just the lecture number from the following file name. How would I do that?

"teaching/sta279-f23/slides/lecture_22.qmd"

str_extract("teaching/sta279-f23/slides/lecture_22.qmd", "\\d+")
[1] "279"
str_extract("teaching/sta279-f23/slides/lecture_22.qmd", "_\\d+")
[1] "_22"
str_extract("teaching/sta279-f23/slides/lecture_22.qmd", 
            "(?<=_)\\d+")
[1] "22"

Recap: regular expressions

Last time, we learned the following regular expression tools:

  • \d matches any digit (in R, we type \\d because the regex is written inside a string, where the backslash itself must be escaped)
  • . matches any character (except \n)
  • + means “one or more times” (applied to the preceding item)
  • (?<=) and (?=) are positive lookbehinds and lookaheads
  • | is alternation (one pattern or another)
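
As a quick refresher, the sketch below applies these tools to the file name from the earlier example. It assumes the stringr package is installed; the variable names and the "slides|homework" pattern are illustrative and not part of the original slides.

```r
library(stringr)

path <- "teaching/sta279-f23/slides/lecture_22.qmd"

# . and +: "lecture" followed by one or more of any character
rest <- str_extract(path, "lecture.+")

# Lookahead: digits that are immediately followed by ".qmd"
num <- str_extract(path, "\\d+(?=\\.qmd)")

# Alternation: match either "slides" or "homework"
folder <- str_extract(path, "slides|homework")

print(c(rest, num, folder))
```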

Recap: tools for working with strings

So far, we have learned the following:

  • str_extract extracts the first match
str_extract("teaching/sta279-f23/slides/lecture_22.qmd", "\\d+")
[1] "279"
  • str_extract_all extracts all matches
str_extract_all("teaching/sta279-f23/slides/lecture_22.qmd", "\\d+")
[[1]]
[1] "279" "23"  "22" 
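
Both functions are vectorized, which the single-string examples above do not show. The sketch below uses a few made-up file names (an assumption, not from the slides) to illustrate the different return types.

```r
library(stringr)

# Hypothetical file names, used only for illustration
paths <- c("slides/lecture_22.qmd", "hw/hw_03_part_2.qmd", "README.md")

# str_extract returns one value per input string (NA when nothing matches)
first_num <- str_extract(paths, "\\d+")

# str_extract_all returns a list with one character vector per input string
all_nums <- str_extract_all(paths, "\\d+")

print(first_num)
print(lengths(all_nums))  # number of matches found in each string
```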

Goal for today: learn more string and regex tools!

Some helpful string functions

Example: Suppose I have the following file names:

file_names <- c("research/project1/code.R", 
                "research/project1/data.csv",
                "teaching/sta279/lecture1.qmd",
                "teaching/sta279/example_data.csv")

I want to identify the files in the research folder. What pattern would I want to match?

str_detect(file_names, "research")
[1]  TRUE  TRUE FALSE FALSE

str_subset(file_names, "research")
[1] "research/project1/code.R"   "research/project1/data.csv"

str_view(file_names, "research")
[1] │ <research>/project1/code.R
[2] │ <research>/project1/data.csv
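
The three functions above are closely related. As a minimal sketch (assuming stringr is loaded), the code below shows that str_subset gives the same result as indexing with str_detect. It also uses str_which and the negate argument, which the slides do not cover.

```r
library(stringr)

file_names <- c("research/project1/code.R",
                "research/project1/data.csv",
                "teaching/sta279/lecture1.qmd",
                "teaching/sta279/example_data.csv")

# str_detect gives a logical vector, which can be used for indexing
in_research <- str_detect(file_names, "research")
by_index <- file_names[in_research]

# str_subset does the detection and the indexing in one step
by_subset <- str_subset(file_names, "research")

# str_which returns the positions of the matching strings
positions <- str_which(file_names, "research")

# negate = TRUE keeps the strings that do NOT match
not_research <- str_subset(file_names, "research", negate = TRUE)

print(by_subset)
print(positions)
print(not_research)
```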

Some helpful string functions

Example: Suppose I have the following file names:

file_names <- c("research/project1/code.R", 
                "research/project1/data.csv",
                "teaching/sta279/lecture1.qmd",
                "teaching/sta279/example_data.csv")

How would I select only the csv files?

str_subset(file_names, "csv")
[1] "research/project1/data.csv"       "teaching/sta279/example_data.csv"
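
The pattern "csv" works here, but it matches those three letters anywhere in the string. The sketch below adds a hypothetical file (not in the original example) to show how a stricter pattern, with an escaped dot and the end-of-string anchor $, behaves differently.

```r
library(stringr)

# The last entry is a hypothetical file added to show the difference
file_names <- c("research/project1/data.csv",
                "teaching/sta279/example_data.csv",
                "research/csv_notes/readme.txt")

# "csv" matches anywhere in the string, including the folder name
loose <- str_subset(file_names, "csv")

# "\\.csv$" requires a literal ".csv" at the very end of the string
strict <- str_subset(file_names, "\\.csv$")

print(loose)
print(strict)
```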

Some helpful string functions

Example: Suppose I have the following file names:

file_names <- c("research/project1/code.R", 
                "research/project1/data.csv",
                "research/project2/sim_output.csv",
                "teaching/sta279/lecture1.qmd",
                "teaching/sta279/example_data.csv")

How would I select only the csv files in the research directory?

str_subset(file_names, "research.+csv")
[1] "research/project1/data.csv"       "research/project2/sim_output.csv"
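
There is more than one way to express "in the research directory and a csv file". The sketch below shows two alternatives to "research.+csv": a single anchored pattern, and two simple patterns combined with a logical AND. Both are illustrations rather than the approach used in the slides.

```r
library(stringr)

file_names <- c("research/project1/code.R",
                "research/project1/data.csv",
                "research/project2/sim_output.csv",
                "teaching/sta279/lecture1.qmd",
                "teaching/sta279/example_data.csv")

# Option 1: one anchored pattern (starts with "research/", ends with ".csv")
one_pattern <- str_subset(file_names, "^research/.+\\.csv$")

# Option 2: two simple patterns combined with a logical AND
is_research <- str_detect(file_names, "^research/")
is_csv <- str_detect(file_names, "\\.csv$")
two_patterns <- file_names[is_research & is_csv]

print(one_pattern)
```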

More regular expressions

strings <- c("apple", "banana", "raspberry", 
             "blackberry", "grrreat", "random")

How would I select just raspberry and blackberry?

str_view(strings, "berry")
[3] │ rasp<berry>
[4] │ black<berry>

More regular expressions

strings <- c("apple", "banana", "raspberry", 
             "blackberry", "grrreat", "random")

How would I select “raspberry”, “blackberry”, “grrreat”, and “random”?

str_view(strings, "r")
[3] │ <r>aspbe<r><r>y
[4] │ blackbe<r><r>y
[5] │ g<r><r><r>eat
[6] │ <r>andom

More regular expressions

strings <- c("apple", "banana", "raspberry", 
             "blackberry", "grrreat", "random")

How would I select just “raspberry”, “blackberry”, and “grrreat”?

str_view(strings, "rr+")
[3] │ raspbe<rr>y
[4] │ blackbe<rr>y
[5] │ g<rrr>eat
str_view(strings, "r{2,}")
[3] │ raspbe<rr>y
[4] │ blackbe<rr>y
[5] │ g<rrr>eat
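
The braces can also give a range. The sketch below compares {2,} with {1,2} on the same strings, using str_extract so the matched text is returned as a vector. The {1,2} pattern is an added illustration.

```r
library(stringr)

strings <- c("apple", "banana", "raspberry",
             "blackberry", "grrreat", "random")

# {2,}: two or more r's in a row (same matches as "rr+")
two_plus <- str_extract(strings, "r{2,}")

# {1,2}: one or two r's; the quantifier is greedy,
# so it takes two when two are available
one_to_two <- str_extract(strings, "r{1,2}")

print(two_plus)
print(one_to_two)
```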

More regular expressions

strings <- c("apple", "banana", "raspberry", 
             "blackberry", "grrreat", "random")

How would I select just “grrreat”?

str_view(strings, "r{3}")
[5] │ g<rrr>eat

More regular expressions

strings <- c("apple", "banana", "raspberry", 
             "blackberry", "grrreat", "random")

How would I select “apple”, “raspberry”, or “blackberry”?

str_view(strings, "(.)\\1{1}")
[1] │ a<pp>le
[3] │ raspbe<rr>y
[4] │ blackbe<rr>y
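[5] │ g<rr>reat

Note: this pattern also matches “grrreat”, because it contains a doubled letter too. The {1} quantifier is redundant: "(.)\\1" matches the same strings.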
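
To see what the capture group actually captured, one option is str_match, which returns the full match together with each group. This is a minimal sketch, and str_match is not used elsewhere in these slides.

```r
library(stringr)

strings <- c("apple", "banana", "raspberry",
             "blackberry", "grrreat", "random")

# (.) captures any single character; \\1 requires that same character again
doubled <- str_subset(strings, "(.)\\1")

# str_match returns a matrix: column 1 is the full match,
# column 2 is the text captured by the first group
m <- str_match(strings, "(.)\\1")

print(doubled)
print(m)
```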

More regular expressions

strings <- c("papa", "banana", "memento", 
             "blackberry", "grrreat", "random")

How would I select “papa”, “banana”, and “memento”?

str_view(strings, "(..)\\1{1}")
[1] │ <papa>
[2] │ b<anan>a
[3] │ <meme>nto
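
Without the backreference, the pattern matches any run of character pairs, so every string matches:
str_view(strings, "(..)+")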
[1] │ <papa>
[2] │ <banana>
[3] │ <mement>o
[4] │ <blackberry>
[5] │ <grrrea>t
[6] │ <random>
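
The difference between repeating a group and referring back to it can be checked directly. The sketch below uses a small set of test strings (chosen for illustration, not from the slides) and compares "(..)\\1" with "(..){2}".

```r
library(stringr)

# Illustrative test strings: only the first two repeat a pair of characters
strings <- c("papa", "memento", "pact", "random")

# (..)\\1: a pair of characters followed by the SAME pair again
same_pair <- str_detect(strings, "(..)\\1")

# (..){2}: any two pairs in a row, i.e. any four characters
any_pairs <- str_detect(strings, "(..){2}")

print(same_pair)
print(any_pairs)
```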

More regular expressions

strings <- c("papa", "banana", "memento", 
             "blackberry", "toboggan", "random")

How would I select “banana” and “blackberry”?

str_view(strings, "^b")
[2] │ <b>anana
[4] │ <b>lackberry

More regular expressions

strings <- c("papa", "banana", "memento", 
             "blackberry", "toboggan", "random")

How would I select “papa” and “banana”?

str_view(strings, "a$")
[1] │ pap<a>
[2] │ banan<a>
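
The two anchors can be combined to constrain the whole string. The patterns in the sketch below are added examples, and they use * (zero or more) and {m} (exactly m times) in addition to the anchors.

```r
library(stringr)

strings <- c("papa", "banana", "memento",
             "blackberry", "toboggan", "random")

# Starts with "b" AND ends with "y", with anything (or nothing) in between
b_to_y <- str_subset(strings, "^b.*y$")

# Exactly six characters: the anchors stop the pattern from matching
# a six-character piece of a longer string
six_chars <- str_subset(strings, "^.{6}$")

print(b_to_y)
print(six_chars)
```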

More regular expressions

"The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$"

How would I extract $\mu$ and $\mu = \frac{1}{n} \sum_i x_i$?

str_extract("The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$",
            "\\$.+\\$")
[1] "$\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$"

str_extract_all("The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$",
            "\\$[^\\$]+\\$")
[[1]]
[1] "$\\mu$"                            "$\\mu = \\frac{1}{n} \\sum_i x_i$"
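
The first attempt fails because + is greedy: it matches as much text as it can. The sketch below uses a shorter, made-up string to compare the greedy pattern, the negated character class, and a lazy quantifier (+?). The lazy quantifier is an alternative not covered in these slides.

```r
library(stringr)

# A shorter, hypothetical string with two $...$ pieces
x <- "prices: $a$ and $b + c$"

# Greedy: .+ grabs as much as possible, so the match runs to the last $
greedy <- str_extract_all(x, "\\$.+\\$")[[1]]

# Negated character class: [^\\$]+ cannot cross a $, so it stops at the next one
negated <- str_extract_all(x, "\\$[^\\$]+\\$")[[1]]

# Lazy quantifier: .+? grabs as little as possible
lazy <- str_extract_all(x, "\\$.+?\\$")[[1]]

print(greedy)
print(negated)
print(lazy)
```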

More regular expressions

"The current date (today) is November 3 [2007]."

How would I extract “(today)” and “[2007]”?

str_extract_all("The current date (today) is November 3 [2007].",
                "[\\(\\[][^\\)\\]]+[\\)\\]]")
[[1]]
[1] "(today)" "[2007]" 

What if I just want “today” and “2007”?

More regular expressions

"The current date (today) is November 3 [2007]."
str_extract_all("The current date (today) is November 3 [2007].",
                "(?<=[\\(\\[])[^\\)\\]]+(?=[\\)\\]])")
[[1]]
[1] "today" "2007" 
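
Lookarounds are one way to drop the brackets. Another option, sketched below, is to wrap the inner text in a capture group and pull it out with str_match_all. That function is not used in the slides, so treat this as an alternative.

```r
library(stringr)

x <- "The current date (today) is November 3 [2007]."

# The parentheses in the middle form a capture group around the inner text
m <- str_match_all(x, "[\\(\\[]([^\\)\\]]+)[\\)\\]]")[[1]]

# Column 1 holds the full match, column 2 holds the captured group
full <- m[, 1]
inner <- m[, 2]

print(full)
print(inner)
```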

More regular expressions

"The current date (today) is November 3 [2007]."

What if I only want the words?

str_extract_all("The current date (today) is November 3 [2007].",
                "\\w+")
[[1]]
[1] "The"      "current"  "date"     "today"    "is"       "November" "3"       
[8] "2007"    

str_replace_all("The current date (today) is November 3 [2007].",
                "[^\\w\\s]", "")
[1] "The current date today is November 3 2007"
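
Note that \w counts digits as word characters, so "3" and "2007" are kept. If "words" should mean letters only, one possible approach is the [[:alpha:]] character class, which is an addition to the tools shown in the slides.

```r
library(stringr)

x <- "The current date (today) is November 3 [2007]."

# \\w+ treats digits as word characters, so "3" and "2007" are included
with_digits <- str_extract_all(x, "\\w+")[[1]]

# [[:alpha:]]+ matches runs of letters only
letters_only <- str_extract_all(x, "[[:alpha:]]+")[[1]]

# Remove everything that is not a word character or whitespace
no_punct <- str_replace_all(x, "[^\\w\\s]", "")

print(letters_only)
print(no_punct)
```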

A list of some other useful tools

  • * means “appears 0 or more times”
  • {m} means “appears exactly m times”; {m,} means “appears at least m times”
  • \b is a word boundary (use \\b in R)
  • \w is any alphanumeric character, or underscore (use \\w in R)
  • ( ) is a capture group
  • [ ] is a set of characters
  • \s matches any whitespace character, such as a space, tab, or newline (use \\s in R)
  • ^ anchors at the beginning, $ anchors at the end
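
A few of the tools in this list have not appeared in an example yet. The sketch below tries *, \b, and \s on small made-up strings.

```r
library(stringr)

# *: "a", then zero or more "b", then "c"
star <- str_detect(c("ac", "abc", "abbbc", "adc"), "ab*c")

# \\b: match "cat" only as a whole word
whole_word <- str_extract_all("the cat scattered a catalog", "\\bcat\\b")[[1]]

# \\s+: collapse any run of whitespace into a single space
squeezed <- str_replace_all("a  b\tc", "\\s+", " ")

print(star)
print(whole_word)
print(squeezed)
```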