Lecture 22: Strings and regular expressions
Recap: regular expressions
A regular expression is a pattern used to find matches in text.
Example: suppose I want to extract just the lecture number from the following file name. How would I do that?
"teaching/sta279-f23/slides/lecture_22.qmd"
str_extract("teaching/sta279-f23/slides/lecture_22.qmd", "\\d+")
str_extract("teaching/sta279-f23/slides/lecture_22.qmd", "_\\d+")
str_extract("teaching/sta279-f23/slides/lecture_22.qmd",
"(?<=_)\\d+")
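To see why the pattern gets refined twice, it may help to look at what each call returns. This is a small sketch that assumes the stringr package is loaded; the variable name path is only for illustration.

```r
library(stringr)

# Illustrative variable name; the string is the file name from the example
path <- "teaching/sta279-f23/slides/lecture_22.qmd"

# "\\d+" returns the first run of digits, which is the course number
first_digits <- str_extract(path, "\\d+")        # "279"

# Requiring a leading underscore finds the right digits but keeps the "_"
with_underscore <- str_extract(path, "_\\d+")    # "_22"

# A lookbehind checks for the "_" without including it in the match
lecture_number <- str_extract(path, "(?<=_)\\d+")  # "22"

print(c(first_digits, with_underscore, lecture_number))
```

str_extract() returns only the first match in each string, which is why the plain "\\d+" pattern stops at 279.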
Recap: regular expressions
Last time, we learned the following regular expression tools:
\d matches any digit (in R, we have to type \\d because we write the regex in a string)
. matches any character (except \n)
+ means “at least once”
(?<=) and (?=) are positive lookbehinds and lookaheads
| is alternation (one pattern or another)
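A short sketch that exercises each of these tools once, assuming stringr is loaded. The variable x and the file names "a.csv", "b.R", and "c.qmd" are made up for illustration.

```r
library(stringr)

x <- "lecture_22.qmd"  # illustrative string

one_digit  <- str_extract(x, "\\d")             # "2"  (a single digit)
all_digits <- str_extract(x, "\\d+")            # "22" (one or more digits)
any_char   <- str_extract(x, "_.")              # "_2" (underscore, then any one character)
lookahead  <- str_extract(x, "\\d+(?=\\.qmd)")  # "22" (digits followed by ".qmd")

# Alternation: TRUE when the string contains "csv" or "qmd"
either <- str_detect(c("a.csv", "b.R", "c.qmd"), "csv|qmd")  # TRUE FALSE TRUE

print(list(one_digit, all_digits, any_char, lookahead, either))
```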
Some helpful string functions
Example: Suppose I have the following file names:
file_names <- c("research/project1/code.R",
"research/project1/data.csv",
"teaching/sta279/lecture1.qmd",
"teaching/sta279/example_data.csv")
I want to identify the files in the research folder. What pattern would I want to match?
str_detect(file_names, "research")
[1] TRUE TRUE FALSE FALSE
str_subset(file_names, "research")
[1] "research/project1/code.R" "research/project1/data.csv"
str_view(file_names, "research")
[1] │ <research>/project1/code.R
[2] │ <research>/project1/data.csv
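The three functions answer the same question in different forms: str_detect() gives a logical vector, str_subset() gives the matching strings, and str_view() highlights the match. A minimal sketch of how the first two relate, assuming stringr is loaded (the name in_research is only for illustration):

```r
library(stringr)

file_names <- c("research/project1/code.R",
                "research/project1/data.csv",
                "teaching/sta279/lecture1.qmd",
                "teaching/sta279/example_data.csv")

# Logical vector: TRUE where the pattern is found
in_research <- str_detect(file_names, "research")

# Indexing with the logical vector gives the same result as str_subset()
same_result <- identical(file_names[in_research],
                         str_subset(file_names, "research"))  # TRUE

# Because TRUE counts as 1, sum() gives the number of matching files
n_research <- sum(in_research)  # 2

print(c(same_result, n_research))
```

The logical vector from str_detect() is the form that is convenient for counting or for combining conditions.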
Some helpful string functions
Example: Suppose I have the following file names:
file_names <- c("research/project1/code.R",
"research/project1/data.csv",
"teaching/sta279/lecture1.qmd",
"teaching/sta279/example_data.csv")
How would I select only the csv files?
str_subset(file_names, "csv")
[1] "research/project1/data.csv" "teaching/sta279/example_data.csv"
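The pattern "csv" works for these file names, but it matches the letters csv anywhere in the string. A hedged sketch of a stricter alternative, using hypothetical file names chosen to show the difference: "\\." matches a literal period and "$" matches the end of the string.

```r
library(stringr)

# Hypothetical file names (not from the lecture example)
tricky <- c("research/project1/data.csv",
            "research/csv_notes/readme.txt")

# "csv" is found in both strings, including the folder name "csv_notes"
loose <- str_subset(tricky, "csv")

# Requiring ".csv" at the end of the string keeps only the csv file
strict <- str_subset(tricky, "\\.csv$")  # "research/project1/data.csv"

print(loose)
print(strict)
```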
Some helpful string functions
Example: Suppose I have the following file names:
file_names <- c("research/project1/code.R",
"research/project1/data.csv",
"research/project2/sim_output.csv",
"teaching/sta279/lecture1.qmd",
"teaching/sta279/example_data.csv")
How would I select only the csv files in the research directory?
str_subset(file_names, "research.+csv")
[1] "research/project1/data.csv" "research/project2/sim_output.csv"
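In "research.+csv", the ".+" stands for whatever lies between the folder name and the extension. One way to check this is to extract the matched text; another way to make the same selection is to combine two simpler conditions. This is a sketch assuming stringr is loaded, and the names matched and keep are only for illustration.

```r
library(stringr)

file_names <- c("research/project1/code.R",
                "research/project1/data.csv",
                "research/project2/sim_output.csv",
                "teaching/sta279/lecture1.qmd",
                "teaching/sta279/example_data.csv")

# str_extract() shows the part of each string the pattern matched (NA if none)
matched <- str_extract(file_names, "research.+csv")

# The same selection as two conditions: starts with "research" (^ is the
# start of the string) and ends with "csv" ($ is the end of the string)
keep <- str_detect(file_names, "^research") & str_detect(file_names, "csv$")

print(matched)
print(file_names[keep])
```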
More regular expressions
strings <- c("apple", "banana", "raspberry",
"blackberry", "grrreat", "random")
How would I select just raspberry and blackberry?
str_view(strings, "berry")
[3] │ rasp<berry>
[4] │ black<berry>
More regular expressions
strings <- c("apple", "banana", "raspberry",
"blackberry", "grrreat", "random")
How would I select “raspberry”, “blackberry”, “grrreat”, and “random”?
str_view(strings, "r")
[3] │ <r>aspbe<r><r>y
[4] │ blackbe<r><r>y
[5] │ g<r><r><r>eat
[6] │ <r>andom
More regular expressions
strings <- c("apple", "banana", "raspberry",
"blackberry", "grrreat", "random")
How would I select just “raspberry”, “blackberry”, and “grrreat”?
str_view(strings, "r{2,}")
[3] │ raspbe<rr>y
[4] │ blackbe<rr>y
[5] │ g<rrr>eat
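The braces are quantifiers: {n} means exactly n repetitions, {n,} means n or more, and {n,m} means between n and m. A brief sketch comparing them, assuming stringr is loaded. The pattern "rr+" is included only as an equivalent way to write "r{2,}".

```r
library(stringr)

strings <- c("apple", "banana", "raspberry",
             "blackberry", "grrreat", "random")

two_or_more <- str_subset(strings, "r{2,}")  # "raspberry" "blackberry" "grrreat"
rr_plus     <- str_subset(strings, "rr+")    # same strings: an r, then at least one more r
exactly_3   <- str_subset(strings, "r{3}")   # "grrreat"

# {1,2} matches one or two r's, taking two when it can
one_or_two <- str_extract(strings, "r{1,2}")  # NA NA "r" "rr" "rr" "r"

print(two_or_more)
print(exactly_3)
print(one_or_two)
```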
More regular expressions
strings <- c("apple", "banana", "raspberry",
"blackberry", "grrreat", "random")
How would I select just “grrreat”?
str_view(strings, "r{3}")
More regular expressions
strings <- c("apple", "banana", "raspberry",
"blackberry", "grrreat", "random")
How would I select “apple”, “raspberry”, or “blackberry”?
str_view(strings, "(.)\\1{1}")
[1] │ a<pp>le
[3] │ raspbe<rr>y
[4] │ blackbe<rr>y
[5] │ g<rr>reat
Note that this pattern also matches “grrreat”, because it contains a doubled letter.
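Here the parentheses capture one character and \\1 refers back to whatever was captured, so the pattern means "any character, followed by the same character again". A small sketch, assuming stringr is loaded, showing that the {1} is optional since \\1 on its own already means one repetition:

```r
library(stringr)

strings <- c("apple", "banana", "raspberry",
             "blackberry", "grrreat", "random")

# (.) captures any one character; \\1 must equal the captured character
with_quantifier    <- str_subset(strings, "(.)\\1{1}")
without_quantifier <- str_subset(strings, "(.)\\1")

# Both patterns select the same strings
same <- identical(with_quantifier, without_quantifier)  # TRUE

# The doubled letter found in each string (NA when there is none)
doubled <- str_extract(strings, "(.)\\1")  # "pp" NA "rr" "rr" "rr" NA

print(without_quantifier)
print(doubled)
```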
More regular expressions
strings <- c("papa", "banana", "memento",
"blackberry", "grrreat", "random")
How would I select “papa”, “banana”, and “memento”?
str_view(strings, "(..)\\1{1}")
[1] │ <papa>
[2] │ b<anan>a
[3] │ <meme>nto
str_view(strings, "(..)+")
[1] │ <papa>
[2] │ <banana>
[3] │ <mement>o
[4] │ <blackberry>
[5] │ <grrrea>t
[6] │ <random>
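The two patterns differ in what is repeated. With "(..)\\1" the same two characters must appear twice in a row, while "(..)+" only asks for one or more pairs of any characters, so every string with at least two characters matches. A sketch of the contrast, assuming stringr is loaded:

```r
library(stringr)

strings <- c("papa", "banana", "memento",
             "blackberry", "grrreat", "random")

# Backreference: the captured pair must be repeated exactly
repeated_pair <- str_detect(strings, "(..)\\1")  # TRUE TRUE TRUE FALSE FALSE FALSE

# Quantifier on the group: any pairs of characters, one or more times
any_pairs <- str_detect(strings, "(..)+")        # TRUE for every string

# The repeated text that the backreference pattern found
found <- str_extract(strings, "(..)\\1")         # "papa" "anan" "meme" NA NA NA

print(repeated_pair)
print(any_pairs)
print(found)
```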
More regular expressions
strings <- c("papa", "banana", "memento",
"blackberry", "toboggan", "random")
How would I select “banana” and “blackberry”?
str_view(strings, "^b")
[2] │ <b>anana
[4] │ <b>lackberry
More regular expressions
strings <- c("papa", "banana", "memento",
"blackberry", "toboggan", "random")
How would I select “papa” and “banana”?
str_view(strings, "a$")
[1] │ pap<a>
[2] │ banan<a>
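The last two examples rely on anchors: ^ matches the start of a string and $ matches the end. A short sketch comparing the anchored and unanchored patterns, assuming stringr is loaded:

```r
library(stringr)

strings <- c("papa", "banana", "memento",
             "blackberry", "toboggan", "random")

# Without an anchor, "b" matches anywhere, so "toboggan" is included
any_b   <- str_subset(strings, "b")   # "banana" "blackberry" "toboggan"
start_b <- str_subset(strings, "^b")  # "banana" "blackberry"

# Without an anchor, "a" matches five of the six strings
any_a <- str_subset(strings, "a")
end_a <- str_subset(strings, "a$")    # "papa" "banana"

print(start_b)
print(end_a)
```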
More regular expressions
"The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$"
How would I extract $\mu$ and $\mu = \frac{1}{n} \sum_i x_i$?
str_extract("The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$",
"\\$.+\\$")
[1] "$\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$"
str_extract_all("The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$",
"\\$[^\\$]+\\$")
[[1]]
[1] "$\\mu$" "$\\mu = \\frac{1}{n} \\sum_i x_i$"
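The first attempt returns one long match because + is greedy: ".+" takes as many characters as it can, so it runs to the last dollar sign. The negated class [^\\$] fixes this by refusing to cross a dollar sign. A sketch comparing these with a lazy quantifier, ".+?", which is offered here only as an alternative and takes as few characters as it can (assuming stringr is loaded; x is an illustrative name):

```r
library(stringr)

x <- "The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$"

# Greedy: one match, from the first "$" to the last "$"
greedy <- str_extract_all(x, "\\$.+\\$")[[1]]

# Negated character class: anything except "$" between the dollar signs
negated <- str_extract_all(x, "\\$[^\\$]+\\$")[[1]]

# Lazy quantifier: stop at the first closing "$" that works
lazy <- str_extract_all(x, "\\$.+?\\$")[[1]]

print(length(greedy))  # 1
print(negated)
print(identical(negated, lazy))  # TRUE
```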
More regular expressions
"The current date (today) is November 3 [2007]."
How would I extract “(today)” and “[2007]”?
str_extract_all("The current date (today) is November 3 [2007].",
"[\\(\\[][^\\)\\]]+[\\)\\]]")
[[1]]
[1] "(today)" "[2007]"
What if I just want “today” and “2007”?
str_extract_all("The current date (today) is November 3 [2007].",
"(?<=[\\(\\[])[^\\)\\]]+(?=[\\)\\]])")
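The lookbehind and lookahead require the brackets to be present without including them in the match. A sketch of the result, assuming stringr is loaded. The string "(mixed]" is a made-up example showing that this pattern does not check that the opening and closing brackets are the same type.

```r
library(stringr)

x <- "The current date (today) is November 3 [2007]."

# Preceded by "(" or "[", followed by ")" or "]", brackets not included
pattern <- "(?<=[\\(\\[])[^\\)\\]]+(?=[\\)\\]])"

inside <- str_extract_all(x, pattern)[[1]]  # "today" "2007"

# Hypothetical input: mismatched brackets are accepted too
mismatched <- str_extract("(mixed]", pattern)  # "mixed"

print(inside)
print(mismatched)
```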
More regular expressions
"The current date (today) is November 3 [2007]."
What if I only want the words?
str_extract_all("The current date (today) is November 3 [2007].",
"\\w+")
[[1]]
[1] "The" "current" "date" "today" "is" "November" "3"
[8] "2007"
str_replace_all("The current date (today) is November 3 [2007].",
"[^\\w\\s]", "")
[1] "The current date today is November 3 2007"
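Here \\w matches a "word" character (letters, digits, and the underscore) and \\s matches whitespace, so [^\\w\\s] is everything that is neither. A sketch showing that the two approaches agree for this sentence, assuming stringr is loaded. The file name "sim_output.csv" is reused only to illustrate how \\w treats the underscore.

```r
library(stringr)

x <- "The current date (today) is November 3 [2007]."

# Approach 1: pull out each run of word characters
words <- str_extract_all(x, "\\w+")[[1]]

# Approach 2: delete everything that is not a word character or whitespace
cleaned <- str_replace_all(x, "[^\\w\\s]", "")

# Pasting the words back together gives the cleaned sentence
agree <- identical(str_c(words, collapse = " "), cleaned)  # TRUE

# The underscore counts as a word character, the period does not
pieces <- str_extract_all("sim_output.csv", "\\w+")[[1]]  # "sim_output" "csv"

print(length(words))  # 8
print(cleaned)
print(pieces)
```

The first approach returns the words as separate strings, while the second keeps the sentence as a single string.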