7 R.7: String & RegExp
7.1 Introduction
In the previous sessions, we often overlooked a particular type of data: the string. In R, a sequence of characters is stored as a string.
In this session, you will learn the distinctive features of the string type and how to work with strings of characters within a programming language that is itself made of particular strings of characters, such as function and variable names.
As usual, we will need the tidyverse library.
Solution
library(tidyverse)
7.2 String basics
7.2.1 String definition
A string can be defined within double quotes (") or single quotes ('):
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
If you forget to close a quote, you’ll see +, the continuation character:
> "This is a string without a closing quote
+
+
+ HELP I'M STUCK
If this happens to you, press Escape and try again!
To include a literal single or double quote in a string, you can use \ to escape it:
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
If you want to include a literal backslash, you’ll need to double it up: "\\".
7.2.2 String representation
The printed representation of a string is not the same as a string itself:
x <- c("\"", "\\")
x
[1] "\"" "\\"
writeLines(x)
"
\
Some characters have a special representation; they are called special characters. The most common are "\n" (newline) and "\t" (tab), but you can see the complete list by requesting help on ": ?'"'
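As a minimal sketch of the difference between what a string contains and how it is printed, the example below builds a string with a tab and a newline. The variable name special is only an illustration and does not come from the course material.

```r
library(stringr)

# "\t" and "\n" are each a single character, even though we type two symbols
special <- "a\tb\nc"

# print() shows the escaped representation of the string
print(special)

# writeLines() shows the raw content: the tab and the line break are rendered
writeLines(special)

# a, tab, b, newline, c: five characters in total
n_chars <- str_length(special)
n_chars
```

Counting the characters is a quick way to check whether an escape sequence was understood as one special character or as two ordinary ones.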
7.2.3 String operation
You can perform basic operations on strings, such as:
- String length
str_length(c("a", "R for data science", NA))
[1] 1 18 NA
- Combining strings
str_c("x", "y", "z")
[1] "xyz"
- Subsetting strings
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
[1] "App" "Ban" "Pea"
- Subsetting strings: negative numbers count backwards from the end
str_sub(x, -3, -1)
[1] "ple" "ana" "ear"
- Lower case transform
str_to_lower(x)
[1] "apple" "banana" "pear"
- Ordering
str_sort(x)
[1] "Apple" "Banana" "Pear"
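The str_c() call above uses the default behaviour only. As a small sketch, assuming the stringr package is loaded, the sep and collapse arguments control how pieces are glued together. The object names joined, one_string and labelled are illustrative.

```r
library(stringr)

# sep is inserted between the arguments
joined <- str_c("x", "y", "z", sep = ", ")
joined      # "x, y, z"

# collapse merges a character vector into one single string
fruits <- c("Apple", "Banana", "Pear")
one_string <- str_c(fruits, collapse = " + ")
one_string  # "Apple + Banana + Pear"

# str_c() is vectorised: the length-one argument is recycled
labelled <- str_c("fruit: ", fruits)
labelled    # "fruit: Apple" "fruit: Banana" "fruit: Pear"
```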
7.3 Matching patterns with REGular EXpressions (regex)
Regular expressions (regexps) form a very terse language that allows you to describe patterns in strings.
To learn regular expressions, we’ll use str_view() and str_view_all(). These functions take a character vector and a regular expression, and show you how they match.
With versions of stringr older than 1.5.0, you need to install the htmlwidgets package to use these functions.
Solution
library(htmlwidgets)
The most basic regular expression is the exact match.
x <- c("apple", "banana", "pear")
str_view(x, "an")
[2] │ b<an><an>a
The next step up in complexity is ., which matches any character (except a newline):
x <- c("apple", "banana", "pear")
str_view(x, ".a.")
[2] │ <ban>ana
[3] │ p<ear>
But if . matches any character, how do you match the character “.”? You need to use an “escape” to tell the regular expression you want to match it exactly, not use its special behaviour.
Like strings, regexps use the backslash, \, to escape special behaviour. So to match a ., you need the regexp \.. Unfortunately, this creates a problem.
We use strings to represent regular expressions, and \ is also used as an escape symbol in strings. So to create the regular expression \. we need the string “\\.”.
dot <- "\\."
writeLines(dot)
\.
str_view(c("abc", "a.c", "bef"), "a\\.c")
[2] │ <a.c>
If \ is used as an escape character in regular expressions, how do you match a literal \? Well, you need to escape it, creating the regular expression \\. To create that regular expression, you need to use a string, which also needs to escape \. That means to match a literal \ you need to write “\\\\”: you need four backslashes to match one!
x <- "a\\b"
writeLines(x)
a\b
str_view(x, "\\\\")
[1] │ a<\>b
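To make the two levels of escaping concrete, here is a hedged sketch that compares an unescaped dot, an escaped dot, and a plain-text match. It uses str_detect(), which returns TRUE or FALSE for each string and is covered later in this session, and fixed(), a stringr helper that switches off regexp interpretation. The object names are illustrative.

```r
library(stringr)

x <- c("abc", "a.c", "bef")

# An unescaped . matches any character, so "abc" matches as well
any_char <- str_detect(x, "a.c")
any_char       # TRUE TRUE FALSE

# The string "a\\.c" holds the regexp a\.c, which only matches a literal dot
literal_dot <- str_detect(x, "a\\.c")
literal_dot    # FALSE TRUE FALSE

# fixed() treats the pattern as plain text, so no escaping is needed
literal_fixed <- str_detect(x, fixed("a.c"))
literal_fixed  # FALSE TRUE FALSE
```

When the pattern contains no regexp feature at all, fixed() is often easier to read than a chain of backslashes.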
- Explain why each of these strings doesn’t match a \: "\", "\\", "\\\".
- How would you match the sequence "'\?
- What patterns will the regular expression \..\..\.. match? How would you represent it as a string?
Solution
- "\": would leave an open quote, as \" would be interpreted as a literal double quote.
- "\\": the first \ escapes the second one, so the string contains a single \. As a regular expression, this is an escape character with nothing left to escape, so it is not a valid pattern.
- "\\\": the first two backslashes give a literal \, then \" would again escape the quote, so we would be left with an open quote.
- We would need the following pattern: "\\\"'\\\\". Here \\\" escapes the double quote, ' doesn’t need to be escaped (because the string is defined within double quotes), and \\\\ escapes \.
- It would match a string of the form “.(anychar).(anychar).(anychar)”. As a string, the pattern is written "\\..\\..\\..":
x <- c("alf.r.e.dd.ss..lsdf.d.kj")
str_view(x, "\\..\\..\\..")
7.3.1 Anchors
Until now we searched for patterns anywhere in the target string. But we can use anchors to be more precise.
- ^ matches the start of the string.
- $ matches the end of the string.
x <- c("apple", "banana", "pear")
str_view(x, "^a")
[1] │ <a>pple
str_view(x, "a$")
[2] │ banan<a>
x <- c("apple pie", "apple", "apple cake")
str_view(x, "^apple$")
[2] │ <apple>
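The effect of each anchor can also be checked programmatically. The sketch below uses str_detect() and str_subset() (both introduced later in this session) on a small illustrative vector that is not part of the course data.

```r
library(stringr)

x <- c("apple pie", "apple", "pineapple")

# Without anchors, the pattern may match anywhere in the string
anywhere <- str_detect(x, "apple")
anywhere   # TRUE TRUE TRUE

# ^ forces the match to begin at the start of the string
at_start <- str_detect(x, "^apple")
at_start   # TRUE TRUE FALSE

# $ forces the match to finish at the end of the string
at_end <- str_detect(x, "apple$")
at_end     # FALSE TRUE TRUE

# Both anchors together: the whole string must be exactly "apple"
exact <- str_subset(x, "^apple$")
exact      # "apple"
```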
- How would you match the literal string "$^$"?
- Given the corpus of common words in stringr::words, create regular expressions that find all words that:
  - Start with “y”.
  - End with “x”.
  - Are exactly three letters long (don’t cheat by using str_length()!).
  - Have seven letters or more.
  Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.
- What is the difference between these two commands:
str_view(stringr::words, "(or|ing$)")
str_view(stringr::words, "(or|ing)$")
Solution
- We would need the pattern "\\$\\^\\$".
- Start with “y”: "^y"
- End with “x”: "x$"
- Three letters long: "^...$"
- Seven letters or more: "......."
- "(or|ing$)" matches words that either contain “or” or end with “ing”, while "(or|ing)$" matches words that end with either “or” or “ing”.
7.3.2 Character classes and alternatives
In regular expressions, we have special characters and patterns that match groups of characters.
- \d: matches any digit.
- \s: matches any whitespace (e.g. space, tab, newline).
- [abc]: matches a, b, or c.
- [^abc]: matches anything except a, b, or c.
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
[2] │ <a.c>
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
[3] │ <a*c>
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
[4] │ <a >c
You can use alternation to pick between one or more alternative patterns. For example, abc|d..f will match either “abc” or “deaf”. Note that the precedence for | is low, so that abc|xyz matches abc or xyz, not abcyz or abxyz.
Like with mathematical expressions, if alternations ever get confusing, use parentheses to make it clear what you want:
str_view(c("grey", "gray"), "gr(e|a)y")
[1] │ <grey>
[2] │ <gray>
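Because regexps are written as R strings, \d and \s have to be typed "\\d" and "\\s". The following sketch illustrates this on made-up data, and also shows that a character class and an alternation can express the same choice. The object names and the example vector are assumptions for illustration only.

```r
library(stringr)

x <- c("room 12", "lobby", "floor 3")

# The regexp \d is written "\\d" inside an R string
has_digit <- str_detect(x, "\\d")
has_digit         # TRUE FALSE TRUE

# \s followed by one or more digits; NA when nothing matches
room_number <- str_extract(x, "\\s\\d+")
room_number       # " 12" NA " 3"

# A character class and an alternation give the same result here
spellings <- c("grey", "gray", "groy")
with_class <- str_detect(spellings, "gr[ea]y")
with_alternation <- str_detect(spellings, "gr(e|a)y")
with_class        # TRUE TRUE FALSE
```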
Create regular expressions to find all words that:
- Start with a vowel.
- End with “ed”, but not with “eed”.
- End with “ing” or “ise”.
Solution
- Start with a vowel: "^[aeiouy]"
- End with “ed”, but not with “eed”: "[^e]ed$"
- End with “ing” or “ise”: "(ing|ise)$"
7.3.3 Repetition
Now that you know how to search for groups of characters you can define the number of times you want to see them.
- ?: 0 or 1
- +: 1 or more
- *: 0 or more
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
[1] │ 1888 is the longest year in Roman numerals: MD<CC><C>LXXXVIII
str_view(x, "CC+")
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
str_view(x, 'C[LX]+')
[1] │ 1888 is the longest year in Roman numerals: MDCC<CLXXX>VIII
You can also specify the number of matches precisely:
- {n}: exactly n
- {n,}: n or more
- {0,m}: at most m
- {n,m}: between n and m
str_view(x, "C{2}")
[1] │ 1888 is the longest year in Roman numerals: MD<CC>CLXXXVIII
str_view(x, "C{2,}")
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
str_view(x, "C{2,3}")
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
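The same quantifiers can be checked with str_extract(), which returns the first match as a string instead of displaying it. This is only a sketch on the sentence used above; the names on the left-hand side are illustrative.

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

# {2} takes exactly two C, even though a third one follows
exactly_two <- str_extract(x, "C{2}")
exactly_two   # "CC"

# Quantifiers are greedy: {2,3} takes three C when three are available
two_or_three <- str_extract(x, "C{2,3}")
two_or_three  # "CCC"

# + applied to a digit class extracts the whole number
year <- str_extract(x, "\\d+")
year          # "1888"

# + applied to a single letter extracts the whole run of that letter
x_run <- str_extract(x, "X+")
x_run         # "XXX"
```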
Describe in words what these regular expressions match (read carefully to see if I’m using a regular expression or a string that defines a regular expression):
- ^.*$
- "\\{.+\\}"
- \d{4}-\d{2}-\d{2}
- "\\\\{4}"
Create regular expressions to find all words that:
- Start with three consonants.
- Have three or more vowels in a row.
- Have two or more vowel-consonant pairs in a row.
- Contain only consonants (Hint: thinking about matching “not”-vowels).
Solution
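- ^.*$ (regex): matches any string in full, from its start to its end, as long as it contains no newline.
- "\\{.+\\}" (string defining a regex): matches non-empty text enclosed in curly braces.
- \d{4}-\d{2}-\d{2} (regex): matches a date in the format yyyy-mm-dd.
- "\\\\{4}" (string defining a regex): matches a \ repeated 4 times.
- Start with three consonants: "^[^aeiouy]{3}"
- Have three or more vowels in a row: "[aeiouy]{3,}"
- Have two or more vowel-consonant pairs in a row: "([aeiouy][^aeiouy]){2,}"
- Contain only consonants: "^[^aeiouy]+$"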
7.3.4 Capture group
You learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with back references, like \1, \2 etc.
str_view(fruit, "(..)\\1", match = TRUE)
[4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry
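As a small sketch of how back references behave, the example below applies two patterns to a short made-up vector with str_subset(), which keeps only the strings that match. The vector and object names are illustrative.

```r
library(stringr)

x <- c("banana", "coconut", "apple", "abba")

# (..)\1 : a pair of characters immediately followed by the same pair
repeated_pair <- str_subset(x, "(..)\\1")
repeated_pair  # "banana" "coconut"

# (.)(.)\2\1 : two characters followed by the same two in reverse order
mirrored <- str_subset(x, "(.)(.)\\2\\1")
mirrored       # "abba"
```

Note that the back reference repeats the text that was captured, not the pattern: "(..)\\1" does not match "abba", because "ab" is followed by "ba" and not by "ab" again.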
Describe, in words, what these expressions will match:
- "(.)\\1\\1"
- "(.)(.)\\2\\1"
- "(..)\\1"
- "(.).\\1.\\1"
- "(.)(.)(.).*\\3\\2\\1"
Construct regular expressions to match words that:
- Start and end with the same character.
- Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice).
- Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s).
Solution
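- "(.)\\1\\1": matches a character repeated three times in a row.
- "(.)(.)\\2\\1": matches two characters followed by the same two in reverse order (e.g. “abba”).
- "(..)\\1": matches a pair of characters repeated twice (the pair as a whole, not each character).
- "(.).\\1.\\1": matches a character repeated three times, with one arbitrary character between each repeat.
- "(.)(.)(.).*\\3\\2\\1": matches three characters, followed by any number of characters, then the same three characters in reverse order.
- Start and end with the same character: "^(.).*\\1$"
- Contain a repeated pair of letters: "([A-Za-z]{2}).*\\1"
- Contain one letter repeated in at least three places: "([A-Za-z]).*\\1.*\\1"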
7.3.5 Detect matches
x <- c("apple", "banana", "pear")
str_detect(x, "e")
[1] TRUE FALSE TRUE
How many common words start with “t”?
sum(str_detect(words, "^t"))
[1] 65
What proportion of common words ends with a vowel?
mean(str_detect(words, "[aeiouy]$"))
[1] 0.3622449
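The sum() and mean() calls above work because str_detect() returns a logical vector, and R counts TRUE as 1 and FALSE as 0 in arithmetic. Here is a minimal sketch on a three-word vector where the result can be verified by hand; the object names are illustrative.

```r
library(stringr)

x <- c("apple", "banana", "pear")

# One logical value per string
ends_with_vowel <- str_detect(x, "[aeiouy]$")
ends_with_vowel  # TRUE TRUE FALSE

# sum() counts the TRUE values
n_matches <- sum(ends_with_vowel)
n_matches        # 2

# mean() gives the proportion of TRUE values (2 out of 3)
share <- mean(ends_with_vowel)
share
```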
7.3.6 Combining detection
Find all words containing at least one vowel, and negate:
no_vowels_1 <- !str_detect(words, "[aeiouy]")
Find all words consisting only of consonants (non-vowels):
no_vowels_2 <- str_detect(words, "^[^aeiouy]+$")
identical(no_vowels_1, no_vowels_2)
[1] TRUE
7.3.7 With tibble
df <- tibble(word = words) %>% mutate(i = rank(word))
df %>% filter(str_detect(word, "x$"))
# A tibble: 4 × 2
word i
<chr> <dbl>
1 box 108
2 sex 747
3 six 772
4 tax 841
7.3.8 Extract matches
head(sentences)
[1] "The birch canoe slid on the smooth planks."
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."
[4] "These days a chicken leg is a rare dish."
[5] "Rice is often served in round bowls."
[6] "The juice of lemons makes fine punch."
We want to find all sentences that contain a colour. We first create a vector of colour names, and then turn it into a single regular expression:
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
[1] "red|orange|yellow|green|blue|purple"
We can select the sentences that contain a colour, and then extract the first colour from each sentence:
sentences %>% str_subset(colour_match) %>% str_extract(colour_match)
[1] "blue" "blue" "red" "red" "red" "blue" "yellow" "red"
[9] "red" "green" "red" "red" "blue" "red" "red" "red"
[17] "red" "blue" "red" "blue" "red" "green" "red" "red"
[25] "red" "red" "red" "red" "green" "red" "green" "red"
[33] "purple" "green" "red" "red" "red" "red" "red" "blue"
[41] "red" "blue" "red" "red" "red" "red" "green" "green"
[49] "green" "red" "red" "yellow" "red" "orange" "red" "red"
[57] "red"
We can also extract all colours from each selected sentence, as a list of vectors:
sentences %>% str_subset(colour_match) %>% str_extract_all(colour_match)
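To see the difference between the two extraction functions on something small, here is a sketch on three made-up strings (they are not taken from the sentences data). The simplify argument of str_extract_all() is part of stringr; the object names are illustrative.

```r
library(stringr)

colour_match <- "red|orange|yellow|green|blue|purple"
s <- c("a red and blue flag", "green grass", "no match here")

# str_extract() keeps only the first match, and gives NA when there is none
first_colour <- str_extract(s, colour_match)
first_colour   # "red" "green" NA

# str_extract_all() returns a list with one character vector per input string
all_colours <- str_extract_all(s, colour_match)
all_colours

# simplify = TRUE returns a character matrix padded with "" instead of a list
colour_matrix <- str_extract_all(s, colour_match, simplify = TRUE)
colour_matrix
```

The list form keeps a variable number of matches per string, while the matrix form is easier to bind to a table.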
7.3.9 Grouped matches
Imagine we want to extract nouns from the sentences. As a heuristic, we’ll look for any word that comes after “a” or “the”.
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>%
  str_extract(noun)
[1] "the smooth" "the sheet" "the depth" "a chicken" "the parked"
[6] "the sun" "the huge" "the ball" "the woman" "a helps"
str_extract() gives us the complete match; str_match() gives each individual component.
has_noun %>%
  str_match(noun)
[,1] [,2] [,3]
[1,] "the smooth" "the" "smooth"
[2,] "the sheet" "the" "sheet"
[3,] "the depth" "the" "depth"
[4,] "a chicken" "a" "chicken"
[5,] "the parked" "the" "parked"
[6,] "the sun" "the" "sun"
[7,] "the huge" "the" "huge"
[8,] "the ball" "the" "ball"
[9,] "the woman" "the" "woman"
[10,] "a helps" "a" "helps"
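Since str_match() returns a character matrix, each capturing group can be picked by its column. The sketch below uses two sentences quoted from the output of head(sentences) shown earlier; the object names m, articles and following_words are illustrative.

```r
library(stringr)

noun <- "(a|the) ([^ ]+)"
s <- c(
  "The birch canoe slid on the smooth planks.",
  "Rice is often served in round bowls."
)

# Column 1 is the complete match, columns 2 and 3 are the two groups
m <- str_match(s, noun)

# First group: the article; NA when the sentence has no match
articles <- m[, 2]
articles         # "the" NA

# Second group: the word that follows the article
following_words <- m[, 3]
following_words  # "smooth" NA
```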
Find all words that come after a number like one, two, three etc. Pull out both the number and the word.
Solution
Start by creating a vector of words defining digits:
nums <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine")
Next, create the corresponding regular expression to catch any worded digit:
nums_c <- str_c(nums, collapse = "|")
Then, construct the full regular expression, where (?<![Y])X means capture string X only if it is not preceded by string Y.
Here, X corresponds to our worded digit expression and Y is any letter (:alpha:).
This way, (?<![:alpha:])(one|two|three|four|five|six|seven|eight|nine) will match any of our digits only if it is not preceded by a letter.
We then add a blank space and [A-Za-z]+ to capture the word following our worded digit:
re_str <- str_c("(?<![:alpha:])", "(", nums_c, ")", " ", "([A-Za-z]+)", sep = "")
Let’s apply it to our sentences:
sentences %>%
  # get the subset of sentences where a match occurred
  str_subset(regex(re_str, ignore_case = TRUE)) %>%
  # for each sentence get the matched string
  str_extract_all(regex(re_str, ignore_case = TRUE)) %>%
  # convert to vector
  unlist() %>%
  # convert to tibble
  as_tibble_col(column_name = "expr") %>%
  # split matched strings into components
  tidyr::separate(
    col = "expr",
    into = c("digit", "word"),
    remove = FALSE
  )
# A tibble: 30 × 3
expr digit word
<chr> <chr> <chr>
1 Four hours Four hours
2 Two blue Two blue
3 seven books seven books
4 two met two met
5 two factors two factors
6 three lists three lists
7 Two plus Two plus
8 seven is seven is
9 two when two when
10 Eight miles Eight miles
# ℹ 20 more rows
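The role of the negative look-behind in this solution can be isolated on a tiny example. The sketch below uses three made-up strings and a reduced pattern that only looks for "one"; it is meant to show why the look-behind is needed, not to replace the solution above.

```r
library(stringr)

x <- c("one apple", "stone wall", "Take one step")

# Without the look-behind, the "one" inside "stone" is matched as well
without_lookbehind <- str_extract(x, "one [a-z]+")
without_lookbehind  # "one apple" "one wall" "one step"

# With (?<![:alpha:]), "one" is rejected when a letter comes just before it
with_lookbehind <- str_extract(x, "(?<![:alpha:])one [a-z]+")
with_lookbehind     # "one apple" NA "one step"
```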
7.3.10 Replacing matches
str_replace() and str_replace_all() allow you to replace matches with new strings. Instead of replacing with a fixed string, you can use back references to insert components of the match. In the following code, I flip the order of the second and third words.
sentences %>%
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
  head(5)
[1] "The canoe birch slid on the smooth planks."
[2] "Glue sheet the to the dark blue background."
[3] "It's to easy tell the depth of a well."
[4] "These a days chicken leg is a rare dish."
[5] "Rice often is served in round bowls."
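For comparison with the back-reference example, here is a short sketch of the simpler case, replacement with a fixed string. It shows that str_replace() only changes the first match in each string, while str_replace_all() changes every match and also accepts a named vector of several replacements. The object names are illustrative.

```r
library(stringr)

x <- c("apple", "pear", "banana")

# str_replace() replaces only the first match in each string
first_only <- str_replace(x, "[aeiou]", "-")
first_only   # "-pple" "p-ar" "b-nana"

# str_replace_all() replaces every match
every_match <- str_replace_all(x, "[aeiou]", "-")
every_match  # "-ppl-" "p--r" "b-n-n-"

# A named vector performs several replacements at once (pattern = replacement)
several <- str_replace_all("1 house and 2 cars", c("1" = "one", "2" = "two"))
several      # "one house and two cars"
```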
- Replace all forward slashes in a string with backslashes.
- Implement a simple version of str_to_lower() using str_replace_all().
- Switch the first and last letters in words. Which of those strings are still words?
Solution
- We can use the function str_replace_all() with a replacement string:
test_str <- "/test/"
writeLines(test_str)
/test/
test_str %>%
  str_replace_all(pattern = "/", replacement = "\\\\") %>%
  writeLines()
\test\
- We can also use the function str_replace_all() with a replacement function:
sentences %>%
  str_replace_all(pattern = "([A-Z])", replacement = tolower) %>%
  head(5)
[1] "the birch canoe slid on the smooth planks."
[2] "glue the sheet to the dark blue background."
[3] "it's easy to tell the depth of a well."
[4] "these days a chicken leg is a rare dish."
[5] "rice is often served in round bowls."
- The strings that are still words are the words that start and end with the same letter, plus a few other examples like “war” and “raw”:
words %>%
  str_replace(pattern = "(^.)(.*)(.$)", replacement = "\\3\\2\\1") %>%
  head(5)
[1] "a" "ebla" "tboua" "ebsoluta" "tccepa"
7.3.11 Splitting
sentences %>%
  head(5) %>%
  str_split("\\s")
[[1]]
[1] "The" "birch" "canoe" "slid" "on" "the" "smooth"
[8] "planks."
[[2]]
[1] "Glue" "the" "sheet" "to" "the"
[6] "dark" "blue" "background."
[[3]]
[1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
[[4]]
[1] "These" "days" "a" "chicken" "leg" "is" "a"
[8] "rare" "dish."
[[5]]
[1] "Rice" "is" "often" "served" "in" "round" "bowls."
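str_split() returns a list because each string can produce a different number of pieces. The sketch below, on made-up strings, shows how to take the pieces of a single string, how to get a matrix with simplify = TRUE, and how the stringr helper boundary("word") can be used instead of a regexp so that punctuation is not kept with the words. The object names are illustrative.

```r
library(stringr)

# The separator is a regexp, so a literal | must be escaped
pieces <- str_split("a|b|c", "\\|")[[1]]
pieces   # "a" "b" "c"

# simplify = TRUE returns a character matrix, padded with "" where needed
fields <- str_split(c("a b c", "d e"), " ", simplify = TRUE)
fields

# boundary("word") splits on word boundaries rather than on a pattern
words_only <- str_split("The birch canoe.", boundary("word"))[[1]]
words_only
```

Splitting on "\\s" as in the example above keeps the final full stop attached to the last word, which is why the word boundary variant can be convenient for text.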
See you in R.8: Factors