R.7: String & RegExp

Laurent Modolo laurent.modolo@ens-lyon.fr

2022

https://can.gitbiopages.ens-lyon.fr/R_basis/

1 Introduction

In the previous session, we often overlooked a particular type of data: the string. In R, a sequence of characters is stored as a string.

In this session you will learn the distinctive features of the string type, and how to manipulate strings of characters within a programming language that is itself written as strings of characters (function names, variable names).

As usual we will need the tidyverse library.

Solution

library(tidyverse)

2 String basics

2.1 String definition

A string can be defined within double quotes (") or single quotes (').

string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'

If you forget to close a quote, you’ll see +, the continuation character:

> "This is a string without a closing quote
+ 
+ 
+ HELP I'M STUCK

If this happens to you, press Escape and try again!

To include a literal single or double quote in a string you can use \ to escape it:

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

If you want to include a literal backslash, you’ll need to double it up: "\\".

2.2 String representation

The printed representation of a string is not the same as the string itself:

x <- c("\"", "\\")
x
[1] "\"" "\\"
writeLines(x)
"
\

Some characters have a special representation; they are called special characters. The most common are "\n" (newline) and "\t" (tab), but you can see the complete list by requesting help on ": ?'"'
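As a small illustration of the difference between a string and its printed form, the sketch below builds a string containing one tab and one newline. The variable names are made up for this example, and only base R is used.

```r
# A string with one tab and one newline (name chosen for illustration)
special <- "a\tb\nc"

# Each escape sequence is stored as a single character
n_chars <- nchar(special)

# print() shows the escaped representation, writeLines() the raw content
print(special)
writeLines(special)
```

Here print() displays the escapes as typed, while writeLines() actually inserts the tab and the line break.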

2.3 String operation

You can perform basic operations on strings, such as:

  • String length
str_length(c("a", "R for data science", NA))
[1]  1 18 NA
  • Combining strings
str_c("x", "y", "z")
[1] "xyz"
  • Subsetting strings
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
[1] "App" "Ban" "Pea"
  • Subsetting strings: negative numbers count backwards from the end
str_sub(x, -3, -1)
[1] "ple" "ana" "ear"
  • Lower case transform
str_to_lower(x)
[1] "apple"  "banana" "pear"  
  • Ordering
str_sort(x)
[1] "Apple"  "Banana" "Pear"  
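The operations above can be combined. The sketch below, which assumes the stringr package is installed and uses variable names invented for illustration, shows the sep and collapse arguments of str_c() and a negative position in str_sub().

```r
library(stringr)

fruits <- c("Apple", "Banana", "Pear")

# sep is placed between the arguments, element by element
labelled <- str_c("fruit", fruits, sep = ": ")

# collapse merges a whole vector into a single string
one_string <- str_c(fruits, collapse = ", ")

# Negative positions count from the end: last letter of each word
last_letter <- str_sub(fruits, -1, -1)
```

Note that sep works across arguments while collapse works along a vector, so the first call returns three strings and the second returns one.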

3 Matching patterns with regular expressions

Regexps are a very terse language that allows you to describe patterns in strings.

To learn regular expressions, we’ll use str_view() and str_view_all(). These functions take a character vector and a regular expression, and show you how they match.

You need the htmlwidgets package installed to use these functions

Solution

library(htmlwidgets)

The most basic regular expression is the exact match.

x <- c("apple", "banana", "pear")
str_view(x, "an")

The next step up in complexity is ., which matches any character (except a newline):

x <- c("apple", "banana", "pear")
str_view(x, ".a.")

But if “.” matches any character, how do you match the character “.”? You need to use an “escape” to tell the regular expression you want to match it exactly, not use its special behavior.

Like strings, regexps use the backslash, \, to escape special behaviour. So to match a ., you need the regexp \.. Unfortunately this creates a problem.

We use strings to represent regular expressions, and \ is also used as an escape symbol in strings. So to create the regular expression \. we need the string “\\.”.

dot <- "\\."
writeLines(dot)
\.
str_view(c("abc", "a.c", "bef"), "a\\.c")

If \ is used as an escape character in regular expressions, how do you match a literal \? Well, you need to escape it, creating the regular expression \\. To create that regular expression, you need to use a string, which also needs to escape \. That means to match a literal \ you need to write “\\\\” — you need four backslashes to match one!

x <- "a\\b"
writeLines(x)
a\b
str_view(x, "\\\\")
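The two levels of escaping can also be checked without the viewer. The following sketch uses str_detect() (presented later in this session) on the same vectors; the variable names are chosen for illustration and stringr is assumed to be installed.

```r
library(stringr)

targets <- c("abc", "a.c", "bef")

# Unescaped dot: any single character between a and c
any_char <- str_detect(targets, "a.c")

# Escaped dot: only a literal "." between a and c
literal_dot <- str_detect(targets, "a\\.c")

# The string "a\\b" holds three characters: a, \ and b
with_backslash <- "a\\b"
has_backslash <- str_detect(with_backslash, "\\\\")
```

The unescaped pattern matches both "abc" and "a.c", whereas the escaped one matches only "a.c".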

3.1 Exercises

  • Explain why each of these strings doesn’t match a \: “\”, “\\”, “\\\”.
  • How would you match the sequence "'\?
  • What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

3.2 Anchors

Until now we searched for patterns anywhere in the target string. But we can use anchors to be more precise.

  • ^ Match the start of the string.
  • $ Match the end of the string.
x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")
x <- c("apple pie", "apple", "apple cake")
str_view(x, "^apple$")
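To see which elements an anchored pattern keeps, one option is str_subset(), which returns the matching strings. This is only a sketch with illustrative variable names, assuming stringr is installed.

```r
library(stringr)

fruits <- c("apple", "banana", "pear")

# ^ anchors the pattern at the start, $ at the end
starts_with_a <- str_subset(fruits, "^a")
ends_with_a <- str_subset(fruits, "a$")

# Using both anchors forces the whole string to match
desserts <- c("apple pie", "apple", "apple cake")
exactly_apple <- str_subset(desserts, "^apple$")
```

With both anchors, "apple pie" and "apple cake" are rejected because they contain extra characters after "apple".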

3.3 Exercises

  • How would you match the literal string "$^$"?
  • Given the corpus of common words in stringr::words, create regular expressions that find all words that:
    • Start with “y”.
    • End with “x”.
    • Are exactly three letters long. (Don’t cheat by using str_length()!)
    • Have seven letters or more.

Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.

3.4 Character classes and alternatives

In regular expressions, some special characters and patterns match groups of characters.

  • \d: matches any digit.
  • \s: matches any whitespace (e.g. space, tab, newline).
  • [abc]: matches a, b, or c.
  • [^abc]: matches anything except a, b, or c.
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
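The classes \d and \s contain a backslash, so in an R string they are written "\\d" and "\\s". The sketch below, with an illustrative vector and stringr assumed installed, checks them together with a negated class.

```r
library(stringr)

samples <- c("abc", "a1c", "a c", "a.c")

# \d: a digit between a and c
has_digit <- str_detect(samples, "a\\dc")

# \s: a whitespace character between a and c
has_space <- str_detect(samples, "a\\sc")

# [^b]: any single character except b between a and c
not_b <- str_detect(samples, "a[^b]c")
```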

You can use alternation to pick between one or more alternative patterns. For example, abc|d..f will match either abc or deaf. Note that the precedence of | is low, so abc|xyz matches abc or xyz, not abcyz or abxyz. As with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:

str_view(c("grey", "gray"), "gr(e|a)y")
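The effect of parentheses on alternation can be verified on a handful of made-up strings. This is a sketch assuming stringr is installed; the anchors are added only so that each string has to match in full.

```r
library(stringr)

candidates <- c("abc", "xyz", "abxyz", "abcyz")

# | applies to the whole alternatives: abc or xyz
whole_alt <- str_detect(candidates, "^(abc|xyz)$")

# Parentheses restrict | to one position: ab, then c or x, then yz
inner_alt <- str_detect(candidates, "^ab(c|x)yz$")
```

The first pattern keeps only "abc" and "xyz", the second only "abxyz" and "abcyz".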

3.5 Exercises

Create regular expressions to find all words that:

  • Start with a vowel.
  • Contain only consonants. (Hint: think about matching “not”-vowels.)
  • End with ed, but not with eed.
  • End with ing or ise.

3.6 Repetition

Now that you know how to search for groups of characters you can define the number of times you want to see them.

  • ?: 0 or 1
  • +: 1 or more
  • *: 0 or more
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, 'C[LX]+')

You can also specify the number of matches precisely:

  • {n}: exactly n
  • {n,}: n or more
  • {,m}: at most m (if your regular expression engine rejects this form, write {0,m} instead)
  • {n,m}: between n and m
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
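To read the matched text directly rather than in the viewer, the same patterns can be passed to str_extract() (presented later in this session), which returns the first match. The sketch below assumes stringr is installed and renames the vector for illustration.

```r
library(stringr)

year <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

# C followed by an optional second C
optional_c <- str_extract(year, "CC?")

# C followed by one or more C
one_or_more <- str_extract(year, "CC+")

# C followed by one or more L or X
c_then_lx <- str_extract(year, "C[LX]+")

# Exactly two C, then between two and three C
exactly_two <- str_extract(year, "C{2}")
two_to_three <- str_extract(year, "C{2,3}")
```

By default each quantifier takes as many characters as it can, which is why "C{2,3}" returns three C rather than two.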

3.7 Exercises

  • Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)
    • ^.*$
    • "\\{.+\\}"
    • \d{4}-\d{2}-\d{2}
    • "\\\\{4}"
  • Create regular expressions to find all words that:
    • Start with three consonants.
    • Have three or more vowels in a row.
    • Have two or more vowel-consonant pairs in a row.

3.8 Grouping

You learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with back references, like \1, \2 etc.

str_view(fruit, "(..)\\1", match = TRUE)
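The back reference can be tried on a few hand-picked words instead of the whole fruit vector. This is a sketch assuming stringr is installed; str_subset() and str_extract() are presented later in this session.

```r
library(stringr)

some_words <- c("banana", "coconut", "apple")

# (..) captures two characters, \1 requires the same two again
repeated_pair <- str_subset(some_words, "(..)\\1")

# The matched text is the pair written twice
matched_text <- str_extract(repeated_pair, "(..)\\1")
```

"apple" is dropped because "pp" is a single repeated letter, not a repeated pair of characters.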

3.9 Exercises

  • Describe, in words, what these expressions will match:
    • "(.)\1\1"
    • "(.)(.)\\2\\1"
    • "(..)\1"
    • "(.).\\1.\\1"
    • "(.)(.)(.).*\\3\\2\\1"
  • Construct regular expressions to match words that:
    • Start and end with the same character.
    • Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
    • Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

3.10 Detect matches

x <- c("apple", "banana", "pear")
str_detect(x, "e")
[1]  TRUE FALSE  TRUE

How many common words start with t?

sum(str_detect(words, "^t"))
[1] 65

What proportion of common words ends with a vowel?

mean(str_detect(words, "[aeiou]$"))
[1] 0.2765306
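sum() and mean() work here because str_detect() returns a logical vector, and R counts TRUE as 1 and FALSE as 0. A minimal sketch on a made-up vector (stringr assumed installed):

```r
library(stringr)

small_words <- c("tree", "apple", "tea")

# Logical vector: one TRUE/FALSE per word
starts_with_t <- str_detect(small_words, "^t")

# TRUE counts as 1, so sum() gives a count and mean() a proportion
n_matches <- sum(starts_with_t)
prop_matches <- mean(starts_with_t)
```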

3.11 Combining detection

Find all words containing at least one vowel, and negate

no_vowels_1 <- !str_detect(words, "[aeiou]")

Find all words consisting only of consonants (non-vowels)

no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
[1] TRUE
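The equivalence of the two approaches can be inspected element by element on a short, made-up vector. This sketch assumes stringr is installed and, like the patterns above, treats "y" as a consonant.

```r
library(stringr)

small_words <- c("by", "dry", "apple", "sky", "tree")

# Approach 1: detect a vowel anywhere, then negate
negated <- !str_detect(small_words, "[aeiou]")

# Approach 2: require every character to be a non-vowel
only_consonants <- str_detect(small_words, "^[^aeiou]+$")

same_result <- identical(negated, only_consonants)
```

The negated form is usually easier to read, which is a reason to prefer combining simple detections over writing one complex pattern.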

3.12 With tibble

df <- tibble(
  word = words, 
  i = seq_along(word)
)
df %>% 
  filter(str_detect(word, "x$"))
# A tibble: 4 × 2
  word      i
  <chr> <int>
1 box     108
2 sex     747
3 six     772
4 tax     841

3.13 Extract matches

head(sentences)
[1] "The birch canoe slid on the smooth planks." 
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."     
[4] "These days a chicken leg is a rare dish."   
[5] "Rice is often served in round bowls."       
[6] "The juice of lemons makes fine punch."      

We want to find all sentences that contain a colour. We first create a vector of colour names, and then turn it into a single regular expression:

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
[1] "red|orange|yellow|green|blue|purple"

3.14 Extract matches

We can select the sentences that contain a colour, and then extract the colour to figure out which one it is:

has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
[1] "blue" "blue" "red"  "red"  "red"  "blue"
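Two details are worth checking on a single example: str_extract() returns only the first match, and the pattern also matches colour names hidden inside longer words. The sketch below uses an invented sentence and assumes stringr is installed; str_extract_all() and the \b word boundary are not covered above and are shown here only as one possible refinement.

```r
library(stringr)

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")

# Invented sentence: "flickered" contains the letters "red"
sentence <- "The barn flickered orange under a blue sky"

# First match only
first_match <- str_extract(sentence, colour_match)

# All matches, returned as a list with one element per input string
all_matches <- str_extract_all(sentence, colour_match)[[1]]

# \b restricts matches to whole words
bounded_match <- str_c("\\b(", colour_match, ")\\b")
whole_words <- str_extract_all(sentence, bounded_match)[[1]]
```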

3.15 Grouped matches

Imagine we want to extract nouns from the sentences. As a heuristic, we’ll look for any word that comes after “a” or “the”.

noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>% 
  str_extract(noun)
 [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked"
 [6] "the sun"    "the huge"   "the ball"   "the woman"  "a helps"   

str_extract() gives us the complete match; str_match() gives each individual component.

has_noun %>% 
  str_match(noun)
      [,1]         [,2]  [,3]     
 [1,] "the smooth" "the" "smooth" 
 [2,] "the sheet"  "the" "sheet"  
 [3,] "the depth"  "the" "depth"  
 [4,] "a chicken"  "a"   "chicken"
 [5,] "the parked" "the" "parked" 
 [6,] "the sun"    "the" "sun"    
 [7,] "the huge"   "the" "huge"   
 [8,] "the ball"   "the" "ball"   
 [9,] "the woman"  "the" "woman"  
[10,] "a helps"    "a"   "helps"  
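The columns of the str_match() result can be accessed like those of any matrix: column 1 is the complete match and the following columns are the capturing groups, in order. A minimal sketch on an invented sentence, assuming stringr is installed:

```r
library(stringr)

noun <- "(a|the) ([^ ]+)"

# One input string gives a matrix with one row
parts <- str_match("the cat sat on a mat", noun)

full_match <- parts[1, 1]  # complete match
article <- parts[1, 2]     # first group: (a|the)
word_after <- parts[1, 3]  # second group: ([^ ]+)
```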

3.16 Exercises

  • Find all words that come after a number like one, two, three etc. Pull out both the number and the word.

3.17 Replacing matches

str_replace() and str_replace_all() replace matches with new strings. Instead of replacing with a fixed string, you can use back references to insert components of the match. In the following code, I flip the order of the second and third words.

sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
  head(5)
[1] "The canoe birch slid on the smooth planks." 
[2] "Glue sheet the to the dark blue background."
[3] "It's to easy tell the depth of a well."     
[4] "These a days chicken leg is a rare dish."   
[5] "Rice often is served in round bowls."       
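For comparison, the sketch below shows a replacement with a fixed string, the difference between str_replace() (first match only) and str_replace_all() (every match), and a back reference on a two-word string. The inputs are invented for illustration and stringr is assumed to be installed.

```r
library(stringr)

# Fixed replacement: first match only, then every match
first_only <- str_replace("apple", "[aeiou]", "-")
every_match <- str_replace_all("apple", "[aeiou]", "-")

# Back references: swap the two captured words
swapped <- str_replace("birch canoe", "([^ ]+) ([^ ]+)", "\\2 \\1")
```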

3.18 Exercises

  • Replace all forward slashes in a string with backslashes.
  • Implement a simple version of str_to_lower() using str_replace_all().
  • Switch the first and last letters in words. Which of those strings are still words?

3.19 Splitting

sentences %>%
  head(5) %>% 
  str_split("\\s")
[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
[8] "planks."

[[2]]
[1] "Glue"        "the"         "sheet"       "to"          "the"        
[6] "dark"        "blue"        "background."

[[3]]
[1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."

[[4]]
[1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
[8] "rare"    "dish."  

[[5]]
[1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."
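Splitting on whitespace keeps the punctuation attached to the last word, as in "bowls." above. One possible alternative, not covered above and shown here only as a sketch, is to split on word boundaries with stringr's boundary() helper. The input sentence is invented and stringr is assumed to be installed.

```r
library(stringr)

sentence <- "Rice is often served."

# str_split() returns a list; [[1]] takes the pieces of the first string
by_space <- str_split(sentence, "\\s")[[1]]

# Splitting on word boundaries drops spaces and punctuation
by_word <- str_split(sentence, boundary("word"))[[1]]
```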

3.20 See you in R.8: Factors