Regular Expressions: Look Arounds
Regex Lookarounds in R
Regular expressions (i.e. “regex”) are extremely useful for cleaning data and parsing strings. For more information on regular expressions in general, check out https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf. For this post, we’re going to focus on using lookarounds in regex. Lookarounds help simplify complicated regular expressions and make it easier to identify what we want to find in a string. Lookarounds are assertions: they check whether a pattern matches or doesn’t match at a given position, and the text they check is not included in the match itself.
There are two types of lookarounds: lookaheads and lookbehinds. Each type has a positive and a negative variation. The syntax for all four is as follows:
Type | Syntax |
---|---|
Positive Lookahead | (?=pattern) |
Negative Lookahead | (?!pattern) |
Positive Lookbehind | (?<=pattern) |
Negative Lookbehind | (?<!pattern) |
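Because lookarounds are assertions, the text they check does not end up in the match. Here is a minimal sketch of that behavior, assuming the stringr package is installed; the input string and the variable name x are made up for illustration.

```r
library(stringr)

# Hypothetical input string, invented for this sketch
x <- "total: $250"

# Without a lookaround, the dollar sign is part of the match
str_extract(x, "\\$\\d+")
## [1] "$250"

# With a positive lookbehind, the dollar sign must be there,
# but it is not returned as part of the match
str_extract(x, "(?<=\\$)\\d+")
## [1] "250"
```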
Lookaheads
First, we’ll start with lookaheads. A positive lookahead matches at a position where the pattern inside the lookahead can be matched immediately after that position. A negative lookahead does the opposite: it matches at a position where the pattern inside the lookahead cannot be matched.
Positive lookaheads are denoted by (?=pattern) and negative lookaheads are denoted by (?!pattern). For example, the positive lookahead t(?=s) matches the second t in the word “streets” since the second t is followed by an s whereas the first t is not. Similarly, the negative lookahead t(?!s) matches the first t in “streets” since it is not followed by an s.
library(stringr)
# Positive Lookahead:
str_extract("streets", "t(?=s)")
## [1] "t"
str_locate("streets","t(?=s)")[1]
## [1] 6
#This returns the second t in streets located at position 6 since this t is followed by an s
#Negative Lookahead:
str_extract("streets","t(?!s)")
## [1] "t"
str_locate("streets","t(?!s)")[1]
## [1] 2
#This returns the first t in streets located at position 2 since this t is not followed by an s
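One practical consequence shows up when we replace text. In the sketch below, which assumes stringr is installed, only the t is swapped out and the character the lookahead checked is left alone. The uppercase T is just an arbitrary replacement chosen to make the change visible.

```r
library(stringr)

# Replace only the t that is followed by an s; the s itself is untouched
str_replace_all("streets", "t(?=s)", "T")
## [1] "streeTs"

# Replace only the t that is not followed by an s
str_replace_all("streets", "t(?!s)", "T")
## [1] "sTreets"
```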
Lookbehinds
A positive lookbehind matches at a position if the pattern inside the lookbehind can be matched directly before that position. In our example, (?<=s)t matches the first t in “streets” since there is an s directly before it.
Similarly, a negative lookbehind tells the regex engine to look backwards in the string and matches only if the pattern inside the lookbehind is not found directly before the current position. In our example, (?<!s)t matches the second t since it does not have an s directly before it.
#Positive Lookbehind:
str_extract("streets", "(?<=s)t")
## [1] "t"
str_locate("streets","(?<=s)t")[1]
## [1] 2
#This returns the first t in streets located at position 2 since this t follows an s
#Negative Lookbehind:
str_extract("streets","(?<!s)t")
## [1] "t"
str_locate("streets","(?<!s)t")[1]
## [1] 6
#This returns the second t in streets located at position 6 since this t does not follow an s
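Lookbehinds and lookaheads can also be combined in one pattern. As a rough sketch (the input string and the name x are hypothetical, and stringr is assumed to be installed), the pattern below pulls out digits that sit between square brackets without returning the brackets themselves.

```r
library(stringr)

# Hypothetical input string, invented for this sketch
x <- "order [4521] shipped"

# (?<=\\[) requires a "[" directly before the digits,
# (?=\\]) requires a "]" directly after them
str_extract(x, "(?<=\\[)\\d+(?=\\])")
## [1] "4521"
```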