
Notes on string editing

nchar(): work from the end of a string

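Sometimes you need to extract a substring counting from the end of a string rather than from the beginning. substr() only takes positions counted from the start, so a workaround is to compute those positions with nchar(), which returns the number of characters in each element of a character vector.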

Working example:

text <- "12345678"
text
## [1] "12345678"
nchar(text)
## [1] 8

To keep the last four characters:

substr(text, nchar(text) - 3, nchar(text))
## [1] "5678"
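The same arithmetic also works the other way round, to drop the last few characters instead of keeping them. A minimal sketch, where n_drop and ids are hypothetical names made up for this illustration:

```r
text <- "12345678"

# n_drop is a made-up name: how many trailing characters to remove
n_drop <- 4
trimmed <- substr(text, 1, nchar(text) - n_drop)
trimmed
## [1] "1234"

# substr() and nchar() are vectorised, so strings of
# different lengths can be handled in one call
ids <- c("12345678", "987654")
trimmed_ids <- substr(ids, 1, nchar(ids) - 2)
trimmed_ids
## [1] "123456" "9876"
```

Because the stop position is computed per element, each string loses the same number of trailing characters regardless of its length.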

Adding boundaries - “\b”, “^”, “$”

These operators pin a pattern to a word boundary (\b) or to the start (^) or end ($) of a string, which is useful when you want exact matches from grep() and gsub().

For instance:

string1 <- c("java", "javascript")

grep("java", string1)
## [1] 1 2

As you can see, without a boundary the pattern matches both elements: grep() returns the indices 1 and 2, for java and javascript. We’ll fix that by using \b.

grep("\\bjava\\b", string1)
## [1] 1

Now only index 1 (java) is returned. Note that the backslash has to be doubled: inside an R string literal, \\ stands for a single \, so the regex engine sees \bjava\b.
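The same boundary works with gsub(), which replaces matches rather than locating them. A small sketch, with the example vector extended by a third, made-up element to show a match inside a longer string:

```r
# The third element is invented for this illustration
string1 <- c("java", "javascript", "java and javascript")

# Without \b, every occurrence of "java" is replaced,
# including the one inside "javascript"
no_boundary <- gsub("java", "Java", string1)
no_boundary
## [1] "Java"                "Javascript"          "Java and Javascript"

# With \b on both sides, only the standalone word is replaced
with_boundary <- gsub("\\bjava\\b", "Java", string1)
with_boundary
## [1] "Java"                "javascript"          "Java and javascript"

# value = TRUE makes grep() return the matching strings
# instead of their indices
matched <- grep("\\bjava\\b", string1, value = TRUE)
matched
## [1] "java"                "java and javascript"
```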

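But \b has a limitation: it marks the boundary between a word character and a non-word character, and a space counts as a non-word character. A word with a space before or after it therefore still matches. See this example: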

string = c("yourtext", "yourtexts", "yourtext ", " yourtext ")

grep("\\byourtext\\b", string)
## [1] 1 3 4

The output tells us that \\byourtext\\b matched not only the 1st entry ("yourtext") but also the 3rd ("yourtext ") and 4th (" yourtext ") entries, which is not what we want when only the exact string should match.

To fix that problem, ^ and $ can be used instead. They anchor the pattern to the start (^) and end ($) of the string, so nothing may come before or after the match.

string = c("yourtext", "yourtexts", "yourtext ", " yourtext ")

grep("^yourtext$", string)
## [1] 1
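To see the difference element by element, grepl() returns one TRUE or FALSE per string instead of the matching indices. The sketch below puts the two patterns side by side. The variable names are made up, and the final comparison with == is an aside rather than part of the original notes: for a pattern with no special characters, anchoring both ends amounts to a plain equality test.

```r
string <- c("yourtext", "yourtexts", "yourtext ", " yourtext ")

# Word boundaries: surrounding spaces still count as boundaries
word_match <- grepl("\\byourtext\\b", string)
word_match
## [1]  TRUE FALSE  TRUE  TRUE

# Anchors: the whole string must be exactly "yourtext"
anchored_match <- grepl("^yourtext$", string)
anchored_match
## [1]  TRUE FALSE FALSE FALSE

# For a literal pattern like this one, plain equality gives the same answer
exact_match <- string == "yourtext"
exact_match
## [1]  TRUE FALSE FALSE FALSE
```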

See also: Extract characters from string, This post on Stack Overflow.