nchar(): extract the last few characters
substr() counts positions from the start of a string, so there is no direct way to extract a substring relative to the end. A workaround is to use nchar(), which returns the number of characters in each string, to compute the start and stop positions.
Working example:
text <- "12345678"
text
## [1] "12345678"
nchar(text)
## [1] 8
To keep the last four characters:
substr(text, nchar(text) - 3, nchar(text))
## [1] "5678"
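The same arithmetic can drop the last few characters instead of keeping them, and it carries over to whole character vectors because substr() and nchar() are both vectorized. The following is a minimal sketch; the names n and ids are made up for illustration.

```r
text <- "12345678"
n <- 4  # hypothetical: number of trailing characters to keep or drop

# Drop the last n characters by stopping n positions before the end
substr(text, 1, nchar(text) - n)
## [1] "1234"

# Keep the last n characters of every element in a vector
ids <- c("12345678", "987654")  # hypothetical example vector
substr(ids, nchar(ids) - n + 1, nchar(ids))
## [1] "5678" "7654"
```

Note that the start position is nchar() - n + 1, not nchar() - n, since both ends of the range are inclusive.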
Adding boundaries - “\b”, “^”, “$”
These operators restrict where a pattern is allowed to match, which is helpful in combination with functions such as grep() and gsub(): \b marks a word boundary, ^ the start of the string, and $ the end of the string.
For instance:
string1 <- c("java", "javascript")
grep("java", string1)
## [1] 1 2
As you can see, without a boundary the pattern matches both elements: grep() returns the indices 1 and 2, that is, both java and javascript. We’ll fix that by using \b.
grep("\\bjava\\b", string1)
## [1] 1
Now only the index of java is returned. Also note that the backslash must be doubled (\\b), because \ is itself the escape character in R strings.
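Word boundaries work the same way in gsub(). As a small sketch (the vector langs and the replacement text are invented for illustration), the pattern below replaces java only where it stands as a whole word:

```r
langs <- c("java", "javascript", "I like java")  # hypothetical input

# Without boundaries, the "java" inside "javascript" is replaced too
gsub("java", "Java", langs)
## [1] "Java"        "Javascript"  "I like Java"

# With \\b, only the standalone word is replaced
gsub("\\bjava\\b", "Java", langs)
## [1] "Java"        "javascript"  "I like Java"
```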
But \b has one limitation: a space counts as a word boundary, so \b cannot tell yourtext apart from the same word surrounded by whitespace. See this example:
string <- c("yourtext", "yourtexts", "yourtext ", " yourtext ")
grep("\\byourtext\\b", string)
## [1] 1 3 4
The output tells us that \\byourtext\\b has matched not only the 1st entry ("yourtext") but also the 3rd ("yourtext ") and 4th (" yourtext ") entries, which is not what we wanted.
To fix that problem, ^ and $ can be used instead. They anchor the pattern to the start (^) and end ($) of the string, so only an entry that consists of exactly yourtext is matched.
string <- c("yourtext", "yourtexts", "yourtext ", " yourtext ")
grep("^yourtext$", string)
## [1] 1
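A few variations on the anchored pattern may be useful. This is only a sketch using base R functions: value = TRUE returns the matching strings rather than their indices, grepl() returns a logical vector, and trimws() strips surrounding whitespace first in case padded entries should count as matches after all.

```r
string <- c("yourtext", "yourtexts", "yourtext ", " yourtext ")

# Return the matching elements instead of their positions
grep("^yourtext$", string, value = TRUE)
## [1] "yourtext"

# Logical vector, convenient for subsetting or filtering
grepl("^yourtext$", string)
## [1]  TRUE FALSE FALSE FALSE

# Trim leading and trailing whitespace before matching
grep("^yourtext$", trimws(string))
## [1] 1 3 4
```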
See also: Extract characters from string, This post on Stack Overflow.