[R] Split String in regex while Keeping Delimiter

Leonard Mada |eo@m@d@ @end|ng |rom @yon|c@eu
Thu Apr 13 20:40:24 CEST 2023


Dear Emily,

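A look-behind solves the split problem in this case: the regex matches 
only the whitespace that follows a "+" or "-", so the delimiter stays 
attached to the preceding token. (Note: a regex is in many cases the 
simplest solution.)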

str = c("leucocyten + gramnegatieve staven +++ grampositieve staven ++",
"leucocyten – grampositieve coccen +")

tokens = strsplit(str, "(?<=[-+])\\s++", perl=TRUE)

PROBLEM
The expression still does NOT split the second string, but for a 
different reason: the "-" in that string is coded using a NON-ASCII 
character, which the class [-+] does not match.
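
A minimal sketch of this behaviour. It assumes the character in the data 
is the en dash and writes it as the escape \u2013, so the example does not 
depend on how the mail or the script file is encoded:

```r
# Same strings as above, with the en dash written as \u2013 (assumption)
str = c("leucocyten + gramnegatieve staven +++ grampositieve staven ++",
        "leucocyten \u2013 grampositieve coccen +")

tokens = strsplit(str, "(?<=[-+])\\s++", perl=TRUE)

# First string: split after each run of "+"; the delimiter is kept
tokens[[1]]
# "leucocyten +"  "gramnegatieve staven +++"  "grampositieve staven ++"

# Second string: [-+] does not match the en dash, so nothing is split
length(tokens[[2]])
# 1
```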

I have written a small utility function to approximately extract 
"non-standard" characters:
### Identify non-letter Characters (as byte codes)
# beware: the codes are bytes, not characters; the filtering and the
# sorting may break up the byte sequence of a multi-byte character;
extract.nonLetters = function(x, rm.space = TRUE, sort=FALSE) {
     code = as.numeric(unique(unlist(lapply(x, charToRaw))));
     isLetter =
         (code >= 97 & code <= 122) |
         (code >= 65 & code <= 90);
     code = code[ ! isLetter];
     if(rm.space) {
         # removes only simple space!
         code = code[code != 32];
     }
     if(sort) code = sort(code);
     return(code);
}
extract.nonLetters(str, sort = FALSE)
# 43 226 128 147

Note:
- the code for "+" is 43, and for a simple "-" it is 45:
  as.numeric(charToRaw("+-"));
- "226 128 147" is not 3 separate characters: it is the 3-byte UTF-8 
  sequence of one non-ASCII character (U+2013, the en dash); the bytes 
  differ from the Unicode code point, which can be looked up in a table:
https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=dec
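
As an alternative to the lookup table, base R can probably do the 
conversion itself. The following is only a sketch: it assumes the three 
bytes form one valid UTF-8 sequence, and the variable names are 
illustrative:

```r
# The three byte codes reported by extract.nonLetters
bytes = as.raw(c(226, 128, 147))

# Reassemble the bytes into a string and declare it as UTF-8
ch = rawToChar(bytes)
Encoding(ch) = "UTF-8"

# Unicode code point, as an integer and in the usual U+ notation
cp = utf8ToInt(ch)
cp
# 8211
sprintf("U+%04X", cp)
# "U+2013"
```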

The following is a more comprehensive regex, which also accepts several 
Unicode variants of "-" (U+2010 to U+2014, a range that includes the 
en dash, U+2013):
tokens = strsplit(str, "(?<=[-+\u2010-\u2014])\\s++", perl=TRUE)
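
A short check of the extended character class on the second string, again 
as a sketch that assumes the character is the en dash (written as \u2013); 
the names str2 and tokens2 are only illustrative:

```r
str2 = "leucocyten \u2013 grampositieve coccen +"

tokens2 = strsplit(str2, "(?<=[-+\u2010-\u2014])\\s++", perl=TRUE)

# The string is now split after the en dash, which stays with the first token
length(tokens2[[1]])
# 2
tokens2[[1]][2]
# "grampositieve coccen +"
```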

Sincerely,

Leonard


