[R] Split String in regex while Keeping Delimiter

Thu Apr 13 21:15:32 CEST 2023

Dear Emily,

I have written a more robust version of the function:
extract.nonLetters = function(x, rm.space = TRUE, normalize=TRUE, 
sort=TRUE) {
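     # NFC-normalize first, so that composed characters stay one unit
     if(normalize) x = stringi::stri_trans_nfc(x);
     # split every string into its individual characters
     ch = strsplit(x, "", fixed = TRUE);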
     ch = unique(unlist(ch));
     if(sort) ch = sort(ch);
     pat = if(rm.space) "^[a-zA-Z ]" else "^[a-zA-Z]";
     isLetter = grepl(pat, ch);
     ch = ch[ ! isLetter];
     return(stringi::stri_escape_unicode(ch));
}
extract.nonLetters(str)
# "\\u2013" "+"
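
As a cross-check, the code point can also be read in base R with 
utf8ToInt(). The following is only a minimal sketch; the variable names 
(ch, bytes, code, label, ch2) are mine and purely illustrative:

```r
# Illustrative only: inspect the en dash found above
ch = "\u2013";
bytes = as.numeric(charToRaw(ch));   # its UTF-8 bytes: 226 128 147
code = utf8ToInt(ch);                # the code point as integer: 8211
label = sprintf("U+%04X", code);     # formatted as "U+2013"

# The other direction: from the raw bytes back to a character
ch2 = rawToChar(as.raw(c(226, 128, 147)));
Encoding(ch2) = "UTF-8";             # mark the bytes as UTF-8
utf8ToInt(ch2);                      # again 8211
```

This assumes that the bytes are valid UTF-8, which is the case for the 
"226 128 147" reported by the earlier byte-based version.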

This code point ("\u2013", the en dash) lies inside the range 
\u2010-\u2014 used in the expanded Regex expression:
tokens = strsplit(str, "(?<=[-+\u2010-\u2014])\\s++", perl=TRUE)
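
To make the difference between the two expressions visible, here is a 
small sketch on the same example strings. The dash is written as "\u2013" 
so that it does not depend on the file encoding; the name tokens.ascii is 
only used for illustration:

```r
# Same example strings; the second one contains an en dash (U+2013)
str = c("leucocyten + gramnegatieve staven +++ grampositieve staven ++",
        "leucocyten \u2013 grampositieve coccen +");

# ASCII-only class: the second string is not split at all
tokens.ascii = strsplit(str, "(?<=[-+])\\s++", perl=TRUE);
# Expanded class: splits after "+", "-" and the dashes U+2010 to U+2014
tokens = strsplit(str, "(?<=[-+\u2010-\u2014])\\s++", perl=TRUE);

lengths(tokens.ascii)  # 3 1
lengths(tokens)        # 3 2
tokens[[2]]            # the dash stays attached to "leucocyten"
```

The look-behind only checks the character before the whitespace, so the 
delimiter itself is kept at the end of each token.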

Sincerely,

Leonard

On 4/13/2023 9:40 PM, Leonard Mada wrote:
> Dear Emily,
>
> Using a look-behind solves the split problem in this case. (Note: a 
> Regex is often the simplest solution.)
>
> str = c("leucocyten + gramnegatieve staven +++ grampositieve staven ++",
> "leucocyten – grampositieve coccen +")
>
> tokens = strsplit(str, "(?<=[-+])\\s++", perl=TRUE)
>
> PROBLEM
> The expression above does NOT split the second string, for a different 
> reason: its "-" is coded using a NON-ASCII character.
>
> I have written a small utility function to approximately extract 
> "non-standard" characters:
> ### Identify non-ASCII Characters
> # beware: the function works on bytes, so the filtering and the
> # sorting may break up the codes of multi-byte characters;
> extract.nonLetters = function(x, rm.space = TRUE, sort=FALSE) {
>     code = as.numeric(unique(unlist(lapply(x, charToRaw))));
>     isLetter =
>         (code >= 97 & code <= 122) |
>         (code >= 65 & code <= 90);
>     code = code[ ! isLetter];
>     if(rm.space) {
>         # removes only simple space!
>         code = code[code != 32];
>     }
>     if(sort) code = sort(code);
>     return(code);
> }
> extract.nonLetters(str, sort = FALSE)
> # 43 226 128 147
>
> Note:
> - the code for "+" is 43, and for the simple "-" it is 45: 
> as.numeric(charToRaw("+-"));
> - "226 128 147" is the multi-byte (UTF-8) code of some other 
> character, but it is not trivial to get the Unicode code point from it;
> https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=dec 
>
>
> The following is a more comprehensive Regex expression, which accepts 
> many variants of "-":
> tokens = strsplit(str, "(?<=[-+\u2010-\u2014])\\s++", perl=TRUE)
>
> Sincerely,
>
> Leonard
>
>