[Rd] Feature request: non-dropping regmatches/strextract

Cyclic Group Z_1 cyc||cgroup-z1 @end|ng |rom y@hoo@com
Thu Aug 15 07:56:41 CEST 2019


A very common use case for regmatches is extracting regex matches into a new column of a data.frame (or data.table, etc.), or otherwise using the extracted strings alongside the input. However, regmatches drops the elements that have no match, so the result can be shorter than the input and cannot be assigned back as a column unless the input is subset first.
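
To make the length mismatch concrete, here is a minimal sketch; the input vector and the digit pattern are made up purely for illustration:

```r
x <- c("apple 12", "banana", "cherry 7")   # illustrative input, one non-match

m    <- regexpr("[0-9]+", x)   # match data: position -1 marks "no match"
hits <- regmatches(x, m)       # only the matched substrings are returned

length(x)     # 3
length(hits)  # 2, the entry for "banana" is gone

# Assigning the shorter vector as a column of a 3-row data.frame fails.
df <- data.frame(x = x)
assigned <- tryCatch({ df$num <- hits; TRUE }, error = function(e) FALSE)
assigned      # FALSE
```

Note that when the number of rows happens to be a multiple of the number of matches, the assignment recycles instead of failing, which is arguably worse because the misalignment goes unnoticed.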

For consistency with other R functions and compatibility with this use case, it would be nice if regmatches did not automatically drop non-matching elements and instead inserted an NA_character_ value in their place (similar to stringr::str_extract). This alternative behavior could be provided through an optional drop argument or a new function, or at least be described in the documentation (as is done for resample in ?sample).
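
For reference, the NA-padded result can already be obtained in base R with a little bookkeeping. The following is only a sketch of one common workaround, not an existing API; the variable names and sample data are illustrative:

```r
x <- c("apple 12", "banana", "cherry 7")   # illustrative input

m <- regexpr("[0-9]+", x)

# Start from an all-NA vector of the input's length, then fill in only the
# positions that actually matched (guarding against NA inputs as well).
matched   <- !is.na(m) & m != -1L
extracted <- rep(NA_character_, length(x))
extracted[matched] <- regmatches(x, m)

extracted   # "12" NA "7"
```

The result has the same length as the input, which is the behavior requested here.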

Alternatively, at the moment, there is a non-exported function strextract in utils which is very similar to stringr::str_extract. It would be great if this function, once exported, were to include a drop argument to prevent dropping positions with no matches. 

An example solution (last option):

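strextract <- function(pattern, x, perl = FALSE, useBytes = FALSE, drop = TRUE) {
  # regexec() yields match data for every element of x, so regmatches()
  # returns a list with one entry per element (character(0) for no match).
  m <- regexec(pattern, x, perl = perl, useBytes = useBytes)
  result <- regmatches(x, m)

  if (isTRUE(drop)) {
    # Existing behavior: unlist() silently discards the empty entries.
    unlist(result)
  } else if (isFALSE(drop)) {
    # Replace empty entries with NA so that they survive unlist().
    result[lengths(result) == 0] <- NA_character_
    unlist(result)
  } else {
    stop("Invalid argument for `drop`")
  }
}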
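
A short usage sketch of the proposal follows. The function is restated compactly only so that the snippet runs on its own; the sample vector and pattern are made up for illustration:

```r
# Compact restatement of the proposed function (same logic as above).
strextract <- function(pattern, x, perl = FALSE, useBytes = FALSE, drop = TRUE) {
  stopifnot(isTRUE(drop) || isFALSE(drop))
  result <- regmatches(x, regexec(pattern, x, perl = perl, useBytes = useBytes))
  if (!drop) result[lengths(result) == 0L] <- NA_character_
  unlist(result)
}

x <- c("apple 12", "banana", "cherry 7")    # illustrative input

dropped <- strextract("[0-9]+", x)                 # "12" "7"
kept    <- strextract("[0-9]+", x, drop = FALSE)   # "12" NA "7"

# With drop = FALSE the result lines up with the input and can be used
# directly as a new column.
df <- data.frame(x = x)
df$num <- kept
```

One caveat: regexec() also reports capture groups, so a pattern with parenthesized groups yields several strings per matching element, and the one-to-one alignment with the input only holds for patterns without capturing groups.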

Based on Ricardo Saporta's response to "How to prevent regmatches drop non matches?".

--CG


