[Rd] Feature request: non-dropping regmatches/strextract

William Dunlap wdun|@p @end|ng |rom t|bco@com
Thu Aug 15 17:08:23 CEST 2019


Changing the default behavior of regmatches would break its use with
gregexpr, where the number of matches per input element varies, so a
zero-length character vector makes more sense than NA_character_.

> x <- c("John Doe", "e e cummings", "Juan de la Madrid")
> m <- gregexpr("[A-Z]", x)
> regmatches(x,m)
[[1]]
[1] "J" "D"

[[2]]
character(0)

[[3]]
[1] "J" "M"

> vapply(.Last.value, function(x)paste(paste0(x, "."),collapse=""), "")
[1] "J.D." "."    "J.M."

(We don't want e e cummings' initials mapped to "NA.".)
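
For the regexpr case that the original request is about (at most one match per element), the NA-filled result can be built without changing the regmatches default. The following is only a minimal sketch: the helper name extract_first_keep_na is hypothetical and is not part of base R or utils.

```r
# Hypothetical helper (not in base R or utils): return the first match per
# element of x, with NA_character_ where the pattern does not match.
extract_first_keep_na <- function(pattern, x, ...) {
  m <- regexpr(pattern, x, ...)
  out <- rep(NA_character_, length(x))
  # regexpr() reports -1 for "no match" (and NA for NA input).
  matched <- !is.na(m) & as.integer(m) != -1L
  # regmatches() returns only the matched elements, in input order.
  out[matched] <- regmatches(x, m)
  out
}

x <- c("John Doe", "e e cummings", "Juan de la Madrid")
first_initial <- extract_first_keep_na("[A-Z]", x)
```

Here first_initial has the same length as x, so it can be assigned directly to a data.frame column, while regmatches combined with gregexpr keeps returning character(0) for elements without a match.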

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Thu, Aug 15, 2019 at 12:15 AM Cyclic Group Z_1 via R-devel <
r-devel using r-project.org> wrote:

> A very common use case for regmatches is to extract regex matches into a
> new column in a data.frame (or data.table, etc.) or otherwise use the
> extracted strings alongside the input. However, the default behavior is to
> drop empty matches, which results in mismatches in column length if
> reassignment is done without subsetting.
>
> For consistency with other R functions and compatibility with this use
> case, it would be nice if regmatches did not automatically drop empty
> matches and would instead insert an NA_character_ value (similar to
> stringr::str_extract). This alternative regmatches could be implemented
> through an optional drop argument, a new function, or mentioned in the
> documentation (a la resample in ?sample).
>
> Alternatively, at the moment, there is a non-exported function strextract
> in utils which is very similar to stringr::str_extract. It would be great
> if this function, once exported, were to include a drop argument to prevent
> dropping positions with no matches.
>
> An example solution (last option):
>
> strextract <- function(pattern, x, perl = FALSE, useBytes = FALSE,
>                        drop = TRUE) {
>   m <- regexec(pattern, x, perl = perl, useBytes = useBytes)
>   result <- regmatches(x, m)
>
>   if (isTRUE(drop)) {
>     # Drop elements with no match (current behavior).
>     unlist(result)
>   } else if (isFALSE(drop)) {
>     # Fill each non-matching element with NA instead of dropping it.
>     result[lengths(result) == 0] <- NA_character_
>     unlist(result)
>   } else {
>     stop("Invalid argument for `drop`")
>   }
> }
>
> Based on Ricardo Saporta's response to "How to prevent regmatches drop non
> matches?"
>
> --CG
