[Rd] Feature request: non-dropping regmatches/strextract

Michael Lawrence |@wrence@m|ch@e| @end|ng |rom gene@com
Fri Aug 30 05:44:02 CEST 2019


Just started thinking about this. The name of regmatches() suggests
that it will only extract the matches but not return anything for the
non-matches. We might need another function that returns a value for
non-matches. Perhaps the value should be the empty string for
non-matches and NA for matches to NA. The rationale is that we
delegate to regexpr() (at least conceptually), and it returns a
"matching region" which would be empty when there is no match. We
could allow strcapture() to accept an atomic vector as a prototype,
which would do what you want for regexec() (NA on no match, empty
string on empty capture). Then we could call the regexpr()-based
function strextract().

What do you think?

Michael

On Thu, Aug 29, 2019 at 3:27 PM Cyclic Group Z_1
<cyclicgroup-z1 using yahoo.com> wrote:
>
> Thank you! I greatly appreciate your consideration, though of course it is up to you. I think many people switch to stringr/stringi simply because functions in those packages have some consistent design choices, for example, they do not drop empty/missing matches, which facilitates array-based programming. For example, in the cases where one needs to make a new column in a data.frame (data.table, tibble, etc.) of regex extractions. Or in any other case where there needs to be an element-wise correspondence between input and output. I think insertion of NA_character_ to prevent dropping indices seems like the natural choice for an array language (which, I think, motivated the creation of stringr/stringi). While those are great packages and this behavior can be easily replicated with simple wrappers, string operations are normally easy to accomplish in base languages, so this seems like something that would be appropriate to have in base. For example, MATLAB and Pandas regex both allow non-dropping empty matches (though of course I acknowledge Pandas is not a base language).
>
> Best,
> CG



-- 
Michael Lawrence
Scientist, Bioinformatics and Computational Biology
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
michafla using gene.com

Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube



More information about the R-devel mailing list