[Rd] extending strsplit(): supply pattern to keep, not to split by

Bill Dunlap bill at insightful.com
Tue Apr 4 17:54:17 CEST 2006

strsplit() is a convenient way to get a
list of items from a string when you
have a regular expression for what is not
an item.  E.g.,

   > strsplit("1.2, 34, 1.7e-2", split="[ ,] *")
   [1] "1.2"    "34"     "1.7e-2"

However, sometimes it is more convenient to
give a pattern for the items you do want.
E.g., suppose you want to pull all the numbers
out of a string which contains a mix of numbers
and words.  Writing a pattern for a number
is simpler than writing a pattern for
what may come between the numbers:
   > number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
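
As a quick sanity check of what this pattern accepts, here is a
small sketch in R.  The sample inputs and the names candidates,
anchored and is.number are made up for illustration; only
number.pattern comes from this message.  The pattern is anchored
so that it must match the whole string.

```r
# Pattern copied from the message; backslashes are doubled in an R string.
number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"

# Hypothetical sample inputs, chosen only for illustration.
candidates <- c("1.2", "34", "1.7e-2", ".5", "-3E+4", "mg")

# Wrap the pattern in ^( ... )$ so a partial match does not count.
anchored <- paste0("^(", number.pattern, ")$")

# TRUE for each candidate that is a number in the sense of the pattern.
is.number <- grepl(anchored, candidates)
print(is.number)
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
```

Note that the pattern requires at least one digit, so it can never
match the empty string.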

I propose adding a keep=FALSE argument to
strsplit() to do this.  If keep is FALSE,
then the split argument matches the stuff to
omit from the output; if keep is TRUE then
split matches the stuff to put into the
output.  Then we could do the following to
get a list of all the numbers in a string
(done in a version of strsplit() I'm working on
for S-PLUS):

   > strsplit("1.2, 34, 1.7e-2", split=number.pattern,keep=TRUE)
   [1] "1.2"    "34"     "1.7e-2"

   > strsplit("Ibuprofin 200mg", split=number.pattern,keep=TRUE)
   [1] "200"
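
For comparison, the keep=TRUE behaviour can be approximated in
current R with gregexpr() and regmatches().  This is only a sketch
of the idea, not the S-PLUS implementation mentioned above; the
wrapper name strsplit_keep is hypothetical and is not part of base
R or S-PLUS.

```r
number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"

# Hypothetical wrapper illustrating the proposed 'keep' argument.
strsplit_keep <- function(x, split, keep = FALSE) {
  if (!keep) {
    # Usual behaviour: 'split' matches the text to omit.
    return(strsplit(x, split))
  }
  # keep = TRUE: 'split' matches the text to return.
  # gregexpr() locates every match; regmatches() extracts the matched text.
  regmatches(x, gregexpr(split, x))
}

# Like strsplit(), the wrapper returns a list with one element per input string.
nums1 <- strsplit_keep("1.2, 34, 1.7e-2", number.pattern, keep = TRUE)
nums2 <- strsplit_keep("Ibuprofin 200mg", number.pattern, keep = TRUE)
print(nums1[[1]])
# [1] "1.2"    "34"     "1.7e-2"
print(nums2[[1]])
# [1] "200"
```

With keep=FALSE the wrapper simply falls through to strsplit(), so
existing calls would behave as before.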

Is this a reasonable thing to want strsplit to do?
Is this a reasonable parameterization of it?

Bill Dunlap
Insightful Corporation
bill at insightful dot com

 "All statements in this message represent the opinions of the author and do
 not necessarily reflect Insightful Corporation policy or position."
