[Rd] extending strsplit(): supply pattern to keep, not to split by

Bill Dunlap bill at insightful.com
Tue Apr 4 17:54:17 CEST 2006


strsplit() is a convenient way to get a
list of items from a string when you
have a regular expression for what is not
an item.  E.g.,

   > strsplit("1.2, 34, 1.7e-2", split="[ ,] *")
   [[1]]:
   [1] "1.2"    "34"     "1.7e-2"

However, sometimes it is more convenient to
give a pattern for the items you do want.
E.g., suppose you want to pull all the numbers
out of a string which contains a mix of numbers
and words.  Making a pattern for what a
number is, is simpler than making a pattern
for what may come between the numbers.
   > number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
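
To see which strings this pattern treats as a complete number, here
is a small sketch for R.  The anchoring with ^( ... )$ and the list of
candidate strings are my own illustration, not part of the proposal.

```r
number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"

# Anchor the pattern so it must match the whole string (illustrative only).
anchored <- paste0("^(", number.pattern, ")$")

# Hypothetical test strings: the first five look like numbers, the rest do not.
candidates <- c("34", "1.2", ".5", "-1.7e-2", "2.", "e5", "abc", "1e")

is.number <- grepl(anchored, candidates)
print(data.frame(candidates, is.number))
```

The mantissa needs at least one digit, and an exponent only counts if
digits follow it, so "e5" and "1e" should be rejected while "2." and
".5" are accepted.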

I propose adding a keep=FALSE argument to
strsplit() to do this.  If keep is FALSE,
then the split argument matches the stuff to
omit from the output; if keep is TRUE then
split matches the stuff to put into the
output.  Then we could do the following to
get a list of all the numbers in a string
(done in a version of strsplit() I'm working on
for S-PLUS):

   > strsplit("1.2, 34, 1.7e-2", split=number.pattern,keep=TRUE)
   [[1]]:
   [1] "1.2"    "34"     "1.7e-2"

   > strsplit("Ibuprofin 200mg", split=number.pattern,keep=TRUE)
   [[1]]:
   [1] "200"
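
For comparison, the same behaviour can be approximated in current R
without changing strsplit().  This is only a rough sketch: the wrapper
name strsplit_keep is hypothetical, and it relies on gregexpr() and
regmatches() from base R rather than on the S-PLUS version described
above.

```r
# Hypothetical wrapper (not part of base R or S-PLUS): mimics the
# proposed keep= argument on top of existing base R functions.
strsplit_keep <- function(x, split, keep = FALSE) {
  if (!keep) {
    # Usual behaviour: 'split' describes what to drop.
    return(strsplit(x, split))
  }
  # keep = TRUE: 'split' describes what to return.
  # gregexpr() locates every match; regmatches() extracts the matched text.
  regmatches(x, gregexpr(split, x))
}

number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"

res1 <- strsplit_keep("1.2, 34, 1.7e-2", split = number.pattern, keep = TRUE)
res2 <- strsplit_keep("Ibuprofin 200mg", split = number.pattern, keep = TRUE)
res3 <- strsplit_keep("1.2, 34, 1.7e-2", split = "[ ,] *")
print(res1)
print(res2)
```

Like strsplit(), the sketch returns a list with one character vector
per input string; a string with no match gives a zero-length vector.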

Is this a reasonable thing to want strsplit to do?
Is this a reasonable parameterization of it?

----------------------------------------------------------------------------
Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and do
 not necessarily reflect Insightful Corporation policy or position."


