[Rd] extending strsplit(): supply pattern to keep, not to split by

Gabor Grothendieck ggrothendieck at gmail.com
Thu Apr 6 15:11:57 CEST 2006


To follow up, strapply has been added to the
gsubfn package (gsubfn 0.1-1) which should make it
easier to address this problem.

It's basically just a sapply call around gsubfn which
returns the transformed matches rather than performing
substitution.  It's analogous to apply:

	apply(object, margin, function)
	strapply(object, pattern, function)

(The arguments shown above are not a complete list,
nor are they the actual arg names; they are simply
intended to show the close parallel between strapply
and apply.)
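
To make the parallel with apply concrete, here is a minimal base-R
sketch of the idea.  It is not the gsubfn implementation:
strapply_sketch is a hypothetical name used only for illustration, and
it assumes regmatches(), which is only available in R 2.14.0 and later.

```r
# Hypothetical base-R sketch of "apply a function to every match";
# NOT the gsubfn code.  Assumes R >= 2.14.0 for regmatches().
strapply_sketch <- function(x, pattern, FUN = function(m) m) {
  # gregexpr() locates every match in each element of x;
  # regmatches() pulls out the matched substrings as a list.
  matches <- regmatches(x, gregexpr(pattern, x))
  # Call FUN once per match, keeping one result vector per input string.
  lapply(matches, function(m) sapply(m, FUN, USE.NAMES = FALSE))
}

number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
x <- c("12;34:56,89,,12", "1.2, .4, 1., 1e3")

strapply_sketch(x, number.pattern)              # the matched strings
strapply_sketch(x, number.pattern, as.numeric)  # matches converted to numbers
```

As with apply, the function argument decides what is returned for each
piece; the default here simply hands back the match itself.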

The default function in strapply returns its
first argument, so for this problem we could omit
the function altogether and write:

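  library(gsubfn)  # ver 0.1-1 needed
  # number.pattern as defined earlier in this thread
  number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
  x <- c("12;34:56,89,,12", "1.2, .4, 1., 1e3")
  strapply(x, number.pattern)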

See ?strapply for more info.
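
For comparison, the keep= semantics proposed at the start of this
thread could be sketched in base R roughly as follows.  This is only
an illustration under stated assumptions: strsplit_keep is a
hypothetical wrapper, not an existing function, and it again assumes
regmatches() from R 2.14.0 or later.

```r
# Hypothetical wrapper illustrating the proposed keep= argument;
# it is not part of base R.  Assumes R >= 2.14.0 for regmatches().
strsplit_keep <- function(x, split, keep = FALSE) {
  if (keep) {
    # split describes the items to return
    regmatches(x, gregexpr(split, x))
  } else {
    # split describes the separators, as in ordinary strsplit()
    strsplit(x, split)
  }
}

number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"

strsplit_keep("1.2, 34, 1.7e-2", "[ ,] *")                     # split by separators
strsplit_keep("1.2, 34, 1.7e-2", number.pattern, keep = TRUE)  # keep the numbers
strsplit_keep("Ibuprofin 200mg", number.pattern, keep = TRUE)  # keep the numbers
```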


On 4/4/06, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> On 4/4/06, Bill Dunlap <bill at insightful.com> wrote:
> > On Tue, 4 Apr 2006, Gabor Grothendieck wrote:
> >
> > > gsubfn in package gsubfn can do this.  See the examples
> > > in ?gsubfn
> >
> > Thanks.  gsubfn looks useful, but may be overkill
> > for this, and it isn't vectorized.  To do what
>
> gsubfn is vectorized.  It's just that you are not using the output of
> gsubfn in this case.
>
> > strsplit(keep=T) would do, I think you need to do something like:
> >
> >   > findMatches<-function(strings, pattern){
> >        lapply(strings, function(string){
> >               v <- character()
> >               gsubfn(pattern, function(x,...)v<<-c(v,x), string)
> >               v})
> >     }
> >   > number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
> >   > findMatches(c("12;34:56,89,,12", "1.2, .4, 1., 1e3"), number.pattern)
> >   [[1]]
> >   [1] "12" "34" "56" "89" "12"
> >
> >   [[2]]
> >   [1] "1.2" ".4"  "1."  "1e3"
> >
> > Is this worth encapsulating in a standard R function?
>
> I will likely add a wrapper to the gsubfn package for this.
>
> > If so, is doing via an extra argument to strsplit()
> > a reasonable way to do it?
>
> My current thought was to create a strapply function to do that.
>
> >
> >   > strsplit(c("12;34:56,89,,12", "1.2, .4, 1., 1e3"), number.pattern, keep=T)
> >   [[1]]:
> >   [1] "12" "34" "56" "89" "12"
> >
> >   [[2]]:
> >   [1] "1.2" ".4"  "1."  "1e3"
> >
> >
> > > On 4/4/06, Bill Dunlap <bill at insightful.com> wrote:
> > > > strsplit() is a convenient way to get a
> > > > list of items from a string when you
> > > > have a regular expression for what is not
> > > > an item.  E.g.,
> > > >
> > > >   > strsplit("1.2, 34, 1.7e-2", split="[ ,] *")
> > > >   [[1]]:
> > > >   [1] "1.2"    "34"     "1.7e-2"
> > > >
> > > > However, sometimes it is more convenient to
> > > > give a pattern for the items you do want.
> > > > E.g., suppose you want to pull all the numbers
> > > > out of a string which contains a mix of numbers
> > > > and words.  Writing a pattern for a number
> > > > is simpler than writing a pattern
> > > > for what may come between the numbers.
> > > >   > number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
> > > >
> > > > I propose adding a keep=FALSE argument to
> > > > strsplit() to do this.  If keep is FALSE,
> > > > then the split argument matches the stuff to
> > > > omit from the output; if keep is TRUE then
> > > > split matches the stuff to put into the
> > > > output.  Then we could do the following to
> > > > get a list of all the numbers in a string
> > > > (done in a version of strsplit() I'm working on
> > > > for S-PLUS):
> > > >
> > > >   > strsplit("1.2, 34, 1.7e-2", split=number.pattern,keep=TRUE)
> > > >   [[1]]:
> > > >   [1] "1.2"    "34"     "1.7e-2"
> > > >
> > > >   > strsplit("Ibuprofin 200mg", split=number.pattern,keep=TRUE)
> > > >   [[1]]:
> > > >   [1] "200"
> > > >
> > > > Is this a reasonable thing to want strsplit to do?
> > > > Is this a reasonable parameterization of it?
> >
> > ----------------------------------------------------------------------------
> > Bill Dunlap
> > Insightful Corporation
> > bill at insightful dot com
> > 360-428-8146
> >
> >  "All statements in this message represent the opinions of the author and do
> >  not necessarily reflect Insightful Corporation policy or position."
> >
>


