[R] Strplit code

John Fox jfox at mcmaster.ca
Thu Dec 4 13:14:06 CET 2008


Dear Wacek,

"Wrong" is a bit strong, I think -- limited to single-character patterns is
more accurate. Moreover, it isn't hard to make the function work with
multiple-character matches as well:

Strsplit <- function(x, split){
    if (length(x) > 1) {
        return(lapply(x, Strsplit, split))  # vectorization
        }
    result <- character(0)
    if (nchar(x) == 0) return(result)
    posn <- regexpr(split, x)
    if (posn <= 0) return(x)
    c(result, substring(x, 1, posn - 1), 
        Recall(substring(x, posn + attr(posn, "match.length"), 
          nchar(x)), split))  # recursion
    }
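
A quick check of the revised function on the two-hyphen example from the
quoted message below may be useful. This is only a sketch: the function is
repeated so the snippet runs on its own, and the second call (a vector
input) is an added illustration rather than something from the thread.
Note that an empty pattern is still not handled, because regexpr() then
reports a zero-length match and the recursion never advances.

```r
# Revised Strsplit(), repeated verbatim so the example is self-contained
Strsplit <- function(x, split){
    if (length(x) > 1) {
        return(lapply(x, Strsplit, split))  # vectorization
        }
    result <- character(0)
    if (nchar(x) == 0) return(result)
    posn <- regexpr(split, x)
    if (posn <= 0) return(x)
    # skip the whole match, not just one character
    c(result, substring(x, 1, posn - 1),
        Recall(substring(x, posn + attr(posn, "match.length"),
          nchar(x)), split))  # recursion
    }

# the pattern is *two* hyphens; the single hyphen is left alone
res1 <- Strsplit("hello-dolly,--sweet", "--")
print(res1)

# a vector input gives a list with one element per string
res2 <- Strsplit(c("a--b", "c"), "--")
print(res2)
```

The first call should now split only at the double hyphen, so the stray
"-" that the original version left in front of "sweet" no longer appears.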

On the other hand, your function is much more efficient.

Regards,
 John 

------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
web: socserv.mcmaster.ca/jfox


> -----Original Message-----
> From: Wacek Kusnierczyk [mailto:Waclaw.Marcin.Kusnierczyk at idi.ntnu.no]
> Sent: December-04-08 5:05 AM
> To: John Fox
> Cc: R help
> Subject: Re: [R] Strplit code
> 
> John Fox wrote:
> > By coincidence, I have a version of strsplit() that I've used to
> > illustrate recursion:
> >
> > Strsplit <- function(x, split){
> >     if (length(x) > 1) {
> >         return(lapply(x, Strsplit, split))  # vectorization
> >         }
> >     result <- character(0)
> >     if (nchar(x) == 0) return(result)
> >     posn <- regexpr(split, x)
> >     if (posn <= 0) return(x)
> >     c(result, substring(x, 1, posn - 1),
> >         Recall(substring(x, posn+1, nchar(x)), split))  # recursion
> >     }
> >
> >
> 
> well, it is both inefficient and wrong.
> 
> inefficient because of the non-tail recursion and recursive
> concatenation, which is justified for the purpose of showing
> recursion, but for practical purposes you'd rather use gregexpr.
> 
> wrong because of how you pick the remaining part of the string to be
> split -- it works just under the assumption the pattern is a single
> character:
> 
> Strsplit("hello-dolly,--sweet", "--")
> # the pattern is *two* hyphens
> # [1] "hello-dolly" "-sweet"
> 
> Strsplit("hello dolly", "")
> # the pattern is the empty string
> #  [1] "" "" "" "" "" "" "" "" "" "" ""
> 
> 
> here's a quick rewrite -- i haven't tested it on extreme cases, it may
> not be perfect, and there's a hidden source of inefficiency here as well:
> 
> strsplit =
> function(strings, split) {
>     positions = gregexpr(split, strings)
>     lapply(1:length(strings), function(i)
>         substring(strings[[i]],
>             c(1, positions[[i]] + attr(positions[[i]], "match.length")),
>             c(positions[[i]] - 1, nchar(strings[[i]]))))
> }
> 
> 
> n = 1000; m = 100
> strings = replicate(n, paste(sample(c(letters, " "), 100, replace=TRUE),
> collapse=""))
> system.time(replicate(m, strsplit(strings, " ")))
> system.time(replicate(m, Strsplit(strings, " ")))
> 
> 
> vQ


