[R] Strplit code

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Thu Dec 4 14:19:48 CET 2008


Gabor Grothendieck wrote:
> R does not support tail recursion so not using it would
> not matter.
>
>   

this is a feature, i guess.

another issue is that the recursive calls receive substrings as
arguments, which means copying the whole remaining content, and the
copies would not be deallocated until after the recursive calls return,
am i right?

vQ

> On Thu, Dec 4, 2008 at 5:04 AM, Wacek Kusnierczyk
> <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
>   
>> John Fox wrote:
>>     
>>> By coincidence, I have a version of strsplit() that I've used to
>>> illustrate recursion:
>>>
>>> Strsplit <- function(x, split){
>>>     if (length(x) > 1) {
>>>         return(lapply(x, Strsplit, split))  # vectorization
>>>         }
>>>     result <- character(0)
>>>     if (nchar(x) == 0) return(result)
>>>     posn <- regexpr(split, x)
>>>     if (posn <= 0) return(x)
>>>     c(result, substring(x, 1, posn - 1),
>>>         Recall(substring(x, posn+1, nchar(x)), split))  # recursion
>>>     }
>>>
>>>
>>>       
>> well, it is both inefficient and wrong.
>>
>> inefficient because of the non-tail recursion and recursive
>> concatenation, which is justified for the sake the purpose of showing
>> recursion, but for practical purposes you'd rather use gregexepr.
>>
>> wrong because of how you pick the remaining part of the string to be
>> split -- it works just under the assumption the pattern is a single
>> character:
>>
>> Strsplit("hello-dolly,--sweet", "--")
>> # the pattern is *two* hyphens
>> # [1] "hello-dolly" "-sweet"
>>
>> Strsplit("hello dolly", "")
>> # the pattern is the empty string
>> #  [1] "" "" "" "" "" "" "" "" "" "" ""
>>
>>
>> here's a quick rewrite -- i haven't tested it on extreme cases, it may
>> not be perfect, and there's a hidden source of inefficiency here as well:
>>
>> strsplit =
>> function(strings, split) {
>>    positions = gregexpr(split, strings)
>>    lapply(1:length(strings), function(i)
>>        substring(strings[[i]], c(1, positions[[i]] +
>> attr(positions[[i]], "match.length")), c(positions[[i]]-1,
>> nchar(strings[[i]]))))
>> }
>>
>>
>> n = 1000; m = 100
>> strings = replicate(n, paste(sample(c(letters, " "), 100, replace=TRUE),
>> collapse=""))
>> system.time(replicate(m, strsplit(strings, " ")))
>> system.time(replicate(m, Strsplit(strings, " ")))
>>
>>
>> vQ



More information about the R-help mailing list