[R] Regular Expression help

Gabor Grothendieck ggrothendieck at gmail.com
Sat Nov 8 22:36:00 CET 2008


For the problem at hand I think I would use your solution,
which is both the most easily understood and the fastest.  On the
other hand, the tapply-based solutions are coordinate-free
(i.e. no explicit mucking with indices) and readily
generalize to more than 2 groups: just replace [^pq] with
[^pqr], say.
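
A minimal sketch of that three-group generalization, applied to the
sub/tapply variant.  The input vector x below is made up for
illustration and is not from the thread, and the sketch assumes every
string contains at least one of p, q or r:

```r
# Hypothetical input, chosen only to illustrate the idea.
x <- c("abpq", "xqrp", "zzrpq", "pqr", "aqb")

# Grouping key: the first of p, q or r to occur in each string.
# A string containing none of them is left unchanged by sub() and
# would end up in a group of its own.
key <- sub("^[^pqr]*(.).*", "\\1", x)

# Split x by that key; no index bookkeeping is needed.
groups <- tapply(x, key, c)
print(groups)
```

Here groups[["p"]], groups[["q"]] and groups[["r"]] hold the strings
whose first p/q/r character is p, q and r respectively.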

On Sat, Nov 8, 2008 at 4:21 PM, Wacek Kusnierczyk
<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
> Gabor Grothendieck wrote:
>> Here are a few more solutions.  x is the input vector
>> of character strings.
>>
>> The first is a slightly shorter version of one of Wacek's.
>> The next three all create an anonymous grouping variable
>> (using sub, substr/gsub and strapply respectively)
>> whose components are "p" and "q" and then tapply
>> is used to separate out the corresponding components
>> of x according to the grouping:
>>
>> sapply(c(p = "^[^pq]*p", q = "^[^pq]*q"), grep, x = x, value = TRUE)
>>
>> tapply(x, sub("^[^pq]*(.).*", "\\1", x), c)
>>
>> tapply(x, substr(gsub("[^pq]", "", x), 1, 1), c)
>>
>> library(gsubfn)
>> tapply(x, strapply(x, "^[^pq]*(.)", simplify = c), c)
>>
>
> wow!  cool stuff.  if you're interested in comparing their efficiency,
> source the attached script.
>
> vQ
>
> generate = function(n, m)
>        replicate(n, paste(sample(letters, m, replace=TRUE), collapse=""))
>
> tests = list(
>
>        wacek =
>        function(data) {
>                p = grep("^[^pq]*p", data)
>                list(p=data[p], q=data[-p])
>        },
>
>        gabor1 =
>        function(data)
>                sapply(c(p="^[^pq]*p", q="^[^pq]*q"), grep, x=data, value=TRUE),
>
>        gabor2 =
>        function(data)
>                tapply(data, sub("^[^pq]*(.).*", "\\1", data), c),
>
>        gabor3 =
>        function(data)
>                tapply(data, substr(gsub("[^pq]", "", data), 1, 1), c),
>
>        gabor4 =
>        { library(gsubfn); function(data)
>                tapply(data, strapply(data, "^[^pq]*(.)", simplify=c), c) }
> )
>
> data = generate(1000,10)
> lapply(names(tests),
>        function(name) {
>                cat(name, ":\n", sep="")
>                print(system.time(replicate(30,tests[[name]](data)))) } )
>
>


