[R] gregexpr slow and increases exponentially with string length --> how to speed it up?

Emmanuel Levy emmanuel.levy at gmail.com
Fri Oct 31 03:16:48 CET 2008


Hi Chuck,

Thanks a lot for your suggestion.

> You can find all such matches (not just the disjoint ones that gregexpr
> finds) using something like this:
>
>        twomatch <-function(x,y) intersect(x+1,y)
>        match.list <-
>                list(
>                        which( vec %in% c(3,6,7) ),
>                        which( vec == 2 ),
>                        which( vec %in% 1:9 ),
>                        which( vec %in% c(1,2,9) ) )
>        res <- Reduce( twomatch, match.list ) - length(match.list) + 1
>

I should have made explicit that I have many of these "motifs" to
match, and their structures vary quite a bit. This means that I'd need
a function to translate each motif into the form you proposed,
which, although feasible, would be a bit painful.
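
Such a translation might look roughly like the sketch below. It assumes
each motif can be written as a list with one vector of allowed values per
position, which would not cover motifs of variable length. The names
motif_starts and drop_overlaps are made up for illustration; twomatch and
the Reduce() step are taken from your code. drop_overlaps is one possible
way to keep only disjoint matches, scanning from the left.

```r
# Hypothetical helper: 'motif' is a list with one vector of allowed values
# per position, e.g. list(c(3, 6, 7), 2, 1:9, c(1, 2, 9)).
# Returns the start position of every match, overlapping ones included.
motif_starts <- function(vec, motif) {
  twomatch <- function(x, y) intersect(x + 1, y)
  match.list <- lapply(motif, function(allowed) which(vec %in% allowed))
  Reduce(twomatch, match.list) - length(match.list) + 1
}

# Hypothetical helper: keep only matches that do not overlap an earlier
# kept match. Assumes 'starts' is sorted and all matches have length 'width'.
drop_overlaps <- function(starts, width) {
  keep <- logical(length(starts))
  last_end <- 0
  for (i in seq_along(starts)) {
    if (starts[i] > last_end) {
      keep[i] <- TRUE
      last_end <- starts[i] + width - 1
    }
  }
  starts[keep]
}

# Usage on a vector of digits (the motif is the one from the timing test)
vec <- c(3, 2, 3, 2, 1, 1)
motif <- list(c(3, 6, 7), 2, 1:9, c(1, 2, 9))
all_starts <- motif_starts(vec, motif)
disjoint_starts <- drop_overlaps(all_starts, length(motif))
```
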
In the meantime, the best solution I found is to cut the big string
into smaller strings. That actually speeds things up a lot.
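
The splitting can be sketched roughly as below. This is a minimal sketch
and not the exact code I ran: the name chunked_gregexpr and the arguments
max_width and chunk_size are made up for illustration. Each chunk is
extended by max_width - 1 characters so that a match beginning near the
end of a chunk is not cut in half. Because the scan restarts at every
chunk, the result can differ from a single gregexpr() call wherever
matches could overlap across a chunk boundary.

```r
# Hypothetical helper: run gregexpr() on overlapping pieces of one long string.
# max_width : longest possible match in characters (4 for "[367]2[1-9][129]")
# chunk_size: number of start positions handled per piece (arbitrary choice)
chunked_gregexpr <- function(pattern, text, max_width, chunk_size = 10000L) {
  n <- nchar(text)
  if (n == 0L) return(integer(0))
  starts <- seq.int(1L, n, by = chunk_size)
  hits <- lapply(starts, function(s) {
    # Extend the piece so a match starting at its last position fits in full
    piece <- substr(text, s, min(n, s + chunk_size + max_width - 2L))
    m <- gregexpr(pattern, piece)[[1L]]
    m <- m[m > 0L]            # drop the -1 that signals "no match"
    m <- m[m <= chunk_size]   # later starts belong to the next piece
    m + s - 1L                # convert to positions in the full string
  })
  sort(unique(unlist(hits)))
}

# Usage with the motif from the timing test
set.seed(1)  # arbitrary seed so the example is repeatable
aa <- paste(sample(1:9, 50000, replace = TRUE), collapse = "")
hits <- chunked_gregexpr("[367]2[1-9][129]", aa, max_width = 4L)
```
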

Best,

E

> If you want to precisely match the gregexpr results, you'll need to filter
> out the overlapping matches.
>
> HTH,
>
> Chuck
>
>>
>> Best,
>>
>> Emmanuel
>>
>>
>> > for (i in c(10000, 50000, 100000, 500000)){
>> +   aa = as.character(sample(1:9, i, replace=TRUE))
>> +   aa = paste(aa, collapse='')
>> +   print(i)
>> +   print(system.time(gregexpr("[367]2[1-9][129]",aa)))
>> + }
>> [1] 10000
>>  user  system elapsed
>>  0.004   0.000   0.003
>> [1] 50000
>>  user  system elapsed
>>  0.060   0.000   0.061
>> [1] 1e+05
>>  user  system elapsed
>>  0.240   0.000   0.238
>> [1] 5e+05
>>  user  system elapsed
>>  5.733   0.000   5.732
>> >
>>
>
> Charles C. Berry                            (858) 534-2098
>                                             Dept of Family/Preventive Medicine
> E mailto:cberry at tajo.ucsd.edu            UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>
>
>


