[R] interval between specific characters in a string...

Hervé Pagès hpages.on.github using gmail.com
Sun Dec 4 22:42:13 CET 2022


On 04/12/2022 00:25, Hadley Wickham wrote:
> On Sun, Dec 4, 2022 at 12:50 PM Hervé Pagès <hpages.on.github using gmail.com> wrote:
>> On 03/12/2022 07:21, Bert Gunter wrote:
>>> Perhaps it is worth pointing out that looping constructs like lapply() can
>>> be avoided and the procedure vectorized by mimicking Martin Morgan's
>>> solution:
>>>
>>> ## s is the string to be searched.
>>> diff(c(0,grep('b',strsplit(s,'')[[1]])))
>>>
>>> However, Martin's solution is simpler and likely even faster as the regex
>>> engine is unneeded:
>>>
>>> diff(c(0, which(strsplit(s, "")[[1]] == "b"))) ## completely vectorized
>>>
>>> This seems much preferable to me.
>> Of all the proposed solutions, Andrew Hart's solution seems the most
>> efficient:
>>
>>     big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)
>>
>>     system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
>>     #    user  system elapsed
>>     #   0.736   0.028   0.764
>>
>>     system.time(diff(c(0, which(strsplit(big_string, "", fixed=TRUE)[[1]] == "b"))))
>>     #    user  system elapsed
>>     #   2.100   0.356   2.455
>>
>> The bigger the string, the bigger the gap in performance.
>>
>> Also, the bigger the average gap between 2 successive b's, the bigger
>> the gap in performance.
>>
>> Finally: always use fixed=TRUE in strsplit() if you don't need to use
>> the regex engine.
> You can do a bit better if you are willing to use stringr:
>
> library(stringr)
> big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)
>
> system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
> #>    user  system elapsed
> #>   0.126   0.002   0.128
>
> system.time(str_length(str_split(big_string, fixed("b"))[[1]]))
> #>    user  system elapsed
> #>   0.103   0.004   0.107
>
> (And my timings also suggest that it's time for Hervé to get a new computer :P)

LOL

Actually my timings were for

   big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 1500000)

but I mixed things up when I copy-pasted them into my email.

That said, I do still need a new laptop, and I'm in the process of
getting one ;-)

H.

-- 
Hervé Pagès

Bioconductor Core Team
hpages.on.github using gmail.com
