[R] Frequency of a character in a string

Mon Nov 14 21:26:50 CET 2016

Hi,

FWIW using gsub( , fixed=TRUE) is faster than using gsub( , fixed=FALSE)
or strsplit( , fixed=TRUE):

   set.seed(1)
   Vec <- paste(sample(letters, 5000000, replace = TRUE), collapse = "")

   system.time(res1 <- nchar(gsub("[^a]", "", Vec)))
   #  user  system elapsed
   # 0.585   0.000   0.586

   system.time(res2 <- lengths(strsplit(Vec,"a",fixed=TRUE)) - 1L)
   #  user  system elapsed
   # 0.061   0.000   0.061

   system.time(res3 <- nchar(Vec) - nchar(gsub("a", "", Vec, fixed=TRUE)))
   #  user  system elapsed
   # 0.039   0.000   0.039

   identical(res1, res2)
   # [1] TRUE
   identical(res1, res3)
   # [1] TRUE
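One caveat on the strsplit() idiom above: strsplit() does not return a trailing empty string, so when the input ends in the character being counted, lengths(...) - 1L comes out one short. The results agree here presumably because this particular Vec does not end in "a". Padding both ends, as in Chuck's version quoted below, avoids this, and the gsub( , fixed=TRUE) difference is unaffected. A small sketch with a made-up test string (the variable names are only for illustration):

```r
x <- "banana"   # ends in "a" and contains three of them

# Unpadded split: the trailing empty piece is dropped, so the count is short
naive <- lengths(strsplit(x, "a", fixed = TRUE)) - 1L

# Padding both ends guarantees a non-empty piece after the last match
padded <- lengths(strsplit(paste0("X", x, "X"), "a", fixed = TRUE)) - 1L

# Length difference after deleting every "a"
via_gsub <- nchar(x) - nchar(gsub("a", "", x, fixed = TRUE))

c(naive = naive, padded = padded, via_gsub = via_gsub)
#    naive   padded via_gsub
#        2        3        3
```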

The gsub( , fixed=TRUE) solution also uses slightly less memory than the
strsplit( , fixed=TRUE) solution.
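For reuse, the gsub( , fixed=TRUE) approach above can be wrapped in a small helper. This is only a sketch: the name count_char and the argument checks are my own additions for illustration, and it assumes a single literal character is being counted.

```r
# Hypothetical helper: count occurrences of one literal character `ch`
# in each element of the character vector `x`.
count_char <- function(x, ch) {
  stopifnot(is.character(x),
            is.character(ch), length(ch) == 1L, nchar(ch) == 1L)
  # Delete every occurrence of `ch`, then compare lengths
  nchar(x) - nchar(gsub(ch, "", x, fixed = TRUE))
}

count_char(c("banana", "cherry", ""), "a")
# [1] 3 0 0
```

Because nchar() and gsub() are both vectorized, the helper works on a whole character vector at once, and fixed = TRUE means a character such as "." is matched literally rather than as a regex.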

Cheers,
H.

On 11/14/2016 11:55 AM, Charles C. Berry wrote:
> On Mon, 14 Nov 2016, Marc Schwartz wrote:
>
>>
>>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
>>>
>>> On Mon, 14 Nov 2016, Bert Gunter wrote:
>>>
> [stuff deleted]
>
>> Hi,
>>
>> Both gsub() and strsplit() are using regex based pattern matching
>> internally. That being said, they are ultimately calling .Internal
>> code, so both are pretty fast.
>>
>> For comparison:
>>
>> ## Create a single string of 1,000,000 characters
>> set.seed(1)
>> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
>>
>>> nchar(Vec)
>> [1] 1000000
>>
>> ## Split the string into single characters and tabulate
>>> table(strsplit(Vec, split = "")[[1]])
>>
>>    a     b     c     d     e     f     g     h     i     j     k     l
>> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
>>    m     n     o     p     q     r     s     t     u     v     w     x
>> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
>>    y     z
>> 38265 38299
>>
>>
>> ## Get just the count of "a"
>>> table(strsplit(Vec, split = "")[[1]])["a"]
>>    a
>> 38664
>>
>>> nchar(gsub("[^a]", "", Vec))
>> [1] 38664
>>
>>
>> ## Check performance
>>> system.time(table(strsplit(Vec, split = "")[[1]])["a"])
>>   user  system elapsed
>>  0.100   0.007   0.107
>>
>>> system.time(nchar(gsub("[^a]", "", Vec)))
>>   user  system elapsed
>>  0.270   0.001   0.272
>>
>>
>> So, the above suggests that strsplit() is somewhat faster than
>> gsub() here. However, as Chuck notes, without more exhaustive
>> benchmarking, the difference may or may not generalize.
>
>
> Whether splitting on fixed strings rather than treating them as
> regexes (i.e. `fixed=TRUE') makes a big difference seems to depend on
> what you split:
>
> First repeating what Marc did...
>
>> system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
>    user  system elapsed
>   0.132   0.010   0.139
>> system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
>    user  system elapsed
>   0.130   0.010   0.138
>
> ... fixed=TRUE hardly matters. But the idiom I proposed...
>
>> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
>    user  system elapsed
>   0.017   0.000   0.018
>> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
>    user  system elapsed
>   0.104   0.000   0.104
>>
>
> ... is nearly 6 times faster with fixed=TRUE for this case.
>
> This result matches Marc's count:
>
>> sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
> [1] 38664
>>
>
> Chuck

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319