[R] Frequency of a character in a string

Mon Nov 14 21:44:10 CET 2016

(Sheepishly)...

Yes, thank you Hervé. It would have been nice if I had given correct
solutions. Of course, fixed = TRUE could not have worked with the ["a"]
character class!

Here's what I found with a 10 element vector each member of which is a
1e5 length string:

> system.time((lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1))
   user  system elapsed
  0.013   0.000   0.013

> system.time(nchar(gsub("[^a]", "", x,fixed = FALSE)))
   user  system elapsed
  0.251   0.000   0.252
## WAYYYY slower

> system.time(nchar(x) - nchar(gsub("a", "", x,fixed = TRUE)))
   user  system elapsed
  0.007   0.000   0.007
## about twice as fast as the strsplit() version
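
The timings above do not show how x was built. As a rough sketch, a vector
of that shape could be constructed as below (the seed and the use of
replicate() are assumptions, not what was actually run), and the three
expressions can be checked for agreement before comparing their speed:

```r
set.seed(1)
# Hypothetical test data: 10 strings of 1e5 random lowercase letters each.
x <- replicate(10, paste(sample(letters, 1e5, replace = TRUE), collapse = ""))

# 1. Split on "a". The "X" padding keeps strsplit() from dropping a trailing
#    empty piece when a string ends in "a", which would undercount by one.
n_split <- lengths(strsplit(paste0("X", x, "X"), "a", fixed = TRUE)) - 1L

# 2. Regex: delete everything that is not "a", then count what is left.
n_regex <- nchar(gsub("[^a]", "", x))

# 3. Fixed: delete every "a" and see how much shorter each string became.
n_fixed <- nchar(x) - nchar(gsub("a", "", x, fixed = TRUE))

# All three should give the same integer vector of per-string counts.
stopifnot(identical(n_split, n_regex), identical(n_regex, n_fixed))
```

Wrapping any of the three in system.time() then reproduces the kind of
comparison shown above, though the exact numbers will vary by machine.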

Clearly and unsurprisingly, the message is to avoid fixed = FALSE;
after that, it seems mostly to be: who cares?!
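
For anyone who wants this as a reusable function, here is a minimal sketch
built on the fixed = TRUE gsub() approach. The name count_char() is made up
for illustration and is not part of base R. Besides speed, fixed = TRUE
means the character is taken literally, so regex metacharacters such as "."
are counted correctly:

```r
# count_char() is a hypothetical helper name, not a base R function.
# x:  character vector of strings to search
# ch: a single character to count, matched literally
count_char <- function(x, ch) {
  stopifnot(is.character(x), is.character(ch),
            length(ch) == 1L, nchar(ch) == 1L)
  # Each removed occurrence shortens the string by exactly one character.
  nchar(x) - nchar(gsub(ch, "", x, fixed = TRUE))
}

count_char(c("banana", "cherry", ""), "a")
# [1] 3 0 0

count_char("a.b.c", ".")
# [1] 2
```

With fixed = FALSE the second call would treat "." as "any character" and
report 5 instead of 2.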

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Mon, Nov 14, 2016 at 12:26 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
> Hi,
>
> FWIW using gsub( , fixed=TRUE) is faster than using gsub( , fixed=FALSE)
> or strsplit( , fixed=TRUE):
>
>   set.seed(1)
>   Vec <- paste(sample(letters, 5000000, replace = TRUE), collapse = "")
>
>   system.time(res1 <- nchar(gsub("[^a]", "", Vec)))
>   #  user  system elapsed
>   # 0.585   0.000   0.586
>
>   system.time(res2 <- lengths(strsplit(Vec,"a",fixed=TRUE)) - 1L)
>   #  user  system elapsed
>   # 0.061   0.000   0.061
>
>   system.time(res3 <- nchar(Vec) - nchar(gsub("a", "", Vec, fixed=TRUE)))
>   #  user  system elapsed
>   # 0.039   0.000   0.039
>
>   identical(res1, res2)
>   # [1] TRUE
>   identical(res1, res3)
>   # [1] TRUE
>
> The gsub( , fixed=TRUE) solution also uses slightly less memory than the
> strsplit( , fixed=TRUE) solution.
>
> Cheers,
> H.
>
>
> On 11/14/2016 11:55 AM, Charles C. Berry wrote:
>>
>> On Mon, 14 Nov 2016, Marc Schwartz wrote:
>>
>>>
>>>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
>>>>
>>>> On Mon, 14 Nov 2016, Bert Gunter wrote:
>>>>
>> [stuff deleted]
>>
>>> Hi,
>>>
>>> Both gsub() and strsplit() are using regex based pattern matching
>>> internally. That being said, they are ultimately calling .Internal
>>> code, so both are pretty fast.
>>>
>>> For comparison:
>>>
>>> ## Create a single string of 1,000,000 characters
>>> set.seed(1)
>>> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
>>>
>>>> nchar(Vec)
>>>
>>> [1] 1000000
>>>
>>> ## Split the vector into single characters and tabulate
>>>>
>>>> table(strsplit(Vec, split = "")[[1]])
>>>
>>>
>>>    a     b     c     d     e     f     g     h     i     j     k     l
>>> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
>>>    m     n     o     p     q     r     s     t     u     v     w     x
>>> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
>>>    y     z
>>> 38265 38299
>>>
>>>
>>> ## Get just the count of "a"
>>>>
>>>> table(strsplit(Vec, split = "")[[1]])["a"]
>>>
>>>    a
>>> 38664
>>>
>>>> nchar(gsub("[^a]", "", Vec))
>>>
>>> [1] 38664
>>>
>>>
>>> ## Check performance
>>>>
>>>> system.time(table(strsplit(Vec, split = "")[[1]])["a"])
>>>
>>>   user  system elapsed
>>>  0.100   0.007   0.107
>>>
>>>> system.time(nchar(gsub("[^a]", "", Vec)))
>>>
>>>   user  system elapsed
>>>  0.270   0.001   0.272
>>>
>>>
>>> So, the above would suggest that using strsplit() is somewhat faster
>>> than using gsub(). However, as Chuck notes, in the absence of more
>>> exhaustive benchmarking, the difference may or may not generalize.
>>
>>
>>
>> Whether splitting on fixed strings rather than treating them as
>> regexes (i.e. `fixed=TRUE') makes a big difference seems to depend on
>> what you split:
>>
>> First repeating what Marc did...
>>
>>> system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
>>
>>    user  system elapsed
>>   0.132   0.010   0.139
>>>
>>> system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
>>
>>    user  system elapsed
>>   0.130   0.010   0.138
>>
>> ... fixed=TRUE hardly matters. But the idiom I proposed...
>>
>>> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
>>
>>    user  system elapsed
>>   0.017   0.000   0.018
>>>
>>> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
>>
>>    user  system elapsed
>>   0.104   0.000   0.104
>>>
>>>
>>
>> ... is more than 5 times faster with fixed=TRUE for this case.
>>
>> This result matches Marc's count:
>>
>>> sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
>>
>> [1] 38664
>>>
>>>
>>
>> Chuck
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319