[R] Frequency of a character in a string

William Dunlap wdunlap at tibco.com
Mon Nov 14 21:57:17 CET 2016


Here is another variant, v3, and a change to your first example
so it returns the same value as your second example.

> set.seed(1001)
> x <- sapply(1:100,
+   function(x)paste0(sample(letters,rpois(1,1e5),rep=TRUE),collapse = ""))
> system.time(v1 <- lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1)
   user  system elapsed
   0.47    0.00    0.49
> system.time(v2 <- nchar(gsub("[^a]", "", x)))
   user  system elapsed
   2.53    0.00    2.53
> system.time(v3 <- nchar(x) - nchar(gsub("a", "", x, fixed=TRUE)))
   user  system elapsed
   0.08    0.00    0.08
>
> all.equal(v1,v2)
[1] TRUE
> all.equal(v1,v3)
[1] TRUE
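
For reuse, the v3 idea can be wrapped in a small function. This is only a
sketch: the name count_char and its argument names are made up for
illustration and are not from this thread or from base R. It assumes a
non-empty literal substring and counts non-overlapping occurrences.

```r
# Hypothetical helper (count_char is an illustrative name, not a base R function).
# Counts non-overlapping occurrences of the literal string `ch` in each element of `x`.
count_char <- function(x, ch = "a") {
  stopifnot(is.character(x), is.character(ch),
            length(ch) == 1L, nchar(ch) >= 1L)
  # Deleting every occurrence shortens each string by nchar(ch) per occurrence,
  # so the difference in lengths divided by nchar(ch) is the count.
  (nchar(x) - nchar(gsub(ch, "", x, fixed = TRUE))) / nchar(ch)
}

counts <- count_char(c("banana", "xyz", ""), "a")
print(counts)  # 3 0 0

# With fixed = TRUE the pattern is taken literally, so a regex such as
# "[^a]" matches nothing in "banana" and the string is returned unchanged.
literal <- nchar(gsub("[^a]", "", "banana", fixed = TRUE))
print(literal)  # 6, the full length, not the number of "a"s
```

The second part shows why fixed=TRUE should be paired with a literal
pattern such as "a" and not with a character class like "[^a]".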


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Mon, Nov 14, 2016 at 12:23 PM, Bert Gunter <bgunter.4567 at gmail.com>
wrote:

> Chuck, Marc, and anyone else who still has interest in this odd little
> discussion ...
>
> Yes, and with fixed = TRUE my approach took 1/3 as much time as
> Chuck's with a 10 element vector each element of which is a character
> string of length 1e5:
>
> > set.seed(1001)
> > x <- sapply(1:10, function(x)paste0(sample(letters,1e5,rep=TRUE),collapse = ""))
>
> > system.time(sum(lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1))
>    user  system elapsed
>   0.012   0.000   0.012
> > system.time(nchar(gsub("[^a]", "", x,fixed = TRUE)))
>    user  system elapsed
>   0.004   0.000   0.004
>
> Best,
> Bert
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Mon, Nov 14, 2016 at 11:55 AM, Charles C. Berry <ccberry at ucsd.edu>
> wrote:
> > On Mon, 14 Nov 2016, Marc Schwartz wrote:
> >
> >>
> >>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccberry at ucsd.edu>
> >>> wrote:
> >>>
> >>> On Mon, 14 Nov 2016, Bert Gunter wrote:
> >>>
> > [stuff deleted]
> >
> >
> >> Hi,
> >>
> >> Both gsub() and strsplit() are using regex based pattern matching
> >> internally. That being said, they are ultimately calling .Internal
> >> code, so both are pretty fast.
> >>
> >> For comparison:
> >>
> >> ## Create a 1,000,000 character vector
> >> set.seed(1)
> >> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
> >>
> >>> nchar(Vec)
> >>
> >> [1] 1000000
> >>
> >> ## Split the vector into single characters and tabulate
> >>>
> >>> table(strsplit(Vec, split = "")[[1]])
> >>
> >>
> >>    a     b     c     d     e     f     g     h     i     j     k     l
> >> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
> >>    m     n     o     p     q     r     s     t     u     v     w     x
> >> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
> >>    y     z
> >> 38265 38299
> >>
> >>
> >> ## Get just the count of "a"
> >>>
> >>> table(strsplit(Vec, split = "")[[1]])["a"]
> >>
> >>    a
> >> 38664
> >>
> >>> nchar(gsub("[^a]", "", Vec))
> >>
> >> [1] 38664
> >>
> >>
> >> ## Check performance
> >>>
> >>> system.time(table(strsplit(Vec, split = "")[[1]])["a"])
> >>
> >>   user  system elapsed
> >>  0.100   0.007   0.107
> >>
> >>> system.time(nchar(gsub("[^a]", "", Vec)))
> >>
> >>   user  system elapsed
> >>  0.270   0.001   0.272
> >>
> >>
> >> So, the above would suggest that using strsplit() is somewhat faster
> >> than using gsub(). However, as Chuck notes, in the absence of more
> >> exhaustive benchmarking, the difference may or may not generalize.
> >
> >
> >
> > Whether splitting on fixed strings rather than treating them as
> > regex'es (i.e.`fixed=TRUE') makes a big difference seems to depend on
> > what you split:
> >
> > First repeating what Marc did...
> >
> >> system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
> >
> >    user  system elapsed
> >   0.132   0.010   0.139
> >>
> >> system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
> >
> >    user  system elapsed
> >   0.130   0.010   0.138
> >
> > ... fixed=TRUE hardly matters. But the idiom I proposed...
> >
> >> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
> >
> >    user  system elapsed
> >   0.017   0.000   0.018
> >>
> >> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
> >
> >    user  system elapsed
> >   0.104   0.000   0.104
> >>
> >>
> >
> > ... is 5 times faster with fixed=TRUE for this case.
> >
> > This result matches Marc's count:
> >
> >> sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
> >
> > [1] 38664
> >>
> >>
> >
> > Chuck