[R] help with regexpr in gsub

Marc Schwartz marc_schwartz at comcast.net
Thu Jan 18 14:04:15 CET 2007


On Thu, 2007-01-18 at 04:49 +0000, Prof Brian Ripley wrote:
> One thing to watch with experiments like this is that the locale will 
> matter.  Character operations will be faster in a single-byte locale (as 
> used here) than in a variable-byte locale (and I suspect Seth and Marc 
> used UTF-8), and the relative speeds may alter.  Also, the PCRE regexps 
> are often much faster, and 'useBytes' can be much faster with ASCII data 
> in UTF-8.
> 
> For example:
> 
> # R-devel, x86_64 Linux
> library(GO)
> goids <- ls(GOTERM)
> gids <- paste(goids, "ISS", sep=".")
> go.ids <- rep(gids, 10)
> > length(go.ids)
> [1] 205950
> 
> # In en_GB (single byte)
> 
> > system.time(z <- gsub("[.].*", "", go.ids))
>     user  system elapsed
>    1.709   0.004   1.716
> > system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE))
>     user  system elapsed
>    0.241   0.004   0.246
> 
> > system.time(z <- gsub('\\..+$','', go.ids))
>     user  system elapsed
>    2.254   0.018   2.286
> > system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
>     user  system elapsed
>    2.890   0.002   2.895
> > system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
>     user  system elapsed
>    2.716   0.002   2.721
> > system.time(z <- sub("\\..+", "", go.ids))
>     user  system elapsed
>    1.724   0.001   1.725
> > system.time(z <- substr(go.ids, 0, 10))
>     user  system elapsed
>    0.084   0.000   0.084
> 
> # in en_GB.utf8
> 
> > system.time(z <- gsub("[.].*", "", go.ids))
>     user  system elapsed
>    1.689   0.020   1.712
> > system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE))
>     user  system elapsed
>    0.718   0.017   0.736
> > system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE, useBytes=TRUE))
>     user  system elapsed
>    0.243   0.001   0.244
> 
> > system.time(z <- gsub('\\..+$','', go.ids))
>     user  system elapsed
>    2.509   0.024   2.537
> > system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
>     user  system elapsed
>    3.772   0.004   3.779
> > system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
>     user  system elapsed
>    4.088   0.007   4.099
> > system.time(z <- sub("\\..+", "", go.ids))
>     user  system elapsed
>    1.920   0.004   1.927
> > system.time(z <- substr(go.ids, 0, 10))
>     user  system elapsed
>    0.096   0.002   0.098
> 
> substr still wins, but by a much smaller margin.

<snip>

Just to confirm Prof. Ripley's suspicion: I am indeed running in
en_US.UTF-8.
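
For anyone wanting to check which kind of locale their own session uses,
a minimal sketch with base R follows. The variable name `info` is just
for illustration, and the values returned will depend on the platform.

```r
# Report the character-type locale of the current session,
# for example "en_US.UTF-8" on many Linux setups.
Sys.getlocale("LC_CTYPE")

# l10n_info() returns a list describing the session's encoding.
info <- l10n_info()
info$MBCS        # TRUE if the locale uses a multibyte character set
info[["UTF-8"]]  # TRUE if that multibyte character set is UTF-8
```

If `info[["UTF-8"]]` is TRUE, the UTF-8 timings quoted above are the
relevant ones for that session.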

Thanks for taking the time to point this out.
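
The comparison quoted above can be roughly reproduced without the GO
package. The sketch below is an assumption-laden stand-in: the
identifiers are synthetic (generated with sprintf, not taken from
GOTERM), the vector is slightly shorter than the original 205950
elements, and the object names t_default, t_perl, t_bytes and t_substr
are made up for illustration. Timings will vary with machine, R version
and locale, so none are shown.

```r
# Synthetic stand-ins for the GO identifiers (hypothetical data):
# "GO:0000001.ISS", "GO:0000002.ISS", ...
ids <- sprintf("GO:%07d.ISS", 1:20000)
go.ids <- rep(ids, 10)

# Default regular expression engine
t_default <- system.time(z1 <- gsub("[.].*", "", go.ids))

# PCRE engine
t_perl <- system.time(z2 <- gsub("[.].*", "", go.ids, perl = TRUE))

# PCRE with byte-wise matching; reasonable here because the data are pure ASCII
t_bytes <- system.time(
  z3 <- gsub("[.].*", "", go.ids, perl = TRUE, useBytes = TRUE)
)

# Fixed-width extraction; relies on every identifier being exactly 10 characters
t_substr <- system.time(z4 <- substr(go.ids, 1, 10))

# All four approaches should produce the same result
stopifnot(identical(z1, z2), identical(z1, z3), identical(z1, z4))

# Collect user, system and elapsed times into one table
rbind(default = t_default, perl = t_perl,
      bytes = t_bytes, substr = t_substr)[, 1:3]
```

Running this once under a single-byte locale and once under a UTF-8
locale should show whether the relative speeds shift as described.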

Best regards,

Marc


