[Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows

Mon Jan 24 18:30:38 CET 2011

I'm not sure, but note the difference in locale between
Linux (UTF-8) and Windows (non UTF-8). As far as I
understand it R much prefers UTF-8, which Windows doesn't
natively support. Otherwise you could just change your
Windows locale to a UTF-8 locale to make R happier.

My stab in the dark would be that the poor performance on
Windows in this case may be down to many calls to
translateCharUTF8 internally.
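
To check the encoding side of that guess, a small sketch like the one below could be run on both machines. It only reports what kind of locale the session is in and how the label strings are marked; it does not measure anything. The names `info`, `labs`, `mark_labs` and `mark_ascii` are mine, and the \u escapes are just a locale-independent way of writing the two letters from the benchmark.

```r
# Which kind of locale is this session running in?
info <- l10n_info()
print(info[c("MBCS", "UTF-8", "Latin-1")])

# The two non-ASCII labels from the benchmark, written with escapes
# so that this script does not depend on the file's own encoding.
labs <- c("\u00c6", "\u00d8")

# Declared encoding marks: non-ASCII strings may carry a "latin1" or
# "UTF-8" mark, while pure ASCII strings are always "unknown".
mark_labs  <- Encoding(labs)
mark_ascii <- Encoding(c("A", "B"))
print(mark_labs)
print(mark_ascii)
```

If the marks differ between the Linux and Windows sessions, that would at least be consistent with extra translation work on one side, though it would not prove it.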

There was a change in R 2.12.0 in this area. Running your
test in R 2.11.1 on Windows shows the same problem though
so it doesn't look like that change caused this problem.

From NEWS 2.12.0:
o  unique() and match() are now faster on character vectors
    where all elements are in the global CHARSXP cache and
    have unmarked encoding (ASCII). Thanks to Matthew
    Dowle for suggesting improvements to the way the hash
    code is generated in 'unique.c'

If anybody knows a way to trick R on Linux into thinking it has
an encoding similar to Windows then I may be able to take a
look if I can reproduce the problem in Linux.
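
One thing I might try, as an untested sketch only, is to build the factor labels as latin1-marked strings with iconv() in a UTF-8 session on Linux and time table() on them. This is an assumption about what might mimic the Windows case, not a known reproduction: the strings are marked latin1, but the session locale itself stays UTF-8. The variable names ending in .utf8 and .latin1 are mine.

```r
# Labels as UTF-8 strings (escapes avoid depending on file encoding).
utf8_labs <- c("\u00c6", "\u00d8")

# Same labels converted to Latin-1; iconv() marks the result "latin1".
latin1_labs <- iconv(utf8_labs, from = "UTF-8", to = "latin1")
print(Encoding(latin1_labs))

# Same data as in the original benchmark, with a fixed seed.
set.seed(1)
x.num <- sample(1:2, 10^5, replace = TRUE)
x.fac.utf8   <- factor(x.num, levels = 1:2, labels = utf8_labs)
x.fac.latin1 <- factor(x.num, levels = 1:2, labels = latin1_labs)

# Time a few repetitions of table() on each factor.
print(system.time(for (i in 1:5) table(x.fac.utf8)))
print(system.time(for (i in 1:5) table(x.fac.latin1)))
```

If the second timing turns out much larger than the first, that would point towards encoding translation inside match(); if not, the Windows locale itself is probably needed to see the problem.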

Matthew

"Karl Ove Hufthammer" <karl at huftis.org> wrote in message 
news:ihbko3$efs$1 at dough.gmane.org...
> [I originally posted this on the R-help mailing list, and it was suggested
> that R-devel would be a better place to discuss it.]
>
> Running 'table' on a factor with levels containing non-ASCII characters
> seems to result in extremely bad performance on Windows. Here's a simple
> example with benchmark results (I've reduced the number of replications to
> make the function finish within reasonable time):
>
>  library(rbenchmark)
>  x.num=sample(1:2, 10^5, replace=TRUE)
>  x.fac.ascii=factor(x.num, levels=1:2, labels=c("A","B"))
>  x.fac.nascii=factor(x.num, levels=1:2, labels=c("Æ","Ø"))
>  benchmark( table(x.num), table(x.fac.ascii), table(x.fac.nascii), 
> table(unclass(x.fac.nascii)), replications=20 )
>
>                            test replications elapsed   relative user.self sys.self user.child sys.child
>  4 table(unclass(x.fac.nascii))           20    1.53   4.636364      1.51     0.01         NA        NA
>  2           table(x.fac.ascii)           20    0.33   1.000000      0.33     0.00         NA        NA
>  3          table(x.fac.nascii)           20  146.67 444.454545     38.52    81.74         NA        NA
>  1                 table(x.num)           20    1.55   4.696970      1.53     0.01         NA        NA
>
>  sessionInfo()
>  R version 2.12.1 (2010-12-16)
>  Platform: i386-pc-mingw32/i386 (32-bit)
>
>  locale:
>  [1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252  LC_CTYPE=Norwegian-Nynorsk_Norway.1252    LC_MONETARY=Norwegian-Nynorsk_Norway.1252
>  [4] LC_NUMERIC=C                              LC_TIME=Norwegian-Nynorsk_Norway.1252
>
>  attached base packages:
>  [1] stats     graphics  grDevices datasets  utils     methods   base
>
>  other attached packages:
>  [1] rbenchmark_0.3
>
> The timings are from R 2.12.1, but I also get comparable results
> on the latest prerelease (R 2.13.0 2011-01-18 r54032).
>
> Running the same test (100 replications) on a Linux system with
> R 2.12.1 Patched results in essentially no difference between the
> performance on ASCII factors and non-ASCII factors:
>
>                            test replications elapsed relative user.self sys.self user.child sys.child
>  4 table(unclass(x.fac.nascii))          100   4.607 3.096102     4.455    0.092          0         0
>  2           table(x.fac.ascii)          100   1.488 1.000000     1.459    0.028          0         0
>  3          table(x.fac.nascii)          100   1.616 1.086022     1.560    0.051          0         0
>  1                 table(x.num)          100   4.504 3.026882     4.403    0.079          0         0
>
>  sessionInfo()
>  R version 2.12.1 Patched (2011-01-18 r54033)
>  Platform: i686-pc-linux-gnu (32-bit)
>
>  locale:
>   [1] LC_CTYPE=nn_NO.UTF-8       LC_NUMERIC=C               LC_TIME=nn_NO.UTF-8
>   [4] LC_COLLATE=nn_NO.UTF-8     LC_MONETARY=C              LC_MESSAGES=nn_NO.UTF-8
>   [7] LC_PAPER=nn_NO.UTF-8       LC_NAME=C                  LC_ADDRESS=C
>  [10] LC_TELEPHONE=C             LC_MEASUREMENT=nn_NO.UTF-8 LC_IDENTIFICATION=C
>
>  attached base packages:
>  [1] stats     graphics  grDevices utils     datasets  methods   base
>
>  other attached packages:
>  [1] rbenchmark_0.3
>
> Profiling the 'table' function indicates almost all the time is spent in
> the 'match' function, which is used when 'factor' is used on a 'factor'
> inside 'table'. Indeed, 'x.fac.nascii = factor(x.fac.nascii)' by itself
> is extremely slow.
>
> Is there any theoretical reason 'factor' on 'factor' with non-ASCII
> characters must be so slow? And why doesn't this happen on Linux?
>
> Perhaps a fix for 'table' might be calculating the 'table' statistics
> *including* all levels (not using the 'factor' function anywhere),
> and then removing the 'exclude' levels in the end. For example,
> something along these lines:
>
> res = table.modified.to.not.use.factor(...)
> ind = lapply(dimnames(res), function(x) !(x %in% exclude))
> do.call("[", c(list(res), ind, drop=FALSE))
>
> (I haven't tested this very much, so there may be issues with this
> way of doing things.)
>
> -- 
> Karl Ove Hufthammer