[Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows

Karl Ove Hufthammer karl at huftis.org
Fri Jan 21 10:47:56 CET 2011

[I originally posted this on the R-help mailing list, and it was suggested that R-devel would be a better
place to discuss it.]

Running ‘table’ on a factor with levels containing non-ASCII characters
seems to result in extremely bad performance on Windows. Here’s a simple
example with benchmark results (I’ve reduced the number of replications to
make the function finish within a reasonable time):

  library(rbenchmark)  # provides benchmark()
  x.num=sample(1:2, 10^5, replace=TRUE)
  x.fac.ascii=factor(x.num, levels=1:2, labels=c("A","B"))
  x.fac.nascii=factor(x.num, levels=1:2, labels=c("Æ","Ø"))
  benchmark( table(x.num), table(x.fac.ascii), table(x.fac.nascii), table(unclass(x.fac.nascii)), replications=20 )
                            test replications elapsed   relative user.self sys.self user.child sys.child
  4 table(unclass(x.fac.nascii))           20    1.53   4.636364      1.51     0.01         NA        NA
  2           table(x.fac.ascii)           20    0.33   1.000000      0.33     0.00         NA        NA
  3          table(x.fac.nascii)           20  146.67 444.454545     38.52    81.74         NA        NA
  1                 table(x.num)           20    1.55   4.696970      1.53     0.01         NA        NA
  R version 2.12.1 (2010-12-16)
  Platform: i386-pc-mingw32/i386 (32-bit)
  [1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252  LC_CTYPE=Norwegian-Nynorsk_Norway.1252    LC_MONETARY=Norwegian-Nynorsk_Norway.1252
  [4] LC_NUMERIC=C                              LC_TIME=Norwegian-Nynorsk_Norway.1252   
  attached base packages:
  [1] stats     graphics  grDevices datasets  utils     methods   base    
  other attached packages:
  [1] rbenchmark_0.3

The timings are from R 2.12.1, but I also get comparable results
on the latest prerelease (R 2.13.0 2011-01-18 r54032).

Running the same test (100 replications) on a Linux system with
R 2.12.1 Patched results in essentially no difference between the
performance on ASCII factors and non-ASCII factors:

                            test replications elapsed relative user.self sys.self user.child sys.child
  4 table(unclass(x.fac.nascii))          100   4.607 3.096102     4.455    0.092          0         0
  2           table(x.fac.ascii)          100   1.488 1.000000     1.459    0.028          0         0
  3          table(x.fac.nascii)          100   1.616 1.086022     1.560    0.051          0         0
  1                 table(x.num)          100   4.504 3.026882     4.403    0.079          0         0

  R version 2.12.1 Patched (2011-01-18 r54033)
  Platform: i686-pc-linux-gnu (32-bit)
   [1] LC_CTYPE=nn_NO.UTF-8       LC_NUMERIC=C               LC_TIME=nn_NO.UTF-8       
   [4] LC_COLLATE=nn_NO.UTF-8     LC_MONETARY=C              LC_MESSAGES=nn_NO.UTF-8   
   [7] LC_PAPER=nn_NO.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
  attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     

  other attached packages:
  [1] rbenchmark_0.3

Profiling the ‘table’ function indicates almost all the time is spent in
the ‘match’ function, which is used when ‘factor’ is used on a ‘factor’
inside ‘table’. Indeed, ‘x.fac.nascii = factor(x.fac.nascii)’ by itself
is extremely slow.
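
The factor-on-factor step can be timed on its own, without ‘table’ or the
rbenchmark package. The following is only a minimal sketch using base R’s
‘system.time’; the seed, the five repetitions and the names ‘t.ascii’ and
‘t.nascii’ are arbitrary choices made for illustration, and the absolute
numbers will depend on platform, locale and R version.

```r
set.seed(1)  # arbitrary seed, only so the data are reproducible
x.num <- sample(1:2, 10^5, replace = TRUE)
x.fac.ascii  <- factor(x.num, levels = 1:2, labels = c("A", "B"))
x.fac.nascii <- factor(x.num, levels = 1:2, labels = c("Æ", "Ø"))

# Time only the step of calling factor() on an existing factor,
# once for ASCII levels and once for non-ASCII levels.
t.ascii  <- system.time(for (i in 1:5) factor(x.fac.ascii))[["elapsed"]]
t.nascii <- system.time(for (i in 1:5) factor(x.fac.nascii))[["elapsed"]]

# Elapsed seconds for the two cases, side by side.
c(ascii = t.ascii, nascii = t.nascii)
```

If the two timings differ by orders of magnitude, the slowdown is already
present in ‘factor’ itself rather than in the rest of ‘table’.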

Is there any theoretical reason ‘factor’ on ‘factor’ with non-ASCII
characters must be so slow? And why doesn’t this happen on Linux?

Perhaps a fix for ‘table’ might be calculating the ‘table’ statistics
*including* all levels (not using the ‘factor’ function anywhere),
and then removing the ‘exclude’ levels in the end. For example,
something along these lines:

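  # table.modified.to.not.use.factor is a placeholder for a version of
  # table() that counts all levels without calling factor() internally;
  # exclude is the vector of levels to drop, as in table()'s argument.
  res = table.modified.to.not.use.factor(...)
  ind = lapply(dimnames(res), function(x) !(x %in% exclude))
  do.call("[", c(list(res), ind, drop=FALSE))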

(I haven’t tested this very much, so there may be issues with this
way of doing things.)
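
To make the idea concrete for the simplest case, here is a hedged sketch
for a single factor. The helper ‘count_levels’ is hypothetical (it is not
part of R and not a proposed patch): it counts the integer codes with
‘tabulate’, so ‘factor’ and ‘match’ are never applied to the full data
vector, and the excluded levels are then dropped with the same ‘lapply’ and
‘do.call’ step as above. It ignores everything else ‘table’ does (several
arguments, ‘useNA’, ‘dnn’, the "table" class).

```r
# Hypothetical helper: count every level of one factor by its integer
# codes, without calling factor() on the data again.
count_levels <- function(f) {
  stopifnot(is.factor(f))
  lev <- levels(f)
  # tabulate() counts the codes 1..length(lev); NA codes are ignored.
  array(tabulate(as.integer(f), nbins = length(lev)),
        dim = length(lev), dimnames = list(lev))
}

x <- factor(c("Æ", "Ø", "Æ", "Æ"), levels = c("Æ", "Ø", "Å"))
exclude <- "Å"  # level to drop, playing the role of table()'s 'exclude'

# Step 1: counts including all levels.
res <- count_levels(x)

# Step 2: remove the excluded levels at the end. %in% is applied only to
# the (short) vector of level names, not to the data.
ind <- lapply(dimnames(res), function(l) !(l %in% exclude))
res2 <- do.call("[", c(list(res), ind, drop = FALSE))
res2
```

The point of the design is that the only string matching left is on the
level names, so its cost no longer grows with the length of the data.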

Karl Ove Hufthammer
