[Rd] Efficiency of factor objects

Sat Nov 5 00:19:33 CET 2011

R factors are the natural way to represent categorical data, and they
should be efficient since they are stored as small integers.  But in
fact, for many (but not all) operations, R factors are considerably
slower than integers, or even character strings.  This appears to be
because whenever a factor vector is subsetted, the entire levels vector
is copied.  For example:

> i1 <- sample(1e4,1e6,replace=TRUE)
> c1 <- paste('x',i1)
> f1 <- factor(c1)
> system.time(replicate(1e4,{q1<-i1[100:200];1}))
   user  system elapsed
   0.03    0.00    0.04
> system.time(replicate(1e4,{q1<-c1[100:200];1}))
   user  system elapsed
   0.04    0.00    0.04
> system.time(replicate(1e4,{q1<-f1[100:200];1}))
   user  system elapsed
   0.67    0.00    0.68
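
One rough way to check the copying hypothesis is to time the same subset on factors that differ only in their number of levels. The sketch below is illustrative: time_subset is a hypothetical helper, not an existing function, and the sizes are arbitrary. If the levels vector is copied on every subset, the elapsed time should grow with the number of levels; on R versions that do not copy, the timings may be indistinguishable.

```r
# Hypothetical helper: time repeated subsetting of a factor with
# `nlevels` levels (vector length and repetition count are arbitrary).
time_subset <- function(nlevels, reps = 1e3) {
    f <- factor(sample(nlevels, 1e5, replace = TRUE),
                levels = seq_len(nlevels))
    system.time(for (k in seq_len(reps)) q <- f[100:200])[["elapsed"]]
}

# Elapsed seconds for 10, 1e3 and 1e5 levels
timings <- sapply(c(10, 1e3, 1e5), time_subset)
print(timings)
```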

Putting the levels vector in an environment speeds up subsetting:

myfactor <- function(...) {
     f <- factor(...)
     g <- unclass(f)
     class(g) <- "myfactor"
     # store the levels in an environment, so a subset copies only a reference
     attr(g,"mylevels") <- as.environment(list(levels=attr(f,"levels")))
     g }
`[.myfactor` <-
function (x, ...)
{
    y <- NextMethod("[")
    # restore the class and the shared levels environment on the subset
    attributes(y) <- attributes(x)
    y
}

> m1 <- myfactor(f1)
> system.time(replicate(1e4,{q1<-m1[100:200];1}))
   user  system elapsed
   0.05    0.00    0.04

I believe this approach can be extended to most of class factor's
functionality without problems.  Since environments are shared by
reference, any method that modifies the levels would have to copy the
environment first in order to preserve R's value semantics.  Some quick
tests seem to show that this is no slower than ordinary factors even
for very small numbers of levels.  To do this, appropriate methods for
this class (print, [<-, levels<-, etc.) would have to be written.
Perhaps some core R functions would also have to be changed?
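
To make the idea concrete, here is a minimal sketch of what a few of those methods might look like. It is an assumption about one possible design, not a tested implementation: the method bodies are mine, and this variant of myfactor uses as.integer() instead of unclass() so that the ordinary "levels" attribute is dropped and only the environment carries the levels. The levels<- method builds a fresh environment rather than modifying the shared one, which is the "copy if necessary" step.

```r
# Sketch only: a variant of myfactor plus a few hypothetical methods.
myfactor <- function(...) {
    f <- factor(...)
    g <- as.integer(f)                 # bare integer codes, no attributes
    attr(g, "mylevels") <- as.environment(list(levels = levels(f)))
    class(g) <- "myfactor"
    g
}

`[.myfactor` <- function(x, ...) {
    y <- NextMethod("[")
    attributes(y) <- attributes(x)     # class + reference to the same environment
    y
}

# Read the levels out of the environment
levels.myfactor <- function(x) get("levels", envir = attr(x, "mylevels"))

# Replace the levels: create a new environment so that other vectors
# sharing the old one are not affected (value semantics)
`levels<-.myfactor` <- function(x, value) {
    attr(x, "mylevels") <- as.environment(list(levels = as.character(value)))
    x
}

as.character.myfactor <- function(x, ...) levels(x)[as.integer(x)]

print.myfactor <- function(x, ...) {
    print(as.character(x), quote = FALSE)
    cat("Levels:", levels(x), "\n")
    invisible(x)
}

m <- myfactor(c("a", "b", "a", "c"))
s <- m[2:3]                            # shares m's levels environment
levels(s) <- c("x", "y", "z")          # s gets its own environment
print(s)
print(m)                               # m's levels are unchanged
```

A [<- method would need the same care: assigning a value that is not among the current levels either has to be rejected or has to create a new environment with the extended levels.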

Am I missing some obvious flaw in this approach?  Has anyone already
implemented a factors package using this or some similar approach?

Thanks,

             -s