[Rd] Deep copy of factor levels?

Kirill Müller kirill.mueller at ivt.baug.ethz.ch
Mon Mar 17 10:13:37 CET 2014


Hi


It seems that selecting an element of a factor copies its levels 
(Ubuntu 13.04, R 3.0.2). Below is the output of a script that creates a 
factor with 10000 elements and then calls as.list() on it. The new 
object seems to use more than 700 MB, and inspection of the levels of 
the individual elements of the list suggests that they are distinct 
objects: the STRSXP addresses differ between elements, although the 
cached CHARSXP entries they point to are shared.

Perhaps some performance gain could be achieved by copying the levels 
"by reference", but I don't know R internals well enough to see if it's 
possible. Is there a particular reason for creating a full copy of the 
factor levels?
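
For illustration, here is a minimal sketch of a workaround that avoids 
per-element copies of the levels: keep the levels once and split only the 
integer codes. The helper element_as_factor() is hypothetical (it is not 
part of base R or plyr), and whether this helps in rbind.fill is untested.

```r
x <- factor(seq_len(1e3))

# Keep a single copy of the levels and split only the integer codes.
# as.integer() drops the "levels" and "class" attributes, so the list
# elements are plain length-one integer vectors.
lev   <- levels(x)
codes <- as.list(as.integer(x))

# Hypothetical helper: rebuild one element as a factor only on demand.
element_as_factor <- function(i) factor(lev[codes[[i]]], levels = lev)

# The rebuilt element matches what subsetting the factor returns.
stopifnot(identical(element_as_factor(3L), x[3]))
```

The trade-off is that the list elements are no longer factors themselves, 
so any code consuming the list has to carry lev alongside it.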

This has come up when looking at the performance of rbind.fill (in the 
plyr package) with factors: https://github.com/hadley/plyr/issues/206 .


Best regards

Kirill



 > gc()
           used (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells  325977 17.5    1074393  57.4  10049951  536.8
Vcells 4617168 35.3   87439742 667.2 204862160 1563.0
 > system.time(x <- factor(seq_len(1e4)))
    user  system elapsed
   0.008   0.000   0.007
 > system.time(xx <- as.list(x))
    user  system elapsed
   4.263   0.000   4.322
 > gc()
             used  (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells    385991  20.7    1074393  57.4  10049951  536.8
Vcells 104672187 798.6  112367694 857.3 204862160 1563.0
 > .Internal(inspect(levels(xx[[1]])))
@387f620 16 STRSXP g1c7 [MARK,NAM(2)] (len=10000, tl=0)
   @144da4e8 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "1"
   @144da518 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "2"
   @27d1298 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "3"
   @144da548 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "4"
   @144da578 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "5"
   ...
 > .Internal(inspect(levels(xx[[2]])))
@1b38cb90 16 STRSXP g1c7 [MARK,NAM(2)] (len=10000, tl=0)
   @144da4e8 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "1"
   @144da518 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "2"
   @27d1298 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "3"
   @144da548 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "4"
   @144da578 09 CHARSXP g1c1 [MARK,gp=0x60] [ASCII] [cached] "5"
   ...
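
As a rough cross-check of the figures above, the growth in Vcells can be 
compared with the size of 10000 separate level vectors of 10000 entries 
each. This is only a back-of-the-envelope sketch; it assumes a 64-bit build 
(8-byte pointers per STRSXP slot) and ignores vector headers.

```r
# Vcells "used" before and after as.list(), taken from the gc() output above.
vcells_before   <- 4617168
vcells_after    <- 104672187
bytes_per_vcell <- 8   # gc() counts Vcells in units of 8 bytes

observed_mib <- (vcells_after - vcells_before) * bytes_per_vcell / 2^20

# Assumption: 64-bit build, so every slot of a character vector
# holds an 8-byte pointer to a (shared, cached) CHARSXP.
n_elements        <- 1e4
n_levels          <- 1e4
bytes_per_pointer <- 8
expected_mib <- n_elements * n_levels * bytes_per_pointer / 2^20

round(c(observed = observed_mib, expected = expected_mib), 1)
```

Both values come out at about 763 MiB, which is consistent with one full 
copy of the levels vector per list element.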


