[Rd] efficiency and memory use of S4 data objects

John Chambers jmc at research.bell-labs.com
Thu Aug 21 12:05:02 MEST 2003

The general question is certainly worth discussing, but I'd be surprised
if your example is measuring what you think it is.

The numeric computations are almost the only thing NOT radically changed
between your two examples.  In the first,  you are applying a
"primitive" set of functions ("+", "$", and "$<-") to a basic vector. 
These functions go directly to C code, without creating a context (aka
frame) as would a call to an S-language function.  In the second
example, the "+" will still be done in C code, with essentially no
change since the arguments will still be basic vectors.
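To see that call-frame cost in isolation, here is a small illustrative timing sketch (the wrapper function `plus` is mine, not part of the original example): both loops do identical arithmetic, but the second pays for an S-language call frame on every iteration.

```r
## Primitive "+" vs. an S-language wrapper around it.  The wrapper computes
## the same thing but creates a context (frame) on every invocation.
plus <- function(a, b) a + b

x <- 0
t_prim <- system.time(for (i in 1:1e6) x <- x + 1)       # primitive, no frame
t_wrap <- system.time(for (i in 1:1e6) x <- plus(x, 1))  # one frame per call

print(rbind(primitive = t_prim, wrapper = t_wrap))
```

On most machines the wrapper loop is noticeably slower, even though the "+" itself still runs in C both times.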

Just about everything else, however, will be different.  If you really
wanted to focus on the numeric computation, your second example would be
more relevant with the loop being
  for(i in 1:iter) object@x <- object@x + 1
In this case, the difference likely will be mainly the overhead for
"@<-", which is not a primitive.  The example as written is adding a
layer of functions and method dispatch.
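For concreteness, a sketch of that refined comparison, reusing the class and accessors from the quoted example (only the object and the timing scaffolding are mine):

```r
library(methods)  # needed when run non-interactively, e.g. via Rscript

setClass("MyClass", representation(x = "numeric"))
setGeneric("x", function(object) standardGeneric("x"))
setMethod("x", "MyClass", function(object) object@x)
setGeneric("x<-", function(object, value) standardGeneric("x<-"))
setReplaceMethod("x", "MyClass",
                 function(object, value) { object@x <- value; object })

obj <- new("MyClass", x = rnorm(1e4))

## Direct slot update: only the "@<-" overhead, no generic dispatch.
t_slot <- system.time(for (i in 1:1000) obj@x <- obj@x + 1)
## Accessor update: adds two generic-function calls (and their contexts)
## per iteration on top of the same numeric work.
t_disp <- system.time(for (i in 1:1000) x(obj) <- x(obj) + 1)

print(rbind(slot = t_slot, dispatch = t_disp))
```

The gap between the two rows is the layer of functions and method dispatch, separated from the numeric computation itself.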

But the lesson we've learned over many years (and with S generally, not
specifically with methods and classes) is that empirical inference about
efficiency is a subtle thing (somehow as statisticians you'd think we
would expect that).  Artificial examples have to be very carefully
designed and analysed before being taken at face value.

R has some useful tools, especially Rprof, to look at examples in the
hope of finding "hot spots".  It would be good to see some results,
especially for realistic examples.
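A sketch of what that might look like (the class name `Prof` and the output file name are arbitrary, and the workload is inflated so the sampling profiler has something to record):

```r
library(methods)

setClass("Prof", representation(x = "numeric"))
obj <- new("Prof", x = rnorm(2e6))

Rprof("s4-loop.out", interval = 0.01)   # start the sampling profiler
for (i in 1:300) obj@x <- obj@x + 1     # the loop under study
Rprof(NULL)                             # stop profiling

## summaryRprof() ranks functions by self and total time; a genuine hot
## spot shows up at the top of $by.self.
prof <- summaryRprof("s4-loop.out")
print(head(prof$by.self))
unlink("s4-loop.out")
```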

Anyway, on the general question.

1.  Yes, there are lots of possibilities for speeding up method dispatch
& hopefully these will get a chance to be tried out, after 1.8.  But I
would caution against assuming that method dispatch is the hot spot
_generally_.  On the couple of occasions when it was, the cause was an
introduced glitch, and then the effect was obvious.  There are some indirect
costs, such as creating a separate context for the method, and if these
are shown to be an issue, something might be done.

2. Memory use and the effect on garbage collection:  Not too much has
been studied here & some good data would be helpful.  (Especially if
some experts on storage management in R could offer advice.)
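As a starting point, a sketch of the kind of measurement that would help (the class name `Mem` is mine; note that object.size() is only an approximation, and gc() figures can include unrelated allocations):

```r
library(methods)

setClass("Mem", representation(x = "numeric"))

as_list <- list(x = rnorm(1e6))
as_s4   <- new("Mem", x = rnorm(1e6))

## Per-object footprint: the S4 wrapper itself adds little on top of the
## underlying numeric vector.
print(object.size(as_list))
print(object.size(as_s4))

## gc() reports cells in use; differencing before/after an operation gives
## a rough idea of the allocation (hence GC pressure) it causes.
before <- gc()
tmp    <- as_s4@x + 1
after  <- gc()
print(after[, "used"] - before[, "used"])
```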

3.  It might be more effective (and certainly more fun) to think of
changes in the context of "modernizing" some of the computations in R
generally.  There have been several suggestions discussed that in
principle could speed up method/class computations, along with providing
other new features.

4. Meanwhile, the traditional S style that has worked well probably
applies.  First, try out a variety of analyses taking advantage of
high-level concepts to program quickly.  Then, when it's clear that
something needs to be applied extensively, try to identify critical
computations that could be mapped into lower-level versions (maybe even
C code), getting efficiency by giving up flexibility.
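A miniature sketch of that mapping (names mine): keep the high-level interface for flexibility, but hoist the repeated slot access out of the critical loop so the inner computation runs entirely on a basic vector with primitive operations.

```r
library(methods)

setClass("MyClass", representation(x = "numeric"))
obj  <- new("MyClass", x = rnorm(1e5))
iter <- 100

## High-level version: slot access (plus any accessor dispatch) inside
## the loop, once per iteration.
slow <- obj
for (i in 1:iter) slow@x <- slow@x + 1

## Lower-level version: pull the vector out once, loop over a basic
## vector (all primitive operations), and write the slot back once.
v <- obj@x
for (i in 1:iter) v <- v + 1
fast <- obj
fast@x <- v

stopifnot(all.equal(slow@x, fast@x))  # identical results, less overhead
```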


Gordon Smyth wrote:
> I do lots of analyses on large microarray data sets so memory use and speed
> are both important issues for me. I have been trying to estimate the
> overheads associated with using formal S4 data objects instead of ordinary
> lists for large data objects. In some simple experiments (using R 1.7.1 in
> Windows 2000) with large but simple objects it seems that giving a data
> object a formal class definition and using extractor and assignment
> functions may increase both memory usage and the time taken by simple
> numeric operations by several fold.
> Here is a test function which uses a list representation to add 1 to the
> elements of a long numeric vector:
> addlist <- function(len,iter) {
>     object <- list(x=rnorm(len))
>     for (i in 1:iter) object$x <- object$x+1
>     object
> }
> Typical times on my machine are:
>  > system.time(a <- addlist(10^6,10))
> [1] 2.91 0.00 2.96   NA   NA
>  > system.time(addlist(10^7,10))
> [1] 28.03  0.44 28.65    NA    NA
> Here is a test function doing the same operation with a formal S4 data
> representation:
> addS4 <- function(len,iter) {
>    object <- new("MyClass",x=rnorm(len))
>    for (i in 1:iter) x(object) <- x(object)+1
>    object
> }
> The timing with len=10^6 increases to
>  > system.time(a <- addS4(10^6,10))
> [1] 6.79 0.06 6.90   NA   NA
> With len=10^7 the operation fails altogether due to insufficient memory
> after thrashing around with virtual memory for a very long time.
> I guess I'm not surprised by the performance penalty with S4. My question
> is: is the performance penalty likely to be an ongoing feature of using S4
> methods or will it likely go away in future versions of R?
> Thanks
> Gordon
> Here are my S4 definitions:
> setClass("MyClass",representation(x="numeric"))
> setGeneric("x",function(object) standardGeneric("x"))
> setMethod("x","MyClass",function(object) object@x)
> setGeneric("x<-", function(object, value) standardGeneric("x<-"))
> setReplaceMethod("x","MyClass",function(object,value) {object@x <- value;
> return(object)})
>  > version
>              _
> platform i386-pc-mingw32
> arch     i386
> os       mingw32
> system   i386, mingw32
> status
> major    1
> minor    7.1
> year     2003
> month    06
> day      16
> language R
> ______________________________________________
> R-devel at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-devel

John M. Chambers                  jmc at bell-labs.com
Bell Labs, Lucent Technologies    office: (908)582-2681
700 Mountain Avenue, Room 2C-282  fax:    (908)582-3340
Murray Hill, NJ  07974            web: http://www.cs.bell-labs.com/~jmc
