[Rd] efficiency and memory use of S4 data objects

Gordon Smyth smyth at wehi.edu.au
Fri Aug 22 12:02:58 MEST 2003


Thanks for your thoughtful and considered response, as always. I think I 
need to make my position a little more clear.

I develop software in R which for the most part I'm happy with. For the 
most part my code seems to be correct, fast, reliable, useful for me and 
for other people. But it is mostly either S3 or not oop at all. I am under 
lots of pressure from people I respect and would like to cooperate with to 
convert my code to S4. I am not entirely happy about this because I believe 
that converting to S4 will substantially reduce the size of data set that 
my code can handle and will substantially increase overall execution times. 
(The example in my previous post was not sufficient in itself to prove 
this, but more about that below.) There are other issues such as how to 
document S4 methods and how to pass RCMD check, but I would like to focus 
on the efficiency issue here.

At 01:05 AM 22/08/2003, John Chambers wrote:
>The general question is certainly worth discussing, but I'd be surprised
>if your example is measuring what you think it is.
>
>The numeric computations are almost the only thing NOT radically changed
>between your two examples.  In the first,  you are applying a
>"primitive" set of functions ("+", "$", and "$<-") to a basic vector.
>These functions go directly to C code, without creating a context (aka
>frame) as would a call to an S-language function.  In the second
>example, the "+" will still be done in C code, with essentially no
>change since the arguments will still be basic vectors.
>
>Just about everything else, however, will be different.  If you really
>wanted to focus on the numeric computation, your second example would be
>more relevant with the loop being
>   for(i in 1:iter)object@x <- object@x+1
>In this case, the difference likely will be mainly the overhead for
>"@<-", which is not a primitive.  The example as written is adding a
>layer of functions and method dispatch.

I am sorry for giving the impression that I wanted to focus on the numeric 
computations. Of course it is the efficiency of the S4 classes and methods 
themselves that I am interested in. I deliberately chose an example which 
added a layer of functions and method dispatch, because that is what 
converting code to S4 does.
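
A minimal sketch of how the two layers discussed above could be separated, assuming a current R installation rather than R 1.7.1. It times the list version, the direct-slot loop John suggests, and the accessor-generic version from the original post. The name addslot and the chosen len and iter values are illustrative assumptions, not part of the original post, and the timings will differ from those quoted in this thread.

```r
library(methods)

# Class and accessor generics as given in the original post.
setClass("MyClass", representation(x = "numeric"))
setGeneric("x", function(object) standardGeneric("x"))
setMethod("x", "MyClass", function(object) object@x)
setGeneric("x<-", function(object, value) standardGeneric("x<-"))
setReplaceMethod("x", "MyClass", function(object, value) {
    object@x <- value
    object
})

# Variant 1: plain list, primitives only (from the original post).
addlist <- function(len, iter) {
    object <- list(x = rnorm(len))
    for (i in seq_len(iter)) object$x <- object$x + 1
    object
}

# Variant 2 (hypothetical name): S4 object with direct slot access,
# so the only extra cost is "@" and "@<-".
addslot <- function(len, iter) {
    object <- new("MyClass", x = rnorm(len))
    for (i in seq_len(iter)) object@x <- object@x + 1
    object
}

# Variant 3: S4 object with accessor generics, adding method dispatch
# (from the original post).
addS4 <- function(len, iter) {
    object <- new("MyClass", x = rnorm(len))
    for (i in seq_len(iter)) x(object) <- x(object) + 1
    object
}

# Illustrative sizes only; adjust to the memory available.
len <- 10^6
iter <- 10
print(system.time(addlist(len, iter)))
print(system.time(addslot(len, iter)))
print(system.time(addS4(len, iter)))
```

Comparing variant 2 with variant 1 isolates the cost of slot assignment, and comparing variant 3 with variant 2 isolates the cost of the extra function layer and method dispatch.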

Here is another example with no user-defined methods:

 > system.time( structure(list(x=rep(1,10^7)),class="MyS3Class"))
[1] 1.05 0.00 1.05   NA   NA
 > system.time( new("MyClass",x=rep(1,10^7)))
[1]  3.15  0.34 11.19    NA    NA

This seems to me to show that simply associating a formal S4 data class 
identity with a new object (no computation involved!) can increase the time 
required to create the object 11-fold compared with the S3 equivalent. 11 
seconds is a lot of time if you have a call to "new" at the end of every 
function in a large package, and some of these functions are called a very 
large number of times.
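
A single system.time call can be dominated by garbage collection or paging, so one hedged way to check the comparison above is to repeat each construction several times and compare medians. The sketch below assumes a current R installation; the helper time_median, the constructors make_s3 and make_s4, and the size n are hypothetical names and values chosen for illustration.

```r
library(methods)

# Same class definition as in the original post.
setClass("MyClass", representation(x = "numeric"))

# Hypothetical helper: median elapsed time of calling f() `reps` times.
time_median <- function(f, reps = 5) {
    times <- vapply(seq_len(reps),
                    function(i) system.time(f())[["elapsed"]],
                    numeric(1))
    median(times)
}

# Illustrative size; the post used 10^7.
n <- 10^6

# S3-style object: a list with a class attribute.
make_s3 <- function() structure(list(x = rep(1, n)), class = "MyS3Class")

# S4 object: new() initializes and validates the slots.
make_s4 <- function() new("MyClass", x = rep(1, n))

cat("S3 median elapsed:", time_median(make_s3), "\n")
cat("S4 median elapsed:", time_median(make_s4), "\n")
```

Reporting the user and system components alongside elapsed time would also show how much of any difference is CPU work and how much is waiting on memory.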

>But the lesson we've learned over many years (and with S generally, not
>specifically with methods and classes) is that empirical inference about
>efficiency is a subtle thing (somehow as statisticians you'd think we
>would expect that).  Artificial examples have to be very carefully
>designed and analysed before being taken at face value.

You seem to be suggesting that the effect might be an artifact of my 
particular artificial example, but all examples seem to point in the 
same direction, i.e., that introducing S4 methods into code will slow it 
down. I can't construct any examples of S4 usage which are not at least 
slightly slower than the S3 equivalent. Can anyone else? I am not 
suggesting that I have identified the root cause of any bottlenecks.

>R has some useful tools, especially Rprof, to look at examples in the
>hope of finding "hot spots".  It would be good to see some results,
>especially for realistic examples.

One can see plenty of realistic examples of S4 usage by trying the 
Bioconductor packages. But large realistic examples don't lend themselves 
easily to a post to r-devel.
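
For a small case that does fit in a post, the Rprof suggestion quoted above could be tried on the addS4 test function from the original message. This is only a sketch, assuming a current R build with profiling support; the sampling interval and the len and iter values are arbitrary choices, and very short runs may record no samples at all.

```r
library(methods)

# Class, accessors and test function as given in the original post.
setClass("MyClass", representation(x = "numeric"))
setGeneric("x", function(object) standardGeneric("x"))
setMethod("x", "MyClass", function(object) object@x)
setGeneric("x<-", function(object, value) standardGeneric("x<-"))
setReplaceMethod("x", "MyClass", function(object, value) {
    object@x <- value
    object
})

addS4 <- function(len, iter) {
    object <- new("MyClass", x = rnorm(len))
    for (i in seq_len(iter)) x(object) <- x(object) + 1
    object
}

# Write profiling samples to a temporary file while the test runs.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.005)
result <- addS4(10^6, 50)
Rprof(NULL)

# summaryRprof() can fail if no samples were recorded, so guard the call.
prof <- tryCatch(summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) {
    # Functions ranked by total time, including time spent in callees.
    print(head(prof$by.total, 10))
} else {
    message("No profiling samples recorded; try a larger len or iter.")
}
```

The by.total table indicates whether time is concentrated in dispatch (standardGeneric), in slot assignment, or in the arithmetic itself.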

>Anyway, on the general question.
>
>1.  Yes, there are lots of possibilities for speeding up method dispatch
>& hopefully these will get a chance to be tried out, after 1.8.  But I
>would caution people expecting that method dispatch is the hot spot
>_generally_.  On a couple of occasions, it was because of introduced
>glitches, and then the effect was obvious.  There are some indirect
>costs, such as creating a separate context for the method, and if these
>are shown to be an issue, something might be done.
>
>2. Memory use and the effect on garbage collection:  Not too much has
>been studied here & some good data would be helpful.  (Especially if
>some experts on storage management in R could offer advice.)
>
>3.  It might be more effective (and certainly more fun) to think of
>changes in the context of "modernizing" some of the computations in R
>generally.  There have been several suggestions discussed that in
>principle could speed up method/class computations, along with providing
>other new features.
>
>4. Meanwhile, the traditional S style that has worked well probably
>applies.  First, try out a variety of analyses taking advantage of
>high-level concepts to program quickly.  Then, when it's clear that
>something needs to be applied extensively, try to identify critical
>computations that could be mapped into lower-level versions (maybe even
>C code), getting efficiency by giving up flexibility.

I am already doing all computation-intensive operations in C or Fortran 
through appropriate use of R functions which themselves call C. I can't see 
how use of C can side-step the need to create S4 objects or to use S4 
methods in a package based on S4.

Regards
Gordon

>Regards,
>  John
>
>Gordon Smyth wrote:
> >
> > I do lots of analyses on large microarray data sets so memory use and speed
> > are both important issues for me. I have been trying to estimate the
> > overheads associated with using formal S4 data objects instead of ordinary
> > lists for large data objects. In some simple experiments (using R 1.7.1 in
> > Windows 2000) with large but simple objects it seems that giving a data
> > object a formal class definition and using extractor and assignment
> > functions may increase both memory usage and the time taken by simple
> > numeric operations by several fold.
> >
> > Here is a test function which uses a list representation to add 1 to the
> > elements of a long numeric vector:
> >
> > addlist <- function(len,iter) {
> >     object <- list(x=rnorm(len))
> >     for (i in 1:iter) object$x <- object$x+1
> >     object
> > }
> >
> > Typical times on my machine are:
> >
> >  > system.time(a <- addlist(10^6,10))
> > [1] 2.91 0.00 2.96   NA   NA
> >  > system.time(addlist(10^7,10))
> > [1] 28.03  0.44 28.65    NA    NA
> >
> > Here is a test function doing the same operation with a formal S4 data
> > representation:
> >
> > addS4 <- function(len,iter) {
> >    object <- new("MyClass",x=rnorm(len))
> >    for (i in 1:iter) x(object) <- x(object)+1
> >    object
> > }
> >
> > The timing with len=10^6 increases to
> >
> >  > system.time(a <- addS4(10^6,10))
> > [1] 6.79 0.06 6.90   NA   NA
> >
> > With len=10^7 the operation fails altogether due to insufficient memory
> > after thrashing around with virtual memory for a very long time.
> >
> > I guess I'm not surprised by the performance penalty with S4. My question
> > is: is the performance penalty likely to be an ongoing feature of using S4
> > methods or will it likely go away in future versions of R?
> >
> > Thanks
> > Gordon
> >
> > Here are my S4 definitions:
> >
> > setClass("MyClass",representation(x="numeric"))
> > setGeneric("x",function(object) standardGeneric("x"))
> > setMethod("x","MyClass",function(object) object@x)
> > setGeneric("x<-", function(object, value) standardGeneric("x<-"))
> > setReplaceMethod("x","MyClass",function(object,value) {object@x <- value;
> > return(object)})
> >
> >  > version
> >              _
> > platform i386-pc-mingw32
> > arch     i386
> > os       mingw32
> > system   i386, mingw32
> > status
> > major    1
> > minor    7.1
> > year     2003
> > month    06
> > day      16
> > language R
> >
> > ______________________________________________
> > R-devel at stat.math.ethz.ch mailing list
> > https://www.stat.math.ethz.ch/mailman/listinfo/r-devel
>
>--
>John M. Chambers                  jmc at bell-labs.com
>Bell Labs, Lucent Technologies    office: (908)582-2681
>700 Mountain Avenue, Room 2C-282  fax:    (908)582-3340
>Murray Hill, NJ  07974            web: http://www.cs.bell-labs.com/~jmc


