[Rd] efficiency and memory use of S4 data objects

Fri Aug 22 11:11:46 MEST 2003

General comments.

I didn't mean to criticize you at all, just to point out that making
inferences about what needs fixing from a few naked system times has
proven to be treacherous. 

Take your new example, with its "11-fold" increase:
>  > system.time( structure(list(x=rep(1,10^7),class="MyS3Class")))
> [1] 1.05 0.00 1.05   NA   NA
>  > system.time( new("MyClass",x=rep(1,10^7)))
> [1]  3.15  0.34 11.19    NA    NA

Aside from specific methods-related issues, the second example involves
more layers of function calls.  It's definitely an interesting issue for
efficiency, particularly if the difference is in extra copies of data. 
But where effort should be spent, if there are important improvements
possible, needs other information.

Also, the 11-fold increase is only in elapsed time.  The CPU time increase is a
little over 3 times.   Again, we don't have enough information, but a
guess is that the difference is in program size (given you're generating
a vector of ten million numbers) relative to the hardware and software
configuration of the machine.

Doing what I said we shouldn't do, I'll make a guess at the main
difference.  list() is a primitive, so I suspect it can make up its
value without extra copies.  Your example has "class" as a list element,
not an attribute (did you mean that?), so structure (the only
non-primitive extra in the first version) does essentially nothing.  The
new() call on the other hand does a fair bit of computation, in the
default method for initialize particularly.  It's likely the result is
to make some copies of the "x" slot, maybe needed maybe not.
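
To make the list-element versus attribute point concrete, here is a small
sketch. The class names follow the example above, but the vector length is
cut down and the setClass() definition of "MyClass" is an assumption, since
the original definition was not posted.

```r
library(methods)

# As posted: "class" is an argument to list(), so it becomes a list
# element and structure() has no attributes to set.
a <- structure(list(x = rep(1, 10), class = "MyS3Class"))
names(a)                    # "x" "class"
inherits(a, "MyS3Class")    # FALSE

# Probably intended: "class" is an argument to structure(), so it is
# set as the class attribute of the list.
b <- structure(list(x = rep(1, 10)), class = "MyS3Class")
names(b)                    # "x"
inherits(b, "MyS3Class")    # TRUE

# Assumed S4 definition (not shown in the original post).
setClass("MyClass", representation(x = "numeric"))
obj <- new("MyClass", x = rep(1, 10))
is(obj, "MyClass")          # TRUE
```

With the attribute version, structure() does some real work in the S3 case,
so timings of the two forms are at least comparing like with like.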

It would be great to "fix" these inefficiencies, but the devil is in the
details.  We went through a similar exercise a number of years ago, of
course on a completely different code base.  The experience there was
that several person-months of effort were needed, studying performance
under a number of conditions.  In the end, substantial improvements were
made from roughly a dozen particular hot spots, as I recall.

So far, there seem to be two general areas to investigate: method
dispatch and the effect of added layers of S-language function calls,
particularly on memory size and copying.  I can imagine some substantial
improvements in both.

On the memory size issue, one general strategy worth investigating is to
introduce what I would call "reference" class objects.  These would
inherit from a class "reference", which would signal internal R code to
treat the objects as references, not to be duplicated in the usual way. 
(There are some datatypes with this property already, but not usefully
extensible.)
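
As a rough illustration of the difference in semantics only, the sketch
below contrasts an ordinary vector with an environment, one of the existing
datatypes that is not duplicated in the usual way. It is not the proposed
"reference" class mechanism, whose design is not specified here; the
function names modify_vector and modify_env are hypothetical.

```r
# Ordinary vectors have value semantics: the function works on its own
# copy and the caller's object is unchanged.
v <- c(1, 2, 3)
modify_vector <- function(z) {
  z[1] <- 99
  z
}
w <- modify_vector(v)
v[1]                        # 1
w[1]                        # 99

# Environments are passed by reference: the change made inside the
# function is visible to the caller.
e <- new.env()
assign("x", c(1, 2, 3), envir = e)
modify_env <- function(env) {
  x <- get("x", envir = env)
  x[1] <- 99
  assign("x", x, envir = env)
  invisible(NULL)
}
modify_env(e)
get("x", envir = e)[1]      # 99
```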

Very large datasets, particularly of specialized types of data, could
benefit from class definitions extending "reference".  However, because
the semantics of these classes would be radically different from
ordinary data, the classes need to be carefully insulated from being
handed to arbitrary S-language functions as ordinary vectors, for
example.

Meanwhile, each question of converting existing code needs to balance
benefits, such as clearer design and more understandable software,
against conversion effort and possible extra computations.   As I've
said before, one appealing strategy to me at least is to plan on
converting when there is a motivation for some significant design
improvements.  (An example, probably controversial, is the general
"statistical software for models" area.  There are many examples there
where formal classes and methods could be substantially more powerful
than the old "white book" code.  But personally I think that a
conversion needs to be part of a serious redesign of model software, a
major project.)

Regards,
 John

Gordon Smyth wrote:
> 
> Thanks for your thoughtful and considered response, as always. I think I
> need to make my position a little more clear.
> 
> I develop software in R which for the most part I'm happy with. For the
> most part my code seems to be correct, fast, reliable, useful for me and
> for other people. But it is mostly either S3 or not OOP at all. I am under
> lots of pressure from people I respect and would like to cooperate with to
> convert my code to S4. I am not entirely happy about this because I believe
> that converting to S4 will substantially reduce the size of data set that
> my code can handle and will substantially increase overall execution times.
> (The example in my previous post was not sufficient in itself to prove
> this, but more about that below.) There are other issues such as how to
> document S4 methods and how to pass RCMD check, but I would like to focus
> on the efficiency issue here.
> 
> At 01:05 AM 22/08/2003, John Chambers wrote:
> >The general question is certainly worth discussing, but I'd be surprised
> >if your example is measuring what you think it is.
> >
> >The numeric computations are almost the only thing NOT radically changed
> >between your two examples.  In the first,  you are applying a
> >"primitive" set of functions ("+", "$", and "$<-") to a basic vector.
> >These functions go directly to C code, without creating a context (aka
> >frame) as would a call to an S-language function.  In the second
> >example, the "+" will still be done in C code, with essentially no
> >change since the arguments will still be basic vectors.
> >
> >Just about everything else, however, will be different.  If you really
> >wanted to focus on the numeric computation, your second example would be
> >more relevant with the loop being
> >   for(i in 1:iter) object@x <- object@x+1
> >In this case, the difference likely will be mainly the overhead for
> >"@<-", which is not a primitive.  The example as written is adding a
> >layer of functions and method dispatch.
> 
> I am sorry for giving the impression that I wanted to focus on the numeric
> computations. Of course it is the efficiency of the S4 classes and methods
> themselves that I am interested in. I deliberately chose an example which
> added a layer of functions and method dispatch, because that is what
> converting code to S4 does.
> 
> Here is another example with no user-defined methods:
> 
>  > system.time( structure(list(x=rep(1,10^7),class="MyS3Class")))
> [1] 1.05 0.00 1.05   NA   NA
>  > system.time( new("MyClass",x=rep(1,10^7)))
> [1]  3.15  0.34 11.19    NA    NA
> 
> This seems to me to show that simply associating a formal S4 data class
> identity with a new object (no computation involved!) can increase the time
> required to create the object 11-fold compared with the S3 equivalent. 11
> seconds is a lot of time if you have a call to "new" at the end of every
> function in a large package, and some of these functions are called a very
> large number of times.
> 
> >But the lesson we've learned over many years (and with S generally, not
> >specifically with methods and classes) is that empirical inference about
> >efficiency is a subtle thing (somehow as statisticians you'd think we
> >would expect that).  Artificial examples have to be very carefully
> >designed and analysed before being taken at face value.
> 
> You seem to be suggesting that the effect might be an artifact of my
> particular artificial example, but all examples seem to point in the
> same direction, i.e., that introducing S4 methods into code will slow it
> down. I can't construct any examples of S4 usage which are not at least
> slightly slower than the S3 equivalent. Can anyone else? I am not
> suggesting that I have identified the root cause of any bottlenecks.
> 
> >R has some useful tools, especially Rprof, to look at examples in the
> >hope of finding "hot spots".  It would be good to see some results,
> >especially for realistic examples.
> 
> One can see plenty of realistic examples of S4 usage by trying the
> Bioconductor packages. But large realistic examples don't lend themselves
> easily to a post to r-devel.
> 
> >Anyway, on the general question.
> >
> >1.  Yes, there are lots of possibilities for speeding up method dispatch
> >& hopefully these will get a chance to be tried out, after 1.8.  But I
> >would caution people expecting that method dispatch is the hot spot
> >_generally_.  On a couple of occasions, it was because of introduced
> >glitches, and then the effect was obvious.  There are some indirect
> >costs, such as creating a separate context for the method, and if these
> >are shown to be an issue, something might be done.
> >
> >2. Memory use and the effect on garbage collection:  Not too much has
> >been studied here & some good data would be helpful.  (Especially if
> >some experts on storage management in R could offer advice.)
> >
> >3.  It might be more effective (and certainly more fun) to think of
> >changes in the context of "modernizing" some of the computations in R
> >generally.  There have been several suggestions discussed that in
> >principle could speed up method/class computations, along with providing
> >other new features.
> >
> >4. Meanwhile, the traditional S style that has worked well probably
> >applies.  First, try out a variety of analyses taking advantage of
> >high-level concepts to program quickly.  Then, when it's clear that
> >something needs to be applied extensively, try to identify critical
> >computations that could be mapped into lower-level versions (maybe even
> >C code), getting efficiency by giving up flexibility.
> 
> I am already doing all computation-intensive operations in C or Fortran
> through appropriate use of R functions which themselves call C. I can't see
> how use of C can side-step the need to create S4 objects or to use S4
> methods in a package based on S4.
> 
> Regards
> Gordon
> 
> >Regards,
> >  John

-- 
John M. Chambers                  jmc at bell-labs.com
Bell Labs, Lucent Technologies    office: (908)582-2681
700 Mountain Avenue, Room 2C-282  fax:    (908)582-3340
Murray Hill, NJ  07974            web: http://www.cs.bell-labs.com/~jmc