[Rd] efficiency and memory use of S4 data objects

Sun Aug 24 16:30:27 MEST 2003

Thanks for your comments again.  Your remarks on how performance problems
might be attacked help me understand where R is likely to go with this and
put me in a better position to make strategic decisions about my own code.

I would love to have a discussion, at an appropriate time, about
possibilities for redesign of model software.  It would also be
interesting to take part in a discussion about when formal
object-orientated constructs such as S4 do make code clearer and easier to
understand and when they don't.  I have some views on this, which I won't
inflict on the mailing list!

I can't resist contributing one more observation re empirical timings.  It
is interesting that the function defined by

test1 <- function() {
	y <- new("MyClass")
	y@x0 <- rep(1,10^6)
	y@x1 <- rep(1,10^6)
	y@x2 <- rep(1,10^6)
	y@x3 <- rep(1,10^6)
	y@x4 <- rep(1,10^6)
	y@x5 <- rep(1,10^6)
	y@x6 <- rep(1,10^6)
	y@x7 <- rep(1,10^6)
	y@x8 <- rep(1,10^6)
	y@x9 <- rep(1,10^6)
	y
}

executes much faster than

test2 <- function() {
	new("MyClass",
	x0 = rep(1,10^6),
	x1 = rep(1,10^6),
	x2 = rep(1,10^6),
	x3 = rep(1,10^6),
	x4 = rep(1,10^6),
	x5 = rep(1,10^6),
	x6 = rep(1,10^6),
	x7 = rep(1,10^6),
	x8 = rep(1,10^6),
	x9 = rep(1,10^6))
}

with class "MyClass" defined in the obvious way.  It seems to be better to
create an empty S4 class object and then set the slots rather than to
create the object all at once using "new".  This is opposite to the rule
with ordinary list components where it seems generally better to create
the entire list at once.
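
A minimal, self-contained sketch of this comparison follows. The class definition is an assumption (ten numeric slots named x0..x9, since the post only says "defined in the obvious way"), and the helper names fill_after and fill_in_new are hypothetical compact stand-ins for test1 and test2. They use a loop and do.call() instead of ten written-out lines, so argument evaluation differs slightly from the originals. Timings will vary with the R version and machine, so none are shown.

```r
library(methods)

# Assumption: "MyClass" has ten numeric slots named x0..x9.
slot_names <- paste0("x", 0:9)
setClass("MyClass", slots = setNames(rep("numeric", 10), slot_names))

# Style of test1: create an empty object, then assign the slots one by one.
fill_after <- function(n) {
  y <- new("MyClass")
  for (s in slot_names) slot(y, s) <- rep(1, n)
  y
}

# Style of test2: pass every slot to new() in a single call.
fill_in_new <- function(n) {
  vals <- setNames(lapply(slot_names, function(s) rep(1, n)), slot_names)
  do.call("new", c(list("MyClass"), vals))
}

n <- 10^5  # the post uses 10^6; smaller here so the sketch runs quickly
t_after <- system.time(a <- fill_after(n))
t_new   <- system.time(b <- fill_in_new(n))

# Elapsed seconds for each style.
print(c(fill_after = t_after[["elapsed"]], fill_in_new = t_new[["elapsed"]]))

# Check that both styles build the same object.
identical(a, b)
```

Repeating each call several times and raising n back to 10^6 gives a closer match to the timings discussed here.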

Regards
Gordon

> General comments.
>
> I didn't mean to criticize you at all, just to point out that making
> inferences about what needs fixing from a few naked system times has
> proven to be treacherous.
>
> Take your new example, with its "11-fold" increase:
>>  > system.time( structure(list(x=rep(1,10^7),class="MyS3Class")))
>> [1] 1.05 0.00 1.05   NA   NA
>>  > system.time( new("MyClass",x=rep(1,10^7)))
>> [1]  3.15  0.34 11.19    NA    NA
>
> Aside from specific methods-related issues, the second example involves
> more layers of function calls.  It's definitely an interesting issue for
> efficiency, particularly if the difference is in extra copies of data.
> But where effort should be spent, if there are important improvements
> possible, needs other information.

Absolutely.  System timings are the ultimate measure of the user
experience, but measuring the user experience and figuring out how to
improve it aren't the same thing.
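
One way to move from raw timings toward "where is the time spent" is the Rprof profiler that John mentions further down in the quoted thread. The sketch below is illustrative only: the one-slot class definition is an assumption chosen to match the new("MyClass", x = ...) call, the loop count and sampling interval are arbitrary, and the profile contents depend on the R version and machine.

```r
library(methods)

# Assumption: a one-slot class matching the new("MyClass", x = ...) call.
setClass("MyClass", slots = c(x = "numeric"))

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)   # start sampling the call stack
for (i in 1:200) obj <- new("MyClass", x = rep(1, 10^6))
Rprof(NULL)                         # stop profiling

# Time attributed to each function itself (not its callees).
prof <- summaryRprof(prof_file)
head(prof$by.self)
```

Functions that rank high in by.self are candidates for the extra copying or dispatch overhead discussed in this thread, but a single profile of an artificial loop is only a starting point.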

> Also, the 11-fold is only in elapsed time.  The cpu time increase is a
> little over 3 times.   Again, we don't have enough information, but a
> guess is that the difference is in program size (given you're generating
> a vector of ten million numbers) relative to the hardware and software
> configuration of the machine.
>
> Doing what I said we shouldn't do, I'll make a guess at the main
> difference.  list() is a primitive, so I suspect it can make up its
> value without extra copies.  Your example has "class" as a list element,
> not an attribute (did you mean that?),

No, sorry, my mistake - a bracket in the wrong place.  If I put the
bracket where it should be, so that a 'class' attribute is attached to the
object, then the S3 code is slower than before but still quicker than the
S4.
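
The effect of the misplaced bracket can be seen in a small sketch (the variable names a and b and the shorter vector length are illustrative choices, not from the original post):

```r
n <- 10^5  # shorter than the 10^7 used in the post, to keep the sketch quick

# As originally posted: "class" becomes a second list element, and
# structure() is called with no attributes to set.
a <- structure(list(x = rep(1, n), class = "MyS3Class"))

# As intended: "class" is an argument to structure(), so the class
# attribute is attached to the list.
b <- structure(list(x = rep(1, n)), class = "MyS3Class")

names(a)   # "x" "class"
class(a)   # "list"
class(b)   # "MyS3Class"
```

Only the second form produces an S3 object of class "MyS3Class", so it is the fairer comparison with new("MyClass", ...).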

> so structure (the only
> non-primitive extra in the first version) does essentially nothing.  The
> new() call on the other hand does a fair bit of computation, in the
> default method for initialize particularly.  It's likely the result is
> to make some copies of the "x" slot, maybe needed maybe not.
>
> It would be great to "fix" these inefficiencies, but the devil is in the
> details.  We went through a similar exercise a number of years ago, of
> course on a completely different code base.  The experience there was
> that several person-months of effort were needed, studying performance
> under a number of conditions.  In the end, substantial improvements were
> made from roughly a dozen particular hot spots, as I recall.
>
> So far, there seem to be two general areas to investigate: method
> dispatch and the effect of added layers of S-language function calls,
> particularly on memory size and copying.  I can imagine some substantial
> improvements in both.
>
> On the memory size issue, one general strategy worth investigating is to
> introduce what I would call "reference" class objects.  These would
> inherit from a class "reference", which would signal internal R code to
> treat the objects as references, not to be duplicated in the usual way.
> (There are some datatypes with this property already, but not usefully
> extensible.)
>
> Very large datasets, particularly of specialized types of data, could
> benefit from class definitions extending "reference".  However, because
> the semantics of these classes would be radically different from
> ordinary data, the classes need to be carefully insulated from being
> handed to arbitrary S-language functions as ordinary vectors, for
> example.
>
> Meanwhile, each question of converting existing code needs to balance
> benefits, such as clearer design and more understandable software,
> against conversion effort and possible extra computations.   As I've
> said before, one appealing strategy to me at least is to plan on
> converting when there is a motivation for some significant design
> improvements.  (An example, probably controversial, is the general
> "statistical software for models" area.  There are many examples there
> where formal classes and methods could be substantially more powerful
> than the old "white book" code.  But personally I think that a
> conversion needs to be part of a serious redesign of model software, a
> major project.)
>
> Regards,
>  John
>
> Gordon Smyth wrote:
>>
>> Thanks for your thoughtful and considered response, as always. I think
>> I need to make my position a little more clear.
>>
>> I develop software in R which for the most part I'm happy with. For
>> the most part my code seems to be correct, fast, reliable, useful for
>> me and for other people. But it is mostly either S3 or not OOP at
>> all. I am under lots of pressure from people I respect and would like
>> to cooperate with to convert my code to S4. I am not entirely happy
>> about this because I believe that converting to S4 will substantially
>> reduce the size of data set that my code can handle and will
>> substantially increase overall execution times. (The example in my
>> previous post was not sufficient in itself to prove this, but more
>> about that below.) There are other issues such as how to document S4
>> methods and how to pass RCMD check, but I would like to focus on the
>> efficiency issue here.
>>
>> At 01:05 AM 22/08/2003, John Chambers wrote:
>> >The general question is certainly worth discussing, but I'd be
>> surprised if your example is measuring what you think it is.
>> >
>> >The numeric computations are almost the only thing NOT radically
>> changed between your two examples.  In the first,  you are applying a
>> >"primitive" set of functions ("+", "$", and "$<-") to a basic vector.
>> These functions go directly to C code, without creating a context
>> (aka frame) as would a call to an S-language function.  In the second
>> example, the "+" will still be done in C code, with essentially no
>> change since the arguments will still be basic vectors.
>> >
>> >Just about everything else, however, will be different.  If you
>> really wanted to focus on the numeric computation, your second
>> example would be more relevant with the loop being
>> >   for(i in 1:iter)object@x <- object@x+1
>> >In this case, the difference likely will be mainly the overhead for
>> "@<-", which is not a primitive.  The example as written is adding a
>> layer of functions and method dispatch.
>>
>> I am sorry for giving the impression that I wanted to focus on the
>> numeric computations. Of course it is the efficiency of the S4 classes
>> and methods themselves that I am interested in. I deliberately chose
>> an example which added a layer of functions and method dispatch,
>> because that is what converting code to S4 does.
>>
>> Here is another example with no user-defined methods:
>>
>>  > system.time( structure(list(x=rep(1,10^7),class="MyS3Class")))
>> [1] 1.05 0.00 1.05   NA   NA
>>  > system.time( new("MyClass",x=rep(1,10^7)))
>> [1]  3.15  0.34 11.19    NA    NA
>>
>> This seems to me to show that simply associating a formal S4 data
>> class identity with a new object (no computation involved!) can
>> increase the time required to create the object 11-fold compared with
>> the S3 equivalent. 11 seconds is a lot of time if you have a call to
>> "new" at the end of every function in a large package, and some of
>> these functions are called a very large number of times.
>>
>> >But the lesson we've learned over many years (and with S generally,
>> not specifically with methods and classes) is that empirical
>> inference about efficiency is a subtle thing (somehow as
>> statisticians you'd think we would expect that).  Artificial examples
>> have to be very carefully designed and analysed before being taken at
>> face value.
>>
>> You seem to be suggesting that the effect might be an artifact of my
>> particular artificial example, but all examples seem to point in
>> the same direction, i.e., that introducing S4 methods into code will
>> slow it down. I can't construct any examples of S4 usage which are not
>> at least slightly slower than the S3 equivalent. Can anyone else? I am
>> not suggesting that I have identified the root cause of any bottlenecks.
>>
>> >R has some useful tools, especially Rprof, to look at examples in the
>> hope of finding "hot spots".  It would be good to see some results,
>> especially for realistic examples.
>>
>> One can see plenty of realistic examples of S4 usage by trying the
>> Bioconductor packages. But large realistic examples don't lend
>> themselves easily to a post to r-devel.
>>
>> >Anyway, on the general question.
>> >
>> >1.  Yes, there are lots of possibilities for speeding up method
>> dispatch & hopefully these will get a chance to be tried out, after
>> 1.8.  But I would caution people expecting that method dispatch is
>> the hot spot _generally_.  On a couple of occasions, it was because
>> of introduced glitches, and then the effect was obvious.  There are
>> some indirect costs, such as creating a separate context for the
>> method, and if these are shown to be an issue, something might be
>> done.
>> >
>> >2. Memory use and the effect on garbage collection:  Not too much has
>> been studied here & some good data would be helpful.  (Especially if
>> some experts on storage management in R could offer advice.)
>> >
>> >3.  It might be more effective (and certainly more fun) to think of
>> changes in the context of "modernizing" some of the computations in R
>> generally.  There have been several suggestions discussed that in
>> principle could speed up method/class computations, along with
>> providing other new features.
>> >
>> >4. Meanwhile, the traditional S style that has worked well probably
>> applies.  First, try out a variety of analyses taking advantage of
>> high-level concepts to program quickly.  Then, when it's clear that
>> something needs to be applied extensively, try to identify critical
>> computations that could be mapped into lower-level versions (maybe
>> even C code), getting efficiency by giving up flexibility.
>>
>> I am already doing all computation-intensive operations in C or
>> Fortran through appropriate use of R functions which themselves call
>> C. I can't see how use of C can side-step the need to create S4
>> objects or to use S4 methods in a package based on S4.
>>
>> Regards
>> Gordon
>>
>> >Regards,
>> >  John
> ..................
>
> --
> John M. Chambers                  jmc at bell-labs.com
> Bell Labs, Lucent Technologies    office: (908)582-2681
> 700 Mountain Avenue, Room 2C-282  fax:    (908)582-3340
> Murray Hill, NJ  07974            web: http://www.cs.bell-labs.com/~jmc