[Rd] modifying large R objects in place

Luke Tierney luke at stat.uiowa.edu
Fri Sep 28 15:14:45 CEST 2007


On Fri, 28 Sep 2007, Petr Savicky wrote:

> On Fri, Sep 28, 2007 at 12:39:30AM +0200, Peter Dalgaard wrote:
> [...]
>>> nrow <- function(...) dim(...)[1]
>>> ncol <- function(...) dim(...)[2]
>>>
>>> At least in my environment, the new versions preserved NAMED == 1.

I believe this is a bug in the evaluation of ... arguments.  The
intent in the code is, I believe, that all promise evaluations result
in NAMED == 2 for safety.  That may be overly conservative, but I
would not want to change it without some very careful thought -- I
prefer to wait a little longer for the right answer than to get a
wrong one quickly.
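
For anyone who wants to observe this, here is a minimal sketch; it
assumes an R build configured with --enable-memory-profiling, so that
tracemem() reports copies, and the NAMED rules described above:

x <- matrix(0, 1000, 1000)
tracemem(x)
x[1, 1] <- 1   # no tracemem output: x is unshared, modified in place
nrow(x)        # ordinary closure call; forcing the promise sets
               # NAMED(x) to 2
x[2, 2] <- 2   # tracemem now reports a copy of the whole matrix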

>>>
>> Yes, but changing the formal arguments is a bit messy, is it not?
>
> Specifically for nrow and ncol, I think not, since almost nobody needs
> to know (or even knows) that the name of the formal argument is "x".
>
> However, there is another argument against the ... solution: it solves
> the problem only in the simplest cases like nrow and ncol, but is not
> usable in others, such as colSums and rowSums. These functions also
> increase the NAMED value of their argument, although their output does
> not contain any reference to the original content of their argument.
>
> I think that a systematic solution to this problem would be helpful.
> However, making these functions .Internal or .Primitive would not be
> good in my opinion. It is advantageous that these functions contain an
> R-level part, which makes the basic decisions before a call to
> .Internal. If nothing else, this serves as a sort of documentation.
>
> For my purposes, I replaced the calls to "colSums" and "matrix" with
> the corresponding calls to .Internal in my script. The result is that
> now I can complete several runs of my calculation in a loop instead
> of restarting R after each run.
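
For concreteness, a sketch of the kind of rewrite described above
(assuming the .Internal signature in the current sources -- .Internal
calls are unsupported and can change without notice):

m <- matrix(0, 1000, 1000)
d <- dim(m)    # dim() is primitive, so NAMED of m stays at 1
s <- .Internal(colSums(m, d[1], d[2], FALSE))
m[1, 1] <- 1   # no copy: m never passed through an R-level closure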
>
> This leads me to a question. Some of the tests I did suggest that
> gc() may not free all the memory, even if I remove all data objects
> with rm() before calling gc(). Is this possible, or have I missed
> something?

Not impossible, but very unlikely given the use gc gets.  There are a
few internal tables that are grown but not shrunk at the moment, but
that should not usually cause much total growth.  If you are looking
at system memory use, then that is a malloc issue -- there was a
thread about this a month or so ago.
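
A quick way to see the distinction is to compare R's own accounting
with what the operating system reports (the sizes here are arbitrary):

gc()                # note the "used" column for Ncells/Vcells
x <- numeric(1e7)   # roughly 80 MB of doubles
rm(x)
gc()                # "used" drops back down; the process size the OS
                    # reports may stay higher -- that is the malloc
                    # issue mentioned above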

> A possible solution to the unwanted increase of NAMED due to temporary
> calculations could be to give the user the possibility of storing the
> NAMED value of an object before a call to a function and restoring it
> after the call. To use this, the user would have to be confident that
> no new reference to the object persists after the function completes.

This would be too dangerous for general use. Some more structured
approach may be possible. A related issue is that user-defined
assignment functions always see a NAMED of 2 and hence cannot modify
in place. We've been trying to come up with a reasonable solution to
this, so far without success, but I'm moderately hopeful.
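
The replacement function problem is easy to reproduce; setfirst<-
below is just a made-up example, and tracemem() again needs a build
with memory profiling enabled:

`setfirst<-` <- function(x, value) { x[1] <- value; x }
y <- numeric(1e6)
tracemem(y)
setfirst(y) <- 1   # sugar for y <- `setfirst<-`(y, 1); inside the
                   # function x has NAMED == 2, so even this single
                   # element assignment duplicates the whole vector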

>> Presumably, nrow <- function(x) eval.parent(substitute(dim(x)[1])) works
>> too, but if the gain is important enough to warrant that sort of
>> programming, you might as well make nrow a .Primitive.
>
> You are right. This indeed works.
>
>> Longer-term, I still have some hope for better reference counting, but
>> the semantics of environments make it really ugly -- an environment can
>> contain an object that contains the environment, a simple example being
>>
>> f <- function()
>>    g <- function() 0
>> f()
>>
>> At the end of f(), we should decide whether to destroy f's evaluation
>> environment. In the present example, what we need to be able to see is
>> that this would remove all references to g and that the reference from g
>> to f can therefore be ignored.  Complete logic for sorting this out is
>> basically equivalent to a new garbage collector, and one can suspect
>> that applying the logic upon every function return is going to be
>> terribly inefficient. However, partial heuristics might apply.
>
> I have to say that I do not quite understand the example.
> What are the input and output of f? Is g only defined inside,
> or also used?
>
> Let me ask the following question. I assume that gc() scans the whole
> memory and determines, for each piece of data, whether a reference
> to it still exists. In my understanding, this is equivalent to
> determining whether its NAMED may be dropped to zero.
> Structures for which this succeeds are then removed. Am I right?
> If yes, would it be possible during gc() to determine also the cases
> where NAMED may be dropped from 2 to 1? How much would this increase
> the complexity of gc()?

Probably not impossible, but it would be a fair bit of work for
probably not much gain: the NAMED values would remain high until the
next gc of the appropriate level. Since an object being modified is
likely to be older, that gc will probably be a fair time away, while
the interval in which there would be a benefit is short.

The basic functional model that underlies having the illusion of
non-modifiable vector data does not fit all that well with an
imperative style of modifying things in loops. It might be useful to
bring in some constructs from functional programming that are designed
to allow in-place modification to coexist with functional semantics.
Probably a longer term issue.
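
In the meantime the imperative style does work under the current
NAMED rules, as long as no new reference to the object is created
inside the loop:

x <- numeric(1e6)
for (i in 1:100) {
    x[i] <- i   # updates in place while NAMED(x) <= 1
    # any closure call taking x as an argument here would force the
    # next x[i] <- i to copy the whole vector first
}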

For now there are limits to what we can reasonably, and maintainably,
do in an interpreted R.  Having full reference counts might help, but
might not because of other costs involved (significant increases in
cache misses in particular); in any case it would probably be easier
to rewrite R from scratch than to retrofit full reference counting to
what we have, so I can't see it happening any time soon.  It also
doesn't help with many things, like user-level assignment: there
really are two references at the key point in that case.  With
compilation it may be possible to do some memory-use analysis and
work out when it is safe to do destructive modification, but that is
a fair way off as well.

Best,

luke


>
> Thank you in advance for your kind reply.
>
> Petr Savicky.
>

-- 
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:      luke at stat.uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu


