[Rd] Assigning NULL to large variables is much faster than rm() - any reason why I should still use rm()?

Henrik Bengtsson hb at biostat.ucsf.edu
Sun May 26 02:10:59 CEST 2013


On Sat, May 25, 2013 at 4:38 PM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
> On May 25, 2013, at 3:48 PM, Henrik Bengtsson wrote:
>
>> Hi,
>>
>> in my packages/functions/code I tend to remove large temporary
>> variables as soon as possible, e.g. large intermediate vectors used in
>> iterations.  I sometimes also have the habit of doing this to make it
>> explicit in the source code when a temporary object is no longer
>> needed.  However, I did notice that this can add a noticeable overhead
>> when the rest of the iteration step does not take that much time.
>>
>> Trying to speed this up, I first noticed that rm(list="a") is much
>> faster than rm(a).  While at it, I realized that for the purpose of
>> keeping the memory footprint small, I can equally well reassign the
>> variable the value of a small object (e.g. a <- NULL), which is
>> significantly faster than using rm().
>>
>
> Yes, as you probably noticed, rm() is quite a complex function because it has to deal with different ways of specifying its input, etc.
> When you remove that overhead (by calling .Internal(remove("a", parent.frame(), FALSE))), you get the same performance as the assignment.
> If you really want to go overboard, you can define your own function:
>
> SEXP rm_C(SEXP x, SEXP rho) { setVar(x, R_UnboundValue, rho); return R_NilValue; }
> poof <- function(x) .Call(rm_C, substitute(x), parent.frame())
>
> That will be faster than anything else (mainly because it avoids the trip through strings as it can use the symbol directly).

Thanks for this one.  This is useful - I did try to follow what
.Internal(remove(...)) does, but got lost in the internal structures.
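
For anyone who wants to try this, here are minimal sketches of both
approaches.  First, the .Internal() route can be wrapped directly in R
(a sketch; the name 'fastrm' is my own choice):

fastrm <- function(name, envir=parent.frame()) .Internal(remove(name, envir, FALSE))

a <- 1:10
fastrm("a")
exists("a", inherits=FALSE)  # FALSE

And Simon's C version, assuming his one-liner above is saved to a file
rm_C.c and compiled with 'R CMD SHLIB rm_C.c' (the file name and
looking the symbol up by string are my own choices):

dyn.load(paste0("rm_C", .Platform$dynlib.ext))  # rm_C.so / rm_C.dll
poof <- function(x) .Call("rm_C", substitute(x), parent.frame())

b <- 1:10
poof(b)
exists("b", inherits=FALSE)  # FALSE - the binding is gone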

Of course, I'd love to see such a function in 'base' itself.  Having
such a well-defined and narrow function for removing a variable in the
current environment may also be useful for 'codetools'/'R CMD check',
so that code inspection can detect uses of variables that were defined
earlier but have since been removed.  Technically rm() allows for that
too, but I can see how such a task quickly gets complicated when the
arguments 'list', 'envir' and 'inherits' are involved.

>
> But as Bill noted - in practice I'd recommend using either local() or functions to control the scope - using rm() or assignments seems too error-prone to me.

I didn't mention it, but another reason I use rm() a lot is actually
so that R can catch my programming mistakes (I'm maintaining 100,000+
lines of code), i.e. the opposite of being error-prone.  For instance,
by doing rm(tmp) as soon as possible, R will give me the run-time
error "Error: object 'tmp' not found" in case I use it by mistake
later on.  As said above, codetools/'R CMD check' could potentially
detect this already at check time.  With tmp <- NULL I'll lose a bit
of this protection, although another run-time error is likely to occur
a bit later.
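
A toy illustration of what I mean (note that sum(NULL) is silently zero):

tmp <- 1:10
rm(tmp)
sum(tmp)   # Error: object 'tmp' not found - caught immediately

tmp <- 1:10
tmp <- NULL
sum(tmp)   # 0 - no error here; the mistake surfaces later, if at all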

Using local() or local functions is obviously an alternative to the
above, as in the sketch below.
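
For example, applied to the toy example further down ('x' and 'k' as
in the benchmark), the temporary never enters the calling frame:

colSum <- local({
  a <- x[,k]   # 'a' exists only inside this local() block
  sum(a)
})             # 'a' goes out of scope here and can be collected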

Thanks both (and sorry about the game - though it was an entertaining one)

/Henrik

>
> Cheers,
> Simon
>
>
>
>> SOME BENCHMARKS:
>> A toy example imitating an iterative algorithm with "large" temporary objects.
>>
>> x <- matrix(rnorm(100e6), ncol=10e3)
>>
>> t1 <- system.time(for (k in 1:ncol(x)) {
>>  a <- x[,k]
>>  colSum <- sum(a)
>>  rm(a) # Not needed anymore
>>  b <- x[k,]
>>  rowSum <- sum(b)
>>  rm(b) # Not needed anymore
>> })
>>
>> t2 <- system.time(for (k in 1:ncol(x)) {
>>  a <- x[,k]
>>  colSum <- sum(a)
>>  rm(list="a") # Not needed anymore
>>  b <- x[k,]
>>  rowSum <- sum(b)
>>  rm(list="b") # Not needed anymore
>> })
>>
>> t3 <- system.time(for (k in 1:ncol(x)) {
>>  a <- x[,k]
>>  colSum <- sum(a)
>>  a <- NULL # Not needed anymore
>>  b <- x[k,]
>>  rowSum <- sum(b)
>>  b <- NULL # Not needed anymore
>> })
>>
>>> t1
>>   user  system elapsed
>>   8.03    0.00    8.08
>>> t1/t2
>>    user   system  elapsed
>> 1.322900 0.000000 1.320261
>>> t1/t3
>>    user   system  elapsed
>> 1.715812 0.000000 1.662551
>>
>>
>> Is there a reason why I shouldn't assign NULL instead of using rm()?
>> As far as I understand it, the garbage collector will be equally
>> efficient cleaning out the previous object when using rm(a) or a <-
>> NULL.  Is there anything else I'm overlooking?  Am I adding overhead
>> somewhere else?
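>>
>> A quick way to sanity-check this is to watch gc() output after each
>> variant (a sketch; the exact numbers will differ by platform):
>>
>> a <- rnorm(10e6)
>> rm(a); gc()      # Vcells "used" drops back down
>>
>> a <- rnorm(10e6)
>> a <- NULL; gc()  # drops by the same amount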
>>
>> /Henrik
>>
>>
>> PS. With the above toy example one can obviously be a bit smarter by using:
>>
>> t4 <- system.time({for (k in 1:ncol(x)) {
>>  a <- x[,k]
>>  colSum <- sum(a)
>>  a <- x[k,]
>>  rowSum <- sum(a)
>> }
>> rm(list="a")
>> })
>>
>> but that's not my point.
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>>
>


