[Rd] Suggestion for memory optimization and as.double() with friends

Prof Brian Ripley ripley at stats.ox.ac.uk
Thu Mar 29 04:11:18 CEST 2007


On Wed, 28 Mar 2007, Duncan Murdoch wrote:

> On 3/28/2007 5:25 PM, Henrik Bengtsson wrote:
>> Hi,
>>
>> when doing as.double() on an object that is already a double, the
>> object seems to be copied internally, doubling the memory requirement.
>>  See example below.  Same for as.character() etc.  Is this intended?
>>
>> Example:
>>
>> % R --vanilla
>> > x <- double(1e7)
>> > gc()
>>            used (Mb) gc trigger (Mb) max used (Mb)
>> Ncells   234019  6.3     467875 12.5   350000  9.4
>> Vcells 10103774 77.1   11476770 87.6 10104223 77.1
>> > x <- as.double(x)
>> > gc()
>>            used (Mb) gc trigger  (Mb) max used  (Mb)
>> Ncells   234113  6.3     467875  12.5   350000   9.4
>> Vcells 10103790 77.1   21354156 163.0 20103818 153.4
>>
>> However, couldn't this easily be avoided by letting as.double() return
>> the object as is if already a double?
>
> as.double calls the internal as.vector, which also strips off
> attributes.  But in the case where the output is identical to the input,
> this does seem like an easy optimization.  I don't know if it would help
> most people, but it might help in the kinds of cases you mention.

The cases mentioned already copy each vector on the way into .Fortran and 
again on the way back out, so saving one copy will not be a big gain.  The 
(known) problem is the use of .C/.Fortran with large vectors, not 
as.double, and in the smooth.spline example the vectors will not be large 
in the intended usage.

The usual 'trick' to avoid this copy is

storage.mode(x) <- "double"

if you do not need the attributes stripped (unlike as.double, this 
assignment keeps them).
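
As a minimal sketch of the difference between the two idioms (the variable 
names are illustrative and not from the thread; only base R is assumed), 
the following compares what each one does to a names attribute:

```r
# Illustrative only: an integer vector carrying a names attribute.
x <- c(a = 1L, b = 2L, c = 3L)

# as.double() converts the type and drops attributes, including names.
y <- as.double(x)

# Assigning the storage mode converts the type and keeps the attributes.
z <- x
storage.mode(z) <- "double"

print(names(y))   # NULL
print(names(z))   # "a" "b" "c"
print(typeof(z))  # "double"
```

Whether either form copies an object that is already double depends on the 
R version, so memory behaviour is best checked with gc() on the version in 
use.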

I have looked at this (internal) optimization before and not found any 
real-life problems where it seemed important.  People should expect to 
profile a real example and find as.vector taking an appreciable part of 
the time *before* spending developer time on speculative optimizations.
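
A rough sketch of that kind of check, using the sampling profiler in the 
utils package.  The function prep_and_sum and the vector size are 
hypothetical stand-ins for a real workload, and the sketch assumes an R 
build with profiling support:

```r
# Hypothetical workload: coerce "just in case", then do the real work.
prep_and_sum <- function(x, reps = 50L) {
  total <- 0
  for (i in seq_len(reps)) {
    total <- total + sum(sqrt(as.double(x)))
  }
  total
}

x <- runif(2e6)
prof_file <- tempfile(fileext = ".out")

Rprof(prof_file, interval = 0.01)  # start the sampling profiler
result <- prep_and_sum(x)
Rprof(NULL)                        # stop profiling

# Self time per function; NULL if no samples could be summarised.
prof <- tryCatch(summaryRprof(prof_file)$by.self,
                 error = function(e) NULL)
print(prof)
unlink(prof_file)
```

If as.double or as.vector does not appear near the top of the self-time 
table for a realistic input, the coercion is unlikely to be the bottleneck.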

>
> Duncan Murdoch
>
>>
>> Example:
>>
>> % R --vanilla
>> > as.double.double <- function(x, ...) x
>> > x <- double(1e7)
>> > gc()
>>            used (Mb) gc trigger (Mb) max used (Mb)
>> Ncells   234019  6.3     467875 12.5   350000  9.4
>> Vcells 10103774 77.1   11476770 87.6 10104223 77.1
>> > x <- as.double(x)
>> > gc()
>>            used (Mb) gc trigger (Mb) max used (Mb)
>> Ncells   234028  6.3     467875 12.5   350000  9.4
>> Vcells 10103779 77.1   12130608 92.6 10104223 77.1
>>
>> What's the catch?
>>
>>
>> The reason why I bring it up, is because many (most?) methods are
>> using as.double() etc "just in case" when passing arguments to
>> .Call(), .Fortran() etc, e.g. stats::smooth.spline():
>>
>>     fit <- .Fortran(R_qsbart, as.double(penalty), as.double(dofoff),
>>         x = as.double(xbar), y = as.double(ybar), w = as.double(wbar), <etc>)
>>
>> Memory usage peaks during the actual call, and the garbage
>> collector cannot reclaim the copies until after the call returns. This
>> seems to be a waste of memory, especially when the objects are large
>> (100-1000 MB).
>>
>> Cheers
>>
>> Henrik
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


