[R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory

Michael Cassin michael at cassin.name
Fri Aug 10 09:34:54 CEST 2007


Thanks for all the comments,

The artificial dataset is as representative of my 440MB file as I could design.

I did my best to reduce the complexity of my problem to minimal
reproducible code, as suggested in the posting guidelines.  Having
searched the archives, I was happy to find that the topic had been
covered before, and that Prof Ripley had suggested the I/O manuals gave
some advice.  However, I was unable to get anywhere with that advice.

I spent six hours preparing my post to R-help.  Sorry not to have read
the 'R Internals' manual first; I just wanted to know whether I could
use scan() more efficiently.
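
For context, the shape of the scan() call I was trying to tune looked
roughly like this (illustrative only -- the file name and column layout
are made up, not my real data):

  # two character columns, comma-separated, header row skipped
  cols <- scan("mydata.csv",
               what = list(id = character(), value = character()),
               sep = ",", skip = 1, quiet = TRUE)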

My hurdle seems to have nothing to do with calling scan() efficiently,
and I suspect the same is true for the originator of this memory
experiment thread.  It is the overhead of storing short strings, as
Charles identified and Brian explained.  I appreciate the investigation
and clarification you have both provided.

56 bytes of overhead for a two-character string seems extreme to me,
but I'm not complaining.  I really like R, and since it is free I
accept that it is what it is.
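
For anyone who wants to check the numbers themselves, here is a small
sketch along the lines of Charles's example below (my own code, not
taken from his post; exact sizes depend on platform and R version):

  n <- 1e6
  shared   <- rep(letters, length.out = n)         # only 26 distinct strings
  distinct <- paste(shared, seq_len(n), sep = "")  # roughly n distinct strings
  object.size(shared)   / n   # bytes per element when string values are shared
  object.size(distinct) / n   # bytes per element when nearly every value is
                              # unique; per Brian's note below, expect 56+
                              # bytes per distinct short string on 64-bit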

In my case pre-processing is not an option, because this is not a
one-off problem with a particular file.  In my application, R runs in
batch mode as part of a tool chain for arbitrary CSV files.  Having
found cases where memory usage was as high as 20x the file size, and
allowing for one copy of the loaded dataset, I'll just need to document
that files as small as 1/40th of system memory may consume all of it.
That rules out some important datasets (US Census, UK Office for
National Statistics files, etc.) on 2GB servers.
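
For the tool chain itself, the best I can do is guard the load step:
compare the file size against available memory using the multiplier
above, and let repeated values collapse into factor levels, which Brian
notes are efficient for character data with few distinct values.  A
rough sketch (the helper name is mine, the 20x default comes from my
tests above, and the limit is whatever the server can spare):

  # sketch only: refuse to load files likely to exhaust memory
  load_csv_guarded <- function(path, mem_limit_bytes, multiplier = 20) {
    est <- file.info(path)$size * multiplier
    if (est > mem_limit_bytes)
      stop(sprintf("refusing to load %s: ~%.0f MB estimated, %.0f MB allowed",
                   path, est / 2^20, mem_limit_bytes / 2^20))
    read.csv(path, stringsAsFactors = TRUE)  # factors share storage for repeated values
  }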

Regards, Mike


On 8/9/07, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
> On Thu, 9 Aug 2007, Charles C. Berry wrote:
>
> > On Thu, 9 Aug 2007, Michael Cassin wrote:
> >
> >> I really appreciate the advice and this database solution will be useful to
> >> me for other problems, but in this case I  need to address the specific
> >> problem of scan and read.* using so much memory.
> >>
> >> Is this expected behaviour?
>
> Yes, and documented in the 'R Internals' manual.  That is basic reading
> for people wishing to comment on efficiency issues in R.
>
> >> Can the memory usage be explained, and can it be
> >> made more efficient?  For what it's worth, I'd be glad to try to help if the
> >> code for scan is considered to be worth reviewing.
> >
> > Mike,
> >
> > This does not seem to be an issue with scan() per se.
> >
> > Notice the difference in size of big2, big3, and bigThree here:
> >
> >> big2 <- rep(letters,length=1e6)
> >> object.size(big2)/1e6
> > [1] 4.000856
> >> big3 <- paste(big2,big2,sep='')
> >> object.size(big3)/1e6
> > [1] 36.00002
>
> On a 32-bit computer every R object has an overhead of 24 or 28 bytes.
> Character strings are R objects, but in some functions such as rep (and
> scan for up to 10,000 distinct strings) the objects can be shared.  More
> string objects will be shared in 2.6.0 (but factors are designed to be
> efficient at storing character vectors with few values).
>
> On a 64-bit computer the overhead is usually double.  So I would expect
> just over 56 bytes/string for distinct short strings (and that is what
> big3 gives).
>
> But 56Mb is really not very much (tiny on a 64-bit computer), and 1
> million items is a lot.
>
> [...]
>
>
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>


