[R] error loading huge .RData

Luke Tierney luke at stat.umn.edu
Thu Apr 25 18:13:16 CEST 2002


On Tue, Apr 23, 2002 at 11:36:48AM -0400, Liaw, Andy wrote:
> Dear R-help,
> 
> I've run into a problem loading .RData:  I was running a large computation,
> which supposedly produced a large R object.  At the end of the session, I did
> a save.image() and then quit.  The .RData has size 613,249,399 bytes.  Now I
> can't get R to load this .RData file.  Whenever I try, I get "Error:
> vector memory exhausted (limit reached)".  I tried adding
> "--min-vsize=1000M", but that didn't help.  I also tried R  --vanilla and
> then attach(".RData"), same error.
> 
> From what I can see, the file is not corrupted.  How can I get R to load it?
> 
> System info:
> R-1.4.1 on Mandrake Linux 7.1 (kernel 2.4.3)
> Dual P3-866 Xeon with 2GB RAM.
> 
> Regards,
> Andy
> 

Andy Liaw indicated privately that he decided to upgrade his Linux and
the problem went away (because the default behavior of glibc 2.2's
malloc differs from that of earlier versions).

In tracking this down I learned a little more about address space use
on Linux than I wanted to know.  But as this may come up again I'll
report it for future reference.  The details are probably not quite
right, but I think the big picture is.  It may be worth knowing if you
want to use large amounts of memory on 32-bit Linux, or any other
32-bit OS--the details given here are specific to Linux, but the
general issue is not: Trying to carve out room for one or two
gigabytes of memory from a 32-bit address space, which is limited to
4G, is tricky.

The details: In 32-bit Linux you have a 4G address space.  The lower
3G of this is available for user mode.  The bottom contains program
text, data, and bss segments followed by the heap.  The heap, which is
what malloc traditionally uses, is a contiguous range of addresses
that goes up to a point that can be adjusted with the brk system
call. The brk can be adjusted to increase heap size as long as the
resulting range does not intersect any range that has been used for
memory mapping.
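
The brk rule just described can be written down as a small interval
check.  This is only a toy model of the constraint, not what the kernel
actually does: the function name, the half-open interval convention and
the sample addresses are my own assumptions for illustration.

```python
# Toy model of the brk constraint: the heap [heap_start, new_brk) may grow
# only while it stays inside user space and overlaps no mapped range.
# The 3G user-mode limit is taken from the text; everything else is illustrative.

USER_SPACE_TOP = 3 * 2**30  # lower 3G of the 4G address space


def brk_can_move_to(heap_start, new_brk, mappings):
    """Return True if the heap could extend to new_brk.

    mappings is a list of (start, end) half-open address ranges that
    are already used for memory mapping.
    """
    if new_brk < heap_start or new_brk > USER_SPACE_TOP:
        return False
    for m_start, m_end in mappings:
        # Two half-open intervals overlap if each starts before the other ends.
        if heap_start < m_end and m_start < new_brk:
            return False
    return True


# Hypothetical addresses, chosen only to exercise the check.
example_mappings = [(0x50000000, 0x50100000)]
heap_start = 0x08100000
```

With these made-up values, brk_can_move_to(heap_start, 0x50000000,
example_mappings) holds, but asking for even one byte more does not,
however much physical memory or swap is free.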

Shared libraries are loaded by memory mapping their data, text and bss
sections with mmap.  You can find out what is mapped where by looking
at /proc/<your process pid>/maps.  Every Linux program needs ld.so,
and that is always mapped to a range of addresses starting at
0x40000000.  [This is configurable when you build a custom kernel, and
it may now or soon be adjustable to some degree at boot or run time,
but this is the default.]  So the heap can grow at most to this point,
which means the contiguous heap is limited to a size of a little under
1G, no matter how much swap space or memory you have.
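
To see where the "little under 1G" comes from, one can parse the maps
listing and measure the gap between the current break and the lowest
mapping above it.  The sketch below is a rough illustration: the sample
listing is invented (paths, inode numbers and the break address are all
hypothetical, not captured from a real machine), and only the
0x40000000 start address for ld.so is taken from the text.

```python
from typing import List, NamedTuple, Optional


class Region(NamedTuple):
    start: int
    end: int
    perms: str
    path: str


def parse_maps(text: str) -> List[Region]:
    """Parse text in the /proc/<pid>/maps format into Region tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_s, end_s = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        regions.append(Region(int(start_s, 16), int(end_s, 16), fields[1], path))
    return regions


def heap_headroom(regions: List[Region], brk: int) -> Optional[int]:
    """Bytes between brk and the lowest mapping starting at or above it."""
    starts = [r.start for r in regions if r.start >= brk]
    return min(starts) - brk if starts else None


# Invented listing for illustration only.
SAMPLE_MAPS = """\
08048000-080c0000 r-xp 00000000 03:01 1234 /usr/lib/R/bin/R.bin
080c0000-080d0000 rw-p 00078000 03:01 1234 /usr/lib/R/bin/R.bin
40000000-40013000 r-xp 00000000 03:01 5678 /lib/ld-2.1.3.so
40013000-40014000 rw-p 00013000 03:01 5678 /lib/ld-2.1.3.so
bfffe000-c0000000 rwxp fffff000 00:00 0
"""

HYPOTHETICAL_BRK = 0x080D0000  # assumed: heap starts right after the data segment

regions = parse_maps(SAMPLE_MAPS)
headroom = heap_headroom(regions, HYPOTHETICAL_BRK)
print("heap can grow by about %.0f MB" % (headroom / 2**20))
```

On a live Linux system the same functions can be fed the contents of
/proc/self/maps instead of the sample string; the break address would
then have to be obtained separately.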

glibc malloc can either allocate only from the traditional contiguous
heap, which implies a 1G max on total allocation, or it can also
allocate using mmap, which allows it to use closer to the full 3G of
user mode address space.  The drawbacks of using mmap include a speed
penalty, at least by some measurements, and fragmentation of the
address space available for memory mapping large files.  Whether and
how glibc uses mmap is tunable by calling mallopt or by setting the
environment variables MALLOC_MMAP_THRESHOLD_ and MALLOC_MMAP_MAX_.
The default behavior seems to have changed between glibc 2.1 and 2.2,
and may have changed again with a patch to 2.2.  By default, 2.1 seems
to only use mmap for allocations of size above the mmap threshold
(which I think defaults to 128K), but glibc 2.2 will also use mmap for
smaller allocations if it can't get them from the standard heap.
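
The difference between the two default behaviors, as I understand them,
can be put into a few lines.  This is a toy decision function based only
on the description above, not on the glibc source: the function and
parameter names are made up, the 128K default is the value I believe
applies, and treating a request equal to the threshold as "large" is an
assumption.

```python
DEFAULT_MMAP_THRESHOLD = 128 * 1024  # assumed default, per the text


def choose_backend(size, heap_free, fallback_to_mmap,
                   mmap_threshold=DEFAULT_MMAP_THRESHOLD, mmap_allowed=True):
    """Toy model: where does an allocation of `size` bytes come from?

    heap_free        -- bytes still obtainable from the contiguous heap
    fallback_to_mmap -- False models the 2.1-like default described above,
                        True models the 2.2-like default
    mmap_allowed     -- False models a malloc restricted to the heap
    Returns "heap", "mmap" or "fail".
    """
    if mmap_allowed and size >= mmap_threshold:
        return "mmap"   # large requests go straight to mmap
    if size <= heap_free:
        return "heap"   # small requests come from the brk heap
    if mmap_allowed and fallback_to_mmap:
        return "mmap"   # heap exhausted: 2.2-like fallback
    return "fail"       # heap exhausted and no fallback


# A 64K request once the contiguous heap is exhausted:
old_style = choose_backend(64 * 1024, heap_free=0, fallback_to_mmap=False)
new_style = choose_backend(64 * 1024, heap_free=0, fallback_to_mmap=True)
print(old_style, new_style)
```

In this model a program that makes many allocations below the threshold
hits a hard wall once the heap is full under the old default, while the
new default keeps going until the mmap region itself is exhausted.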

So, because of the ld.so mapping, 1G is something of a critical
threshold.  In this particular example with the large .RData file, it
appears that the malloc in use did not fall back to mmap when the
contiguous heap ran out.  The process that created the file was
probably quite close to the limit of 1G but did not exceed it, so it
did not fail.  Loading the file would push the required memory over
the limit, and so could not be done with the standard malloc settings
on this system.  Upgrading to a newer glibc changed the default
behavior and now the file can be loaded.  With the old libc it might
have been possible to get the file loaded by using environment
variable settings something like

    MALLOC_MMAP_THRESHOLD_=2000
    MALLOC_MMAP_MAX_=1000000

for the R process, though these may not be the safest choices and may
degrade performance.  But it is not entirely clear how much useful work
one can do with so large a portion of the address space in use.
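
One way to apply such settings to a single process, without exporting
them in the login shell, is to start that process with a modified
environment.  The helper names below are hypothetical, the defaults
simply repeat the values suggested above, and whether a given glibc
honors these variables depends on its version, so treat this as a
sketch rather than a recipe.

```python
import os
import subprocess


def malloc_env(mmap_threshold, mmap_max, base=None):
    """Return a copy of the environment with the two malloc variables set."""
    env = dict(os.environ if base is None else base)
    # Note the trailing underscores in the variable names.
    env["MALLOC_MMAP_THRESHOLD_"] = str(mmap_threshold)
    env["MALLOC_MMAP_MAX_"] = str(mmap_max)
    return env


def run_with_malloc_env(cmd, mmap_threshold=2000, mmap_max=1000000):
    """Run cmd (a list of arguments) with the malloc settings applied."""
    return subprocess.run(cmd, env=malloc_env(mmap_threshold, mmap_max),
                          check=False)


# Example, assuming an R executable is on the PATH:
#   run_with_malloc_env(["R", "--vanilla"])
```

Only the child process sees the settings, so other programs started
from the same shell keep the default malloc behavior.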

64-bit systems are starting to look real attractive.

A couple of references:

http://www.linuxshowcase.org/full_papers/ezolt/ezolt.pdf
http://mail.nl.linux.org/linux-mm/2000-07/msg00001.html
http://www.linux-mag.com/2001-06/compile_01.html
http://www.linux-mag.com/2001-07/compile_01.html
http://www.gnu.org/manual/glibc-2.2.3
http://sources.redhat.com/ml/libc-hacker/2000-07/msg00273.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0101.1/0007.html

<R source root>/src/gnuwin32/malloc.c

luke


-- 
Luke Tierney
University of Minnesota                      Phone:           612-625-7843
School of Statistics                         Fax:             612-624-8868
313 Ford Hall, 224 Church St. S.E.           email:      luke at stat.umn.edu
Minneapolis, MN 55455 USA                    WWW:  http://www.stat.umn.edu