[R] analysis of large data set

Matthew Keller mckellercran at gmail.com
Sat Nov 17 01:55:06 CET 2007


Spencer,

There have been a lot of discussions on this list about working with
large datasets in R, so looking through those will probably inform you
better than I can. With that said...

I have been trying to work with very large datasets as well (genetic
datasets... maybe we're in the same boat?). The long and short of it
is that it's a challenge in R. Doable, but a challenge.

First, I'm guessing you're using 64-bit computing and a 64-bit version
of R, right? (I'm sure you're aware that you're capped in how much RAM
you can address with 32-bit computing.) The error message is telling
you that R cannot find a contiguous block of RAM large enough for
whatever object it was trying to manipulate right before it failed.
The total space taken up by your session was certainly much greater
than that.
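To give a sense of scale (my own back-of-envelope sketch, not your
exact workflow): a numeric value costs 8 bytes, so you can do the
arithmetic in R itself. The small matrix at the end is made up, just
to show object.size().

## 8 bytes per double: the raw 450,000 x 34 numeric matrix is modest
450000 * 34 * 8 / 2^20              # roughly 117 Mb

## but a single 1.1 Gb allocation is about 148 million doubles in one piece
1.1 * 2^30 / 8

## lm()/glm() build a model matrix plus several working copies, so peak
## usage can be several times the size of the data. object.size() shows
## what an existing object occupies:
x <- matrix(rnorm(34 * 30000), ncol = 34)   # made-up example matrix
print(object.size(x), units = "Mb")         # about 7.8 Mb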

How to avoid this problem? Short of reworking R to be more memory
efficient, you can buy more RAM, use a package designed to store
objects on the hard drive rather than in RAM (ff, filehash, or R.huge -
I personally have had the best luck with the latter), or use a package
designed to perform linear regression from compact cross-product
summaries such as t(X) %*% X rather than the full X (biglm - haven't
used this yet). I also have yet to delve into the RSQLite package,
which provides an interface between R and the SQLite database system
(so you only bring into memory the portion of the database you need to
work with).
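To make the chunked-regression idea concrete, here is a minimal sketch
with biglm (assuming that's the package in question); the file name,
chunk size, and formula are made up, so adapt them to your data:

library(biglm)

con <- file("bigdata.csv", open = "r")          # made-up file name
first <- read.csv(con, nrows = 50000)           # header plus first chunk
fit <- biglm(y ~ x1 + x2 + x3, data = first)    # made-up formula
nms <- names(first)

## feed the remaining rows through the same fit, one chunk at a time
repeat {
  chunk <- tryCatch(read.csv(con, header = FALSE, col.names = nms,
                             nrows = 50000),
                    error = function(e) NULL)   # NULL once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)
}
close(con)
summary(fit)

The point is that only one chunk of rows is ever held in RAM while the
regression is built up incrementally.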

If you're unwilling to do any of the above, the final option is to
read in only the part of the matrix you need, work with that portion
of it, and then remove it from memory. Slow but doable for most
things.
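For example (a sketch with made-up column classes - "NULL" tells
read.csv to skip a column entirely):

keep <- c("numeric", "factor", rep("NULL", 31), "numeric")   # 34 columns assumed
part <- read.csv("bigdata.csv", colClasses = keep, nrows = 100000)
## ... work with 'part' ...
rm(part)    # drop the object when finished
gc()        # and hand the memory back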

Oh yeah, I have found that frequent calls to gc() help out enormously,
regardless of what ?gc implies. And I'm constantly keeping an eye on
the Unix top command (not sure what the equivalent is in Windows) to
check the RAM my session is taking up.
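You can also keep an eye on things from inside R; a quick sketch:

gc()                                    # force a collection and print usage

## rough census of the biggest objects in the workspace
sizes <- sapply(ls(), function(nm) object.size(get(nm)))
head(sort(sizes, decreasing = TRUE))    # bytes per object, largest first

## on Windows, memory.size() reports the MB the R process is using,
## and memory.size(max = TRUE) the maximum obtained so far
memory.size()
memory.size(max = TRUE)

Best of luck!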

Matt


On Nov 16, 2007 5:24 PM, sj <ssj1364 at gmail.com> wrote:
> All,
>
> I am working with a large data set (~450,000 rows by 34 columns), and I am
> trying to fit a regression model (I have tried several procedures: psm
> (Design package), lm, glm). However, whenever I try to fit the model I get
> the following error:
>
>
> Error: cannot allocate vector of size 1.1 Gb
>
> Here are the specs of the machine and version of R I am using
>
> Windows Server 2003 R2 Enterprise x64 Service Pack 2
>
> Intel Pentium D 3.00 GHz
> 3.93 GB RAM
>
> R 2.6.0
>
> when I type the command
>
> memory.limit()
> I get:
> 3583.875
>
> I assume that means that I have about 3.5 GB at my disposal, so I am
> confused why I can't allocate a vector of 1.1 GB. Any suggestions on what
> to do?
>
> Best,
>
> Spencer



-- 
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com


