[R] Linear models over large datasets

Alp ATICI alpatici at gmail.com
Thu Aug 16 22:24:08 CEST 2007


I'd like to fit linear models on very large datasets. My data frames
are about 2000000 rows x 200 columns of doubles, and I am using a
64-bit build of R. I've googled about this extensively and gone over
the "R Data Import/Export" guide. My primary issue is that although my
data is about 4 GB in ASCII form (and should be smaller still as
binary doubles), R consumes about 12 GB of virtual memory.

What exactly are my options for improving this? I looked into the biglm
package, but the problem with it is that it relies on the update()
function and is therefore not transparent (I am using a sophisticated
script which is hard to modify). I really liked the concept behind the
LM package described here:
http://www.econ.uiuc.edu/~roger/research/rq/RMySQL.html
but it is no longer available. How could one fit linear models to very
large datasets without loading the entire set into memory, reading
instead from a file or database (possibly through a connection), using
a relatively simple modification of the standard lm()? Alternatively,
how could one improve R's memory usage on a large dataset, for example
by changing some of R's default parameters or even using on-the-fly
compression? I don't mind if considerably more CPU time is required.
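For concreteness, the chunked biglm/update() pattern I am trying to
avoid restructuring my script around looks roughly like this (the file
name, formula, and chunk size below are just placeholders):

library(biglm)

con <- file("bigdata.txt", open = "r")   # placeholder data file
chunk.size <- 100000                     # rows per chunk; tune to RAM

## First chunk establishes the column names and initialises the fit
chunk <- read.table(con, header = TRUE, nrows = chunk.size)
cols  <- names(chunk)
fit   <- biglm(y ~ x1 + x2, data = chunk)   # placeholder formula

## Remaining chunks are streamed through update()
repeat {
  chunk <- tryCatch(
    read.table(con, header = FALSE, nrows = chunk.size,
               col.names = cols),
    error = function(e) NULL)
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)
}
close(con)
summary(fit)

What I am after is something with this memory behaviour but without
having to rewrite the calling code around an explicit chunk loop.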

Thank you in advance for your help.
