[R] Linear models over large datasets

Greg Snow Greg.Snow at intermountainmail.org
Thu Aug 16 22:54:20 CEST 2007


Here are a couple of options that you could look at:

The biglm package also has the bigglm function, which you only call once
(no update); you just need to give it a function that reads the data in
chunks for you.  Using bigglm with a gaussian family is equivalent to
lm.
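
Something along these lines should work (just a rough sketch; the file
name "bigdata.txt", the column names y, x1, x2, and the chunk size are
made up, so adapt them to your data -- it assumes a whitespace-delimited
text file with a header row):

library(biglm)

make.data <- function(filename, chunksize = 100000,
                      col.names = c("y", "x1", "x2")) {
    con <- NULL
    ## bigglm expects a function of one argument 'reset':
    ## reset = TRUE restarts at the top of the file,
    ## reset = FALSE returns the next chunk, or NULL when done.
    function(reset = FALSE) {
        if (reset) {
            if (!is.null(con)) close(con)
            con <<- file(filename, open = "r")
            readLines(con, n = 1)          # skip the header line
            return(invisible(NULL))
        }
        chunk <- try(read.table(con, nrows = chunksize, header = FALSE,
                                col.names = col.names), silent = TRUE)
        if (inherits(chunk, "try-error") || nrow(chunk) == 0) NULL else chunk
    }
}

fit <- bigglm(y ~ x1 + x2, data = make.data("bigdata.txt"),
              family = gaussian())
summary(fit)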

You could also write your own function that calls biglm on the first
chunk of data and then update() on the remaining chunks, and just call
that function instead of lm().
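
For example (again only a sketch; chunked.lm, "bigdata.txt", and the
column names are made up):

library(biglm)

chunked.lm <- function(formula, filename, col.names, chunksize = 100000) {
    con <- file(filename, open = "r")
    on.exit(close(con))
    readLines(con, n = 1)                          # skip the header line
    chunk <- read.table(con, nrows = chunksize, header = FALSE,
                        col.names = col.names)
    fit <- biglm(formula, data = chunk)            # fit on the first chunk
    repeat {
        chunk <- try(read.table(con, nrows = chunksize, header = FALSE,
                                col.names = col.names), silent = TRUE)
        if (inherits(chunk, "try-error") || nrow(chunk) == 0) break
        fit <- update(fit, chunk)                  # fold in the next chunk
    }
    fit
}

fit <- chunked.lm(y ~ x1 + x2, "bigdata.txt", c("y", "x1", "x2"))
coef(fit)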

The SQLiteDF package has an sdflm function that uses the same internal
code as biglm, but works on data stored in an SQLite database.  You
don't need to call update with this function.

Hope this helps,

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111
 
 

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Alp ATICI
> Sent: Thursday, August 16, 2007 2:24 PM
> To: r-help at stat.math.ethz.ch
> Subject: [R] Linear models over large datasets
> 
> I'd like to fit linear models on very large datasets. My data 
> frames are about 2000000 rows x 200 columns of doubles, and I 
> am using a 64-bit build of R. I've googled about this 
> extensively and went over the "R Data Import/Export" guide. 
> My primary issue is that although my data is about 4 GB in 
> ASCII form (and therefore much smaller in binary), R consumes 
> about 12 GB of virtual memory.
> 
> What exactly are my options to improve this? I looked into 
> the biglm package, but the problem with it is that it uses the 
> update() function and is therefore not transparent (I am using 
> a sophisticated script which is hard to modify). I really liked 
> the concept behind the LM package 
> here: http://www.econ.uiuc.edu/~roger/research/rq/RMySQL.html 
> but it is no longer available. How could one fit linear 
> models to very large datasets without loading the entire set 
> into memory, but rather from a file/database (possibly through 
> a connection), using a relatively simple modification of 
> standard lm()? Alternatively, how could one improve the memory 
> usage of R given a large dataset (by changing some default 
> parameters of R or even using on-the-fly compression)? I 
> don't mind much higher CPU time requirements.
> 
> Thank you in advance for your help.
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


