[R] lean and mean lm/glm?
damien.moore at excite.com
Tue Aug 22 01:44:46 CEST 2006
>For very large regression problems there is the biglm package (put your
>data into a database, read in 500,000 rows at a time, and keep updating
Thanks. I took a look at biglm and it seems pretty easy to use and, looking at the source, it avoids much of the redundancy of lm. Correct me if I'm wrong, but I think it would be virtually impossible to extend to glm, because of the non-linearity in glm models.
I might hack around at the source code for glm.fit -- I think I can avoid some of the redundancy involved in that routine pretty easily, but it will mean rewriting the summary output code...
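On the non-linearity point, here is a rough sketch of one possible workaround, not anything taken from biglm or glm.fit. A glm is usually fitted by iteratively reweighted least squares (IRLS), and each iteration is a weighted least-squares problem whose cross-products can be summed chunk by chunk, at the cost of one pass over the data per iteration. The function name `chunked_probit` and the `chunks` list are hypothetical, and the in-memory chunks stand in for reads from a file or database.

```r
# Sketch only: chunked IRLS for a probit model in base R.
# `chunked_probit` and `chunks` are hypothetical names for illustration.
chunked_probit <- function(chunks, p, maxit = 25, tol = 1e-8) {
  beta <- numeric(p)
  for (it in seq_len(maxit)) {
    XtWX <- matrix(0, p, p)
    XtWz <- numeric(p)
    for (ch in chunks) {              # in practice: fetch each chunk from disk
      X   <- ch$X
      y   <- ch$y
      eta <- drop(X %*% beta)         # linear predictor for this chunk
      mu  <- pnorm(eta)               # fitted probabilities
      dmu <- dnorm(eta)               # d mu / d eta for the probit link
      w   <- dmu^2 / (mu * (1 - mu))  # IRLS working weights
      z   <- eta + (y - mu) / dmu     # working response
      XtWX <- XtWX + crossprod(X, w * X)
      XtWz <- XtWz + crossprod(X, w * z)
    }
    beta_new  <- drop(solve(XtWX, XtWz))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

# Small simulated check: four chunks of a 2000-row probit data set.
set.seed(1)
n <- 2000
x <- rnorm(n)
X <- cbind(1, x)
y <- as.numeric(X %*% c(0.5, -1) + rnorm(n) > 0)
idx <- split(seq_len(n), rep(1:4, length.out = n))
chunks <- lapply(idx, function(i) list(X = X[i, , drop = FALSE], y = y[i]))
beta_hat <- unname(chunked_probit(chunks, p = 2))
```

Only the p-by-p matrix and two length-p vectors persist between chunks, so peak memory is set by the chunk size and not by the total number of rows. The sketch has no safeguards for fitted probabilities that reach 0 or 1.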
--- On Mon 08/21, Greg Snow <Greg.Snow at intermountainmail.org> wrote:
From: Greg Snow [mailto:Greg.Snow at intermountainmail.org]
To: damien.moore at excite.com, r-help at stat.math.ethz.ch
Date: Mon, 21 Aug 2006 12:01:06 -0600
Subject: RE: [R] lean and mean lm/glm?
For very large regression problems there is the biglm package (put your
data into a database, read in 500,000 rows at a time, and keep updating
This has not been extended to glm yet.
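To illustrate the "read a chunk, update, repeat" idea in base R only, here is a minimal sketch that accumulates the cross-products X'X and X'y over chunks and solves once at the end. This is a simpler stand-in, not biglm's own method: as far as I know biglm updates a QR decomposition incrementally, which is numerically more stable than the normal equations used here. The data frame `dat` and the chunk size are made up for the example; in real use each chunk would come from a database query or a file read.

```r
# Sketch only: chunked least squares via accumulated cross-products.
set.seed(42)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = runif(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n)

chunk_size <- 2500
XtX <- matrix(0, 3, 3)   # running sum of X'X (intercept + 2 predictors)
Xty <- numeric(3)        # running sum of X'y
for (start in seq(1, n, by = chunk_size)) {
  rows  <- start:min(start + chunk_size - 1, n)
  chunk <- dat[rows, ]                     # stand-in for a database fetch
  X     <- model.matrix(~ x1 + x2, chunk)  # design matrix for this chunk only
  XtX   <- XtX + crossprod(X)
  Xty   <- Xty + crossprod(X, chunk$y)
}
beta_chunked <- drop(solve(XtX, Xty))
```

The coefficients should agree with `coef(lm(y ~ x1 + x2, dat))` up to rounding error, while only one chunk's design matrix is in memory at a time.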
Hope this helps,
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
greg.snow at intermountainmail.org
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Damien Moore
Sent: Monday, August 21, 2006 11:49 AM
To: r-help at stat.math.ethz.ch
Subject: [R] lean and mean lm/glm?
Hi All: I'm new to R and have a few questions about getting R to run
efficiently with large datasets.
I'm running R on Windows XP with 1 GB of RAM (so about 600-700 MB after
the usual Windows overhead). I have a dataset that has 4 million
observations and about 20 variables. I want to run probit regressions on
this data, but can't do this with more than about 500,000 observations
before I start running out of RAM (you could argue that I'm getting
sufficient precision with <500,000 obs, but let's pretend otherwise).
Loading 500,000 observations into a data frame only takes about 100 MB of
RAM, so that isn't the problem. Instead it seems R uses a huge amount of
memory when running the glm methods. I called the Fortran routines that
lm and glm use directly, but even they create a large number of
extraneous variables in the output (e.g. the Xs, ys, residuals etc.) and
during processing. For instance (sample code):
# I notice this step chews up a lot more than the 7mb of ram required to
# store y during processing, but cleans up ok afterwards with a gc() call
y <- 3 * x + rnorm(1000000)

# this is the fortran routine called by lm - regressing y on X here
z <- .Fortran("dqrls", qr = X, n = n, p = p, y = y, ny = ny,
              tol = as.double(tol), coefficients = mat.or.vec(p, ny),
              residuals = y, effects = y, rank = integer(1), pivot = 1:p,
              qraux = double(p), work = double(2 * p), PACKAGE = "base")
This code runs very quickly, suggesting that in principle R should have
no problem at all handling very large data sets, but it uses >100 MB
during processing and z is about a 20 MB object. Scaling this up to a much
larger dataset with many variables, it's easy to see I'm going to run into
1. Are there any memory-efficient alternatives to lm/glm in R?
2. Is there any way to prevent the Fortran routine "dqrls" from
producing so much output? (I suspect not, since its output has to be
compatible with the summary method, which seems to rely on having a copy
of all variables instead of just references to the relevant variables -
correct me if I'm wrong on this.)
3. Failing 1 & 2, how easy would it be
to create new versions of lm and glm that don't use so much memory? (Not
that I'm volunteering or anything ;) ). There is no need to hold
individual residuals in memory or make copies of the variables (at least
for my purposes). How well documented is the source code?
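As a rough illustration of question 3, the sketch below shows that coefficients and standard errors for a linear model can be obtained without storing residuals or extra copies of the data, using only X'X, X'y and y'y. The function `lean_lm` is a hypothetical name, not an existing R function, and it uses the normal equations, so it is less numerically robust than the QR approach lm uses.

```r
# Sketch only: a linear fit that keeps just p-sized summaries.
# `lean_lm` is a hypothetical helper, not part of base R.
lean_lm <- function(X, y) {
  n   <- nrow(X)
  p   <- ncol(X)
  XtX <- crossprod(X)                       # p x p
  Xty <- drop(crossprod(X, y))              # length p
  yty <- drop(crossprod(y))                 # scalar y'y
  beta   <- solve(XtX, Xty)
  rss    <- yty - sum(beta * Xty)           # y'y - b'X'y, no residual vector
  sigma2 <- rss / (n - p)
  se     <- sqrt(sigma2 * diag(solve(XtX))) # standard errors of coefficients
  list(coefficients = beta, std.error = se,
       sigma = sqrt(sigma2), df.residual = n - p)
}

# Small simulated check against lm().
set.seed(7)
n <- 5000
x <- rnorm(n)
y <- 3 * x + rnorm(n)
X <- cbind(1, x)
fit <- lean_lm(X, y)
```

The returned list is a few hundred bytes regardless of n, whereas an lm object carries the residuals, fitted values, QR decomposition and model frame.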