[R] lean and mean lm/glm?

Greg Snow Greg.Snow at intermountainmail.org
Mon Aug 21 20:01:06 CEST 2006


For very large regression problems there is the biglm package (put your
data into a database, read in 500,000 rows at a time, and keep updating
the fit).

This has not been extended to glm yet.
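
A rough sketch of that chunked approach (untested; it assumes a
comma-separated file "bigdata.csv" with a simple header row and columns
named y, x1 and x2, so adjust to your own data):

library(biglm)

con <- file("bigdata.csv", open = "r")
hdr <- strsplit(readLines(con, n = 1), ",")[[1]]

# first chunk initialises the fit
chunk <- read.table(con, nrows = 500000, sep = ",", col.names = hdr)
fit <- biglm(y ~ x1 + x2, data = chunk)

# later chunks update the fit without holding the whole data set in memory
repeat {
  chunk <- try(read.table(con, nrows = 500000, sep = ",", col.names = hdr),
               silent = TRUE)
  if (inherits(chunk, "try-error") || nrow(chunk) == 0) break
  fit <- update(fit, chunk)
}
close(con)
summary(fit)

Only the current chunk plus biglm's small running summaries are in memory
at any time, so the data size is limited by disk rather than RAM.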

Hope this helps, 


-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111
 

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Damien Moore
Sent: Monday, August 21, 2006 11:49 AM
To: r-help at stat.math.ethz.ch
Subject: [R] lean and mean lm/glm?


Hi All: I'm new to R and have a few questions about getting R to run
efficiently with large datasets.

I'm running R on Windows XP with 1Gb RAM (so about 600-700Mb after the
usual Windows overhead). I have a dataset with 4 million observations and
about 20 variables. I want to run probit regressions on this data, but I
can't use more than about 500,000 observations before I start running out
of RAM (you could argue that I'm getting sufficient precision with
<500,000 obs, but let's pretend otherwise). Loading 500,000 observations
into a data frame only takes about 100Mb of RAM, so that isn't the
problem. Instead, it seems R uses a huge amount of memory when running the
glm methods. I called the Fortran routines that lm and glm use directly,
but even they create a large number of extraneous variables in the output
(e.g. the Xs, ys, residuals, etc.) and during processing. For instance
(sample code):

x <- runif(1000000)
y <- 3 * x + rnorm(1000000)
# I notice this step chews up a lot more than the 7Mb of RAM required to
# store y during processing, but cleans up OK afterwards with a gc() call
X <- cbind(x)
p <- ncol(X)
n <- NROW(y)
ny <- NCOL(y)
tol <- 1e-7

# this is the Fortran routine called by lm - regressing y on X here
z <- .Fortran("dqrls", qr = X, n = n, p = p, y = y, ny = ny,
              tol = as.double(tol), coefficients = mat.or.vec(p, ny),
              residuals = y, effects = y, rank = integer(1), pivot = 1:p,
              qraux = double(p), work = double(2 * p), PACKAGE = "base")

This code runs very quickly, suggesting that in principle R should have no
problem at all handling very large data sets, but it uses >100Mb during
processing and z is about a 20Mb object. Scaling this up to a much larger
dataset with many variables, it's easy to see I'm going to run into
problems.
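
One quick way to check those numbers after the call above (just a sketch):

gc(reset = TRUE)                     # reset the peak-memory counters first
# ... run the .Fortran("dqrls", ...) call here ...
print(object.size(z), units = "Mb")  # size of the returned list, about 20Mb
gc()                                 # "max used" columns show the peak since the reset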

My questions:

1. Are there any memory-efficient alternatives to lm/glm in R?

2. Is there any way to prevent the Fortran routine "dqrls" from producing
so much output? (I suspect not, since its output has to be compatible with
the summary method, which seems to rely on having a copy of all variables
instead of just references to the relevant variables - correct me if I'm
wrong on this.)

3. Failing 1 & 2, how easy would it be to create new versions of lm and
glm that don't use so much memory? (Not that I'm volunteering or anything
;) ). There is no need to hold individual residuals in memory or make
copies of the variables, at least for my purposes - a rough sketch of what
I mean follows below. How well documented is the source code?
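
To illustrate what I mean by "lean", here is a rough, hypothetical sketch
(not lm's actual internals): accumulate X'X and X'y chunk by chunk and
solve the normal equations, so that neither the full design matrix nor the
residuals are ever held in memory at once:

# hypothetical lean least-squares fit: only p x p and p x 1 accumulators
# are kept across chunks; no residuals or data copies are stored
lean_lm <- function(chunks) {
  XtX <- 0
  Xty <- 0
  for (ch in chunks) {                  # each chunk is list(X = matrix, y = vector)
    XtX <- XtX + crossprod(ch$X)        # t(X) %*% X for this chunk
    Xty <- Xty + crossprod(ch$X, ch$y)  # t(X) %*% y for this chunk
  }
  drop(solve(XtX, Xty))                 # coefficients only
}

# toy usage with two in-memory chunks of simulated data
make_chunk <- function(n) {
  x <- runif(n)
  list(X = cbind(1, x), y = 3 * x + rnorm(n))
}
lean_lm(list(make_chunk(500000), make_chunk(500000)))

The normal-equations route is less numerically stable than the QR
decomposition lm uses, but it only ever needs a p x p matrix in memory.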

cheers
Damien Moore



