[R] lean and mean lm/glm?

Mon Aug 21 19:49:14 CEST 2006

Hi All: I'm new to R and have a few questions about getting R to run efficiently with large datasets.

I'm running R on Windows XP with 1 GB of RAM (so about 600-700 MB after the usual Windows overhead). I have a dataset that has 4 million observations and about 20 variables. I want to run probit regressions on this data, but can't do this with more than about 500,000 observations before I start running out of RAM (you could argue that I'm getting sufficient precision with <500,000 obs, but let's pretend otherwise). Loading 500,000 observations into a data frame only takes about 100 MB of RAM, so that isn't the problem. Instead it seems R uses a huge amount of memory when running the glm methods. I called the Fortran routines that lm and glm use directly, but even they create a large number of extraneous variables, both in the output (e.g. the Xs, ys, residuals etc.) and during processing. For instance (sample code):

x <- runif(1000000)
# I notice this step chews up a lot more than the 7 MB of RAM required to
# store y during processing, but cleans up OK afterwards with a gc() call
y <- 3 * x + rnorm(1000000)
X <- cbind(x)
p <- ncol(X)
n <- NROW(y)
ny <- NCOL(y)
tol <- 1e-7
# This is the Fortran routine called by lm - regressing y on X here
z <- .Fortran("dqrls", qr = X, n = n, p = p, y = y, ny = ny,
              tol = as.double(tol), coefficients = mat.or.vec(p, ny),
              residuals = y, effects = y, rank = integer(1), pivot = 1:p,
              qraux = double(p), work = double(2 * p), PACKAGE = "base")

This code runs very quickly - suggesting that in principle R should have no problem at all handling very large data sets - but it uses >100 MB during processing and z is about a 20 MB object. Scaling this up to a much larger dataset with many variables, it's easy to see I'm going to run into problems.
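To see where the memory in a fitted object goes, the size of each component can be listed with object.size(). The following is only a rough sketch: it uses lm.fit() as a stand-in for the direct Fortran call above (an assumption, made so the example is self-contained), and a smaller n so it runs quickly.

```r
# Sketch: list the size of each component returned by lm.fit(), which is
# used here as a stand-in for the direct dqrls call (an assumption).
set.seed(1)
n <- 100000
x <- runif(n)
y <- 3 * x + rnorm(n)
X <- cbind(x)

fit <- lm.fit(X, y)

# Size in bytes of every component of the fitted object
sizes <- sapply(fit, function(comp) as.numeric(object.size(comp)))

# Components that need at least one double (8 bytes) per observation
per_obs <- names(sizes)[sizes >= 8 * n]

print(round(sizes / 2^20, 2))  # sizes in MB
print(per_obs)
```

The components listed in per_obs are the ones whose size grows with the number of observations, so they are what makes the result large on a big dataset.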

My questions:
1. Are there any memory-efficient alternatives to lm/glm in R?
2. Is there any way to prevent the Fortran routine "dqrls" from producing so much output? (I suspect not, since its output has to be compatible with the summary method, which seems to rely on having a copy of all variables instead of just references to the relevant variables - correct me if I'm wrong on this.)
3. Failing 1 & 2, how easy would it be to create new versions of lm and glm that don't use so much memory? (Not that I'm volunteering or anything ;) ). There is no need to hold individual residuals in memory or make copies of the variables (at least for my purposes). How well documented is the source code?
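
As a rough illustration of what question 3 is asking for, here is a minimal sketch of a probit fit that never stores residuals or per-observation copies. It assumes the standard iteratively reweighted least squares (IRLS) updates for a probit link and accumulates only the p x p matrix X'WX and the p-vector X'Wz over chunks of rows. The function name chunked_probit and its arguments are hypothetical, not part of R, and the data are simulated. In a real setting each chunk would be read from disk rather than sliced from a matrix already in memory.

```r
# Hypothetical helper (not part of R): probit regression by IRLS, touching
# the data one chunk of rows at a time and keeping only p x p and p-length
# accumulators between chunks.
chunked_probit <- function(X, y, chunk_size = 50000L, max_iter = 50L,
                           tol = 1e-10) {
  n <- nrow(X)
  p <- ncol(X)
  beta <- numeric(p)                      # start from all-zero coefficients
  starts <- seq(1L, n, by = chunk_size)
  for (iter in seq_len(max_iter)) {
    XtWX <- matrix(0, p, p)
    XtWz <- numeric(p)
    for (s in starts) {
      idx <- s:min(s + chunk_size - 1L, n)
      Xc  <- X[idx, , drop = FALSE]
      eta <- drop(Xc %*% beta)            # linear predictor for this chunk
      mu  <- pnorm(eta)                   # fitted probabilities
      d   <- dnorm(eta)                   # d(mu)/d(eta) for the probit link
      w   <- d^2 / (mu * (1 - mu))        # IRLS working weights
      z   <- eta + (y[idx] - mu) / d      # IRLS working response
      XtWX <- XtWX + crossprod(Xc, w * Xc)
      XtWz <- XtWz + drop(crossprod(Xc, w * z))
    }
    beta_new  <- drop(solve(XtWX, XtWz))  # weighted least squares update
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

# Simulated data, small enough to compare against glm.fit()
set.seed(42)
n <- 200000
x <- runif(n)
y <- as.numeric(-0.5 + x + rnorm(n) > 0)
X <- cbind(intercept = 1, x = x)

beta_chunked <- chunked_probit(X, y)
beta_glm <- glm.fit(X, y, family = binomial(link = "probit"))$coefficients
print(rbind(chunked = beta_chunked, glm = beta_glm))
```

Solving the normal equations this way is less numerically stable than the QR decomposition that lm and glm use, and the sketch has no safeguards for fitted probabilities that reach 0 or 1, so it shows the memory pattern only and is not a replacement for glm.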

cheers
Damien Moore