[R] lean and mean lm/glm?

Wed Aug 23 19:15:29 CEST 2006

On Wed, 23 Aug 2006, Damien Moore wrote:

>
> Thomas Lumley < tlumley at u.washington.edu > wrote:
>
>> I have written most of a bigglm() function where the data= argument is a
>> function with a single argument 'reset'. When called with reset=FALSE the
>> function should return another chunk of data, or NULL if no data are
>> available, and when called with reset=TRUE it should go back to the
>> beginning of the data. I don't think this is too inelegant.
>
> yes, that does sound like a pretty elegant solution. It would be even 
> more so if you could offer a default implementation of the data_function 
> that simply passes chunks of large X and y matrices held in memory.

I have done that for data frames.
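
As a rough sketch of what such a default could look like (this is not the 
bigglm() code; make_chunk_reader and chunk_size are hypothetical names 
chosen for illustration, and returning NULL on reset is an assumption):

```r
# Hypothetical helper: wraps a data frame in a function(reset) closure.
make_chunk_reader <- function(df, chunk_size = 1000) {
  pos <- 0  # number of rows already handed out
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0                 # go back to the beginning of the data
      return(invisible(NULL))   # assumption: nothing useful to return here
    }
    if (pos >= nrow(df)) return(NULL)  # no data left
    idx <- seq(pos + 1, min(pos + chunk_size, nrow(df)))
    pos <<- pos + length(idx)
    df[idx, , drop = FALSE]     # only this chunk is copied
  }
}

# Usage: walk through mtcars in chunks of 10 rows, then rewind.
reader <- make_chunk_reader(mtcars, chunk_size = 10)
n_rows <- 0
while (!is.null(chunk <- reader(reset = FALSE))) {
  n_rows <- n_rows + nrow(chunk)
}
reader(reset = TRUE)
```

The closure keeps a reference to the data frame plus a row counter, so the 
caller only ever sees one chunk at a time.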

> (ideally you would just initialize the data_function to reference the X 
> and y data to avoid duplicating it, don't know if that's possible in R.)

The part that is extracted is a copy. The whole thing isn't copied, 
though.

The chunk would have to be a copy if it were an R matrix because matrices 
are stored in contiguous column-major format and a chunk of rows won't be 
contiguous in memory. I think an implementation that uses precomputed design 
matrices would want to be written in C and call the incremental QR 
decomposition routines row by row.  The reason for working in chunks in R 
is to allow model.frame and model.matrix to work reasonably efficiently, 
and they aren't needed if you already have the design matrix.
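
To illustrate the chunking idea only, here is a sketch that runs 
model.frame and model.matrix on one chunk at a time. Note that it 
accumulates X'X and X'y (the normal equations) instead of using the 
incremental QR routines, which is less stable numerically, and it assumes 
numeric predictors so that every chunk yields the same columns. 
fit_in_chunks is a hypothetical name.

```r
# Hypothetical helper: least squares from per-chunk cross-products.
# Swapped-in technique: normal equations, not incremental QR.
fit_in_chunks <- function(formula, df, chunk_size = 10) {
  xtx <- NULL
  xty <- NULL
  for (start in seq(1, nrow(df), by = chunk_size)) {
    end <- min(start + chunk_size - 1, nrow(df))
    chunk <- df[start:end, , drop = FALSE]
    mf <- model.frame(formula, chunk)   # only the chunk is expanded
    X <- model.matrix(formula, mf)
    y <- model.response(mf)
    if (is.null(xtx)) {
      xtx <- crossprod(X)
      xty <- crossprod(X, y)
    } else {
      xtx <- xtx + crossprod(X)
      xty <- xty + crossprod(X, y)
    }
  }
  drop(solve(xtx, xty))  # coefficient vector
}

beta_chunked <- fit_in_chunks(mpg ~ wt + hp, mtcars)
beta_full <- coef(lm(mpg ~ wt + hp, data = mtcars))
```

On a small, well-conditioned problem like this the two sets of 
coefficients should agree to within rounding error.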

> how long before it's ready? :)

Depends on how many more urgent things intervene.

 	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle