[R] lean and mean lm/glm?

Wed Aug 23 18:14:15 CEST 2006

Thomas Lumley < tlumley at u.washington.edu > wrote:

>I have written most of a bigglm() function where the data= argument is a 
>function with a single argument 'reset'. When called with reset=FALSE the 
>function should return another chunk of data, or NULL if no data are 
>available, and when called with reset=TRUE it should go back to the 
>beginning of the data. I don't think this is too inelegant.

Yes, that does sound like a pretty elegant solution. It would be even more so if you could offer a default implementation of the data_function that simply passes chunks of large X and y matrices held in memory. (Ideally you would just initialize the data_function to reference the X and y data to avoid duplicating it; I don't know if that's possible in R.) How long before it's ready? :)
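
For illustration, here is a minimal sketch of what such a default in-memory data function might look like. The name make_memory_source, the chunk_size argument and the list(X, y) return shape are assumptions made for this example, not part of any released bigglm() interface. On the duplication question: R copies on modification, so a closure that only reads X and y should share the original objects rather than copy them; only the chunk returned on each call is newly allocated.

```r
# Hypothetical helper: wraps in-memory X and y in a function that follows
# the 'reset' protocol described above. Not part of any package.
make_memory_source <- function(X, y, chunk_size = 1000) {
  force(X); force(y)          # evaluate the arguments now; no copy is made
  pos <- 0                    # index of the last row already handed out
  function(reset = FALSE) {
    if (reset) {              # go back to the beginning of the data
      pos <<- 0
      return(invisible(NULL))
    }
    if (pos >= nrow(X)) return(NULL)   # no data left
    idx <- seq(pos + 1, min(pos + chunk_size, nrow(X)))
    pos <<- idx[length(idx)]
    list(X = X[idx, , drop = FALSE], y = y[idx])
  }
}

# Usage: walk through 10 rows in chunks of 4, then rewind.
set.seed(1)
X <- cbind(1, matrix(rnorm(20), 10, 2))
y <- rnorm(10)
src <- make_memory_source(X, y, chunk_size = 4)
sizes <- integer(0)
while (!is.null(chunk <- src())) sizes <- c(sizes, nrow(chunk$X))
print(sizes)                  # 4 4 2
src(reset = TRUE)
first_again <- src()
```

The position counter lives in the closure's environment, so each data function created this way keeps its own place in the data.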

--- On Wed 08/23, Thomas Lumley < tlumley at u.washington.edu > wrote:

From: Thomas Lumley [mailto: tlumley at u.washington.edu]
To: damien.moore at excite.com
Cc: r-help at stat.math.ethz.ch, ripley at stats.ox.ac.uk
Date: Wed, 23 Aug 2006 08:25:54 -0700 (PDT)
Subject: Re: [R] lean and mean lm/glm?

On Wed, 23 Aug 2006, Damien Moore wrote:

>
> Thomas Lumley wrote:
>
>> No, it is quite straightforward if you are willing to make multiple passes
>> through the data. It is hard with a single pass and may not be possible
>> unless the data are in random order.
>>
>> Fisher scoring for glms is just an iterative weighted least squares
>> calculation using a set of 'working' weights and 'working' response. These
>> can be defined chunk by chunk and fed to biglm. Three iterations should
>> be sufficient.
>
> (NB: Although not stated clearly I was referring to single pass when I 
> wrote "impossible"). Doing as you suggest with multiple passes would 
> entail either sticking the database input calls into the main iterative 
> loop of a lookalike glm.fit or lumping the user with a very unattractive 
> sequence of calls:

I have written most of a bigglm() function where the data= argument is a 
function with a single argument 'reset'. When called with reset=FALSE the 
function should return another chunk of data, or NULL if no data are 
available, and when called with reset=TRUE it should go back to the 
beginning of the data. I don't think this is too inelegant.
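
To make the idea concrete, here is a rough sketch of a multi-pass fit driven by such a data function. It is not the bigglm() code: make_csv_source and fit_chunked_logit are hypothetical names, the model is hard-wired to logistic regression, the normal equations are accumulated directly rather than through biglm's incremental QR, and the fixed count of 8 passes from zero starting values is an arbitrary conservative choice.

```r
# Hypothetical data function over a CSV file, following the 'reset' protocol.
make_csv_source <- function(path, chunk_size) {
  con <- NULL
  cols <- NULL
  function(reset = FALSE) {
    if (reset) {                       # reopen the file and skip the header
      if (!is.null(con)) close(con)
      con <<- file(path, open = "r")
      cols <<- strsplit(readLines(con, n = 1), ",")[[1]]
      return(invisible(NULL))
    }
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) {          # end of data: close and signal NULL
      close(con)
      con <<- NULL
      return(NULL)
    }
    read.csv(text = lines, header = FALSE, col.names = cols)
  }
}

# One pass over the data per Fisher scoring iteration. Each chunk adds its
# share of X'WX and X'Wz, built from the working weights and working response.
fit_chunked_logit <- function(data, formula, n_iter = 8) {
  beta <- NULL
  for (iter in seq_len(n_iter)) {
    data(reset = TRUE)
    XtWX <- 0
    XtWz <- 0
    while (!is.null(chunk <- data(reset = FALSE))) {
      X <- model.matrix(formula, chunk)
      y <- model.response(model.frame(formula, chunk))
      if (is.null(beta)) beta <- rep(0, ncol(X))
      eta <- drop(X %*% beta)
      mu <- plogis(eta)
      w <- mu * (1 - mu)               # working weights
      z <- eta + (y - mu) / w          # working response
      XtWX <- XtWX + crossprod(X, w * X)
      XtWz <- XtWz + crossprod(X, w * z)
    }
    beta <- drop(solve(XtWX, XtWz))    # weighted least squares update
  }
  beta
}

# Simulated data written to a temporary file, then fitted 250 rows at a time.
set.seed(42)
dat <- data.frame(x = rnorm(2000))
dat$y <- rbinom(2000, 1, plogis(-0.5 + dat$x))
path <- tempfile(fileext = ".csv")
write.csv(dat, path, row.names = FALSE, quote = FALSE)
beta_chunked <- fit_chunked_logit(make_csv_source(path, 250), y ~ x)
beta_full <- coef(glm(y ~ x, family = binomial(), data = dat))
print(cbind(chunked = beta_chunked, full = beta_full))
```

Only one chunk is in memory at a time; the price is one full read of the file per iteration.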

In general I don't think a one-pass algorithm is possible. If the data are 
in random order then you could read one chunk, fit a glm, and set up a 
grid of coefficient values around the estimate. You then read the rest of 
the data, computing the loglikelihood and score function at each point in 
the grid. After reading all the data you can then fit a suitable smooth 
surface to the loglikelihood. I don't know whether this will give 
sufficient accuracy, though.
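
A small sketch of that grid idea, for what it is worth. Everything specific here is an assumption made for illustration: a two-coefficient logistic model on simulated data, a 7 x 7 grid spanning 3 standard errors either side of the first-chunk estimate, and a plain quadratic surface fitted with lm(). It uses the loglikelihood only and ignores the score.

```r
set.seed(1)
n <- 5000
x <- rnorm(n)                          # simulated, so already in random order
y <- rbinom(n, 1, plogis(-0.5 + x))
X <- cbind(1, x)
chunks <- split(seq_len(n), ceiling(seq_len(n) / 500))

# Step 1: fit a glm to the first chunk only.
fit1 <- glm(y ~ x, family = binomial(), subset = chunks[[1]])
est <- unname(coef(fit1))
se <- unname(sqrt(diag(vcov(fit1))))

# Step 2: set up a grid of coefficient values around that estimate.
steps <- seq(-3, 3, length.out = 7)
grid <- as.matrix(expand.grid(b0 = est[1] + se[1] * steps,
                              b1 = est[2] + se[2] * steps))

# Step 3: accumulate the loglikelihood at every grid point, chunk by chunk.
# The first chunk is included; in a real single pass it is still in memory.
loglik <- numeric(nrow(grid))
for (idx in chunks) {
  eta <- X[idx, , drop = FALSE] %*% t(grid)   # one column per grid point
  loglik <- loglik + colSums(y[idx] * eta - log1p(exp(eta)))
}

# Step 4: fit a quadratic surface and solve for its stationary point.
surf <- data.frame(grid, ll = loglik)
quad <- lm(ll ~ b0 + b1 + I(b0^2) + I(b1^2) + I(b0 * b1), data = surf)
a <- unname(coef(quad))
H <- matrix(c(2 * a[4], a[6], a[6], 2 * a[5]), 2, 2)   # Hessian of the surface
one_pass <- solve(H, -a[2:3])

full <- unname(coef(glm(y ~ x, family = binomial())))
print(rbind(one_pass = one_pass, full = full))
```

How close the two rows come will depend on how nearly quadratic the loglikelihood is over the grid, which is exactly the accuracy question left open above.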

For really big data sets you are probably better off with the approach 
that Brian Ripley and Fei Chen used -- they have shown that it works and 
there is unlikely to be anything much simpler that also works that they 
missed.

-thomas

Thomas Lumley                   Assoc. Professor, Biostatistics
tlumley at u.washington.edu     University of Washington, Seattle