[R] lean and mean lm/glm?

Wed Aug 23 17:54:04 CEST 2006

Thomas Lumley wrote:
> On Wed, 23 Aug 2006, Damien Moore wrote:
> 
>> Thomas Lumley wrote:
>>
>>> No, it is quite straightforward if you are willing to make multiple passes
>>> through the data. It is hard with a single pass and may not be possible
>>> unless the data are in random order.
>>>
>>> Fisher scoring for glms is just an iterative weighted least squares
>>> calculation using a set of 'working' weights and 'working' response. These
>>> can be defined chunk by chunk and fed to biglm. Three iterations should
>>> be sufficient.
>> (NB: Although not stated clearly I was referring to single pass when I 
>> wrote "impossible"). Doing as you suggest with multiple passes would 
>> entail either sticking the database input calls into the main iterative 
>> loop of a lookalike glm.fit or lumping the user with a very unattractive 
>> sequence of calls:
> 
> I have written most of a bigglm() function where the data= argument is a 
> function with a single argument 'reset'. When called with reset=FALSE the 
> function should return another chunk of data, or NULL if no data are 
> available, and when called with reset=TRUE it should go back to the 
> beginning of the data.  I don't think this is too inelegant.
> 
> In general I don't think a one-pass algorithm is possible. If the data are 
> in random order then you could read one chunk, fit a glm, and set up a 
> grid of coefficient values around the estimate.  You then read the rest of 
> the data, computing the loglikelihood and score function at each point in 
> the grid.  After reading all the data you can then fit a suitable smooth 
> surface to the loglikelihood.  I don't know whether this will give 
> sufficient accuracy, though.
> 
> For really big data sets you are probably better off with the approach 
> that Brian Ripley and Fei Chen used -- they have shown that it works and 
> there is unlikely to be anything much simpler that also works that they 
> missed.
> 
> 
>  	-thomas
> 
> Thomas Lumley			Assoc. Professor, Biostatistics
> tlumley at u.washington.edu	University of Washington, Seattle
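
For readers who want to see the multi-pass scheme concretely, here is a minimal sketch of Fisher scoring done chunk by chunk. It is plain Python, not the biglm or bigglm() code. The names make_chunk_reader, solve and logistic_irls are hypothetical, the model is assumed to be logistic regression, and the six-row data set is made up for illustration. The data argument follows the protocol described above: called with reset=False it returns the next chunk or None, and called with reset=True it goes back to the beginning.

```python
import math

def make_chunk_reader(rows, chunk_size):
    """Return data(reset): reset=True rewinds, reset=False gives the next chunk or None."""
    pos = 0
    def data(reset=False):
        nonlocal pos
        if reset:
            pos = 0
            return None
        if pos >= len(rows):
            return None
        chunk = rows[pos:pos + chunk_size]
        pos += len(chunk)
        return chunk
    return data

def solve(a, b):
    """Solve the p x p system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def logistic_irls(data, p, n_iter=10):
    """One full pass over the data per iteration; only p x p sums are kept in memory."""
    beta = [0.0] * p
    for _ in range(n_iter):
        xtwx = [[0.0] * p for _ in range(p)]
        xtwz = [0.0] * p
        data(reset=True)
        while True:
            chunk = data(reset=False)
            if chunk is None:
                break
            for x, y in chunk:
                eta = sum(b * xi for b, xi in zip(beta, x))
                mu = 1.0 / (1.0 + math.exp(-eta))
                w = mu * (1.0 - mu)          # working weight
                z = eta + (y - mu) / w       # working response
                for j in range(p):
                    xtwz[j] += w * x[j] * z
                    for k in range(p):
                        xtwx[j][k] += w * x[j] * x[k]
        beta = solve(xtwx, xtwz)             # weighted least squares update
    return beta

# Made-up example: intercept plus one covariate, read two rows at a time.
rows = [((1.0, float(x)), y) for x, y in zip(range(6), [0, 0, 1, 0, 1, 1])]
data = make_chunk_reader(rows, chunk_size=2)
beta = logistic_irls(data, p=2)
```

Each iteration is one pass over the source, so the number of passes equals the number of Fisher scoring iterations. In a real setting the closure would wrap a database cursor or file reader rather than an in-memory list.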

What I would like to see someone work on is a kind of SQL code generator 
that, given a set of weights, passes through the database and computes a 
new weighted information matrix.  The code generator would treat the 
design matrix as a symbolic entity.  SQL or another suitable framework 
would return the p x p matrix for one iteration at a time.
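
A rough sketch of what such a generator might look like, using Python's sqlite3 module as a stand-in for the database. Everything here is an assumption made for illustration: the function names information_sql and weighted_information, the table obs, and the idea of writing each design-matrix column as an SQL expression (for example "1" for the intercept). The weight is taken to be a stored column, whereas in a full fitting loop it would be recomputed from the current coefficients on each iteration.

```python
import sqlite3

def information_sql(table, terms, weight):
    """Build one SELECT returning the upper triangle of X'WX.

    `terms` are SQL expressions, one per design-matrix column, so the
    design matrix is never materialised. Inputs are assumed trusted:
    they are pasted into the query text.
    """
    p = len(terms)
    sums = [
        f"SUM(({weight}) * ({terms[j]}) * ({terms[k]}))"
        for j in range(p) for k in range(j, p)
    ]
    return f"SELECT {', '.join(sums)} FROM {table}"

def weighted_information(conn, table, terms, weight):
    """Run the generated query and unpack the result into a symmetric p x p matrix."""
    p = len(terms)
    row = conn.execute(information_sql(table, terms, weight)).fetchone()
    values = iter(row)
    info = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j, p):
            info[j][k] = info[k][j] = float(next(values))
    return info

# Made-up example: three observations, design columns (1, x), weights in column w.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, w REAL)")
conn.executemany("INSERT INTO obs VALUES (?, ?)",
                 [(0.0, 1.0), (1.0, 2.0), (2.0, 1.0)])
info = weighted_information(conn, "obs", ["1", "x"], "w")
# info == [[4.0, 4.0], [4.0, 6.0]]
```

The database does the pass over the rows and only p(p+1)/2 numbers come back, which is the appeal of the approach. The same pattern would give X'Wz if the working response were added as a further expression.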

Frank

Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University