[R] lean and mean lm/glm?
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Wed Aug 23 17:54:04 CEST 2006
Thomas Lumley wrote:
> On Wed, 23 Aug 2006, Damien Moore wrote:
>> Thomas Lumley wrote:
>>> No, it is quite straightforward if you are willing to make multiple passes
>>> through the data. It is hard with a single pass and may not be possible
>>> unless the data are in random order.
>>> Fisher scoring for glms is just an iterative weighted least squares
>>> calculation using a set of 'working' weights and 'working' response. These
>>> can be defined chunk by chunk and fed to biglm. Three iterations should
>>> be sufficient.
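To illustrate the 'working' weights and 'working' response described above, here is a minimal base-R sketch for one chunk of data. The function name `working_chunk` and its arguments are assumptions made for this example, not part of biglm; it relies only on the `linkinv`, `mu.eta` and `variance` components of a standard R family object.

```r
# Working response and working weights for one chunk of data.
# X: design matrix for the chunk, y: response, beta: current coefficients.
# (working_chunk is a hypothetical helper, not a biglm function.)
working_chunk <- function(X, y, beta, family = binomial()) {
  eta <- drop(X %*% beta)        # linear predictor at the current estimate
  mu  <- family$linkinv(eta)     # fitted means
  dmu <- family$mu.eta(eta)      # derivative d mu / d eta
  list(
    z = eta + (y - mu) / dmu,        # working response
    w = dmu^2 / family$variance(mu)  # working weights
  )
}

# Usage on a tiny chunk, starting from beta = 0
X <- cbind(1, c(-1, 0, 1))
y <- c(0, 1, 1)
wk <- working_chunk(X, y, beta = c(0, 0))
```

Each chunk's `z` and `w` can then be passed to a weighted least squares update, so no chunk has to stay in memory after it has been processed.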
>> (NB: Although not stated clearly I was referring to single pass when I
>> wrote "impossible"). Doing as you suggest with multiple passes would
>> entail either sticking the database input calls into the main iterative
>> loop of a lookalike glm.fit or lumping the user with a very unattractive
>> sequence of calls:
> I have written most of a bigglm() function where the data= argument is a
> function with a single argument 'reset'. When called with reset=FALSE the
> function should return another chunk of data, or NULL if no data are
> available, and when called with reset=TRUE it should go back to the
> beginning of the data. I don't think this is too inelegant.
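A data function of the kind described above might look roughly like the following closure. This is only a sketch: `make_chunk_reader` is a hypothetical name, and an in-memory data frame stands in for the database or file that would really supply the chunks.

```r
# Hypothetical helper: wrap a data frame as a chunk-returning function
# with a single 'reset' argument, following the convention described above.
make_chunk_reader <- function(df, chunk_size) {
  pos <- 0L  # index of the last row already returned
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0L                 # go back to the beginning of the data
      return(invisible(NULL))
    }
    if (pos >= nrow(df)) return(NULL)  # no more data available
    rows <- seq(pos + 1L, min(pos + chunk_size, nrow(df)))
    pos <<- max(rows)
    df[rows, , drop = FALSE]
  }
}

# One full pass over the data, chunk by chunk
reader <- make_chunk_reader(mtcars, chunk_size = 10)
reader(reset = TRUE)
n_rows <- 0
while (!is.null(chunk <- reader(reset = FALSE))) {
  n_rows <- n_rows + nrow(chunk)
}
```

Keeping the read position inside the closure means the fitting routine only needs to call the function repeatedly and reset it before each new pass.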
> In general I don't think a one-pass algorithm is possible. If the data are
> in random order then you could read one chunk, fit a glm, and set up a
> grid of coefficient values around the estimate. You then read the rest of
> the data, computing the loglikelihood and score function at each point in
> the grid. After reading all the data you can then fit a suitable smooth
> surface to the loglikelihood. I don't know whether this will give
> sufficient accuracy, though.
> For really big data sets you are probably better off with the approach
> that Brian Ripley and Fei Chen used -- they have shown that it works and
> there is unlikely to be anything much simpler that also works that they
> Thomas Lumley Assoc. Professor, Biostatistics
> tlumley at u.washington.edu University of Washington, Seattle
What I would like to see someone work on is a kind of SQL code generator
that given a set of weights passes through the database and computes a
new weighted information matrix. The code generator would make the
design matrix a symbolic entity. SQL or other suitable framework would
return the p x p matrix for one iteration at a time.
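As a rough picture of what each pass would need to return, here is a sketch in R rather than generated SQL. It accumulates the p x p weighted information matrix X'WX, together with X'Wz, chunk by chunk for a logistic model. The name `irls_pass`, the list-of-chunks layout and the simulated data are all assumptions for illustration; in the scheme above the sums would be computed inside the database.

```r
# One pass over the data: accumulate X'WX (p x p) and X'Wz (length p)
# for logistic regression, given the current coefficients beta.
# (irls_pass is a hypothetical helper; chunks is a list of list(X, y).)
irls_pass <- function(chunks, beta) {
  p <- length(beta)
  XtWX <- matrix(0, p, p)
  XtWz <- numeric(p)
  for (ch in chunks) {
    eta <- drop(ch$X %*% beta)
    mu  <- plogis(eta)
    w   <- mu * (1 - mu)             # working weights for the logit link
    z   <- eta + (ch$y - mu) / w     # working response
    XtWX <- XtWX + crossprod(ch$X, w * ch$X)
    XtWz <- XtWz + drop(crossprod(ch$X, w * z))
  }
  list(XtWX = XtWX, XtWz = XtWz)
}

# Simulated data split into four chunks
set.seed(1)
n <- 2000
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1, 0.3)))
idx <- split(seq_len(n), rep(1:4, each = n / 4))
chunks <- lapply(idx, function(i) list(X = X[i, , drop = FALSE], y = y[i]))

# One pass per iteration; only p x p and p x 1 summaries are kept
beta <- numeric(ncol(X))
for (iter in 1:8) {
  acc  <- irls_pass(chunks, beta)
  beta <- drop(solve(acc$XtWX, acc$XtWz))
}
```

The fixed count of eight iterations is a conservative choice for the sketch; a real implementation would stop on a convergence criterion instead.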
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University