[R] lean and mean lm/glm?

Wed Aug 23 18:15:47 CEST 2006

On Wed, 23 Aug 2006, Frank E Harrell Jr wrote:

> Thomas Lumley wrote:
> > On Wed, 23 Aug 2006, Damien Moore wrote:
> > 
> > > Thomas Lumley wrote:
> > >
> > > > No, it is quite straightforward if you are willing to make multiple
> > > > passes
> > > > through the data. It is hard with a single pass and may not be possible
> > > > unless the data are in random order.
> > > >
> > > > Fisher scoring for glms is just an iterative weighted least squares
> > > > calculation using a set of 'working' weights and 'working' response.
> > > > These
> > > > can be defined chunk by chunk and fed to biglm. Three iterations should
> > > > be sufficient.
> > > (NB: Although not stated clearly I was referring to single pass when I
> > > wrote "impossible"). Doing as you suggest with multiple passes would
> > > entail either sticking the database input calls into the main iterative
> > > loop of a lookalike glm.fit or lumping the user with a very unattractive
> > > sequence of calls:
> > 
> > I have written most of a bigglm() function where the data= argument is a
> > function with a single argument 'reset'. When called with reset=FALSE the
> > function should return another chunk of data, or NULL if no data are
> > available, and when called with reset=TRUE it should go back to the
> > beginning of the data.  I don't think this is too inelegant.
> > 
> > In general I don't think a one-pass algorithm is possible. If the data are
> > in random order then you could read one chunk, fit a glm, and set up a grid
> > of coefficient values around the estimate.  You then read the rest of the
> > data, computing the loglikelihood and score function at each point in the
> > grid.  After reading all the data you can then fit a suitable smooth surface
> > to the loglikelihood.  I don't know whether this will give sufficient
> > accuracy, though.

Not in general.  One of the problems with a binomial/Poisson glm is the
geometry of the likelihood can be radically changed by a single case: 
suppose that the initial sample were separable?  Misclassifications can 
really get you: one case with an incorrect label can contribute 
arbitrarily much to the log-likelihood (and I have seen 11,000 units).

> > For really big data sets you are probably better off with the approach that
> > Brian Ripley and Fei Chen used -- they have shown that it works and there
> > unlikely to be anything much simpler that also works that they missed.
> 
> What I would like to see someone work on is a kind of SQL code generator that
> given a set of weights passes through the database and computes a new weighted
> information matrix.  The code generator would make the design matrix a
> symbolic entity.  SQL or other suitable framework would return the p x p
> matrix for one iteration at a time.

That is one of the things Fei Chen and I did (using SQL extensions in 
MySQL), and probably the most successful.  We managed to explore models 
with tens of millions of cases with 30 categorical explanatory vars (and 
the data structure and problem was such that this was worthwhile, not 
least because predictions might be needed for 0.01% of the coverage).

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595