[R] LM: Least Squares on Large Datasets OR why lm() is designed the w ay it is

Fri Aug 9 20:45:51 CEST 2002

The information is equally stored in the QR decomposition as used by R,
*but* that approach is numerically much more accurate.  (BTW, your
approach needs means and centred covariances or it will be very poor
numerically.)  It's certainly possible to do this, and indeed it is
the philosophy behind the sweep operator that underlies much of SAS.

However, what you have overlooked is forming X.  That's the
memory-consuming part of the procedure, forming the model matrix.
In special cases that can be overcome: see the examples in `S
Programming'.  But that's why SAS has to treat factor variables separately
in regression.

I've never yet seen a large dataset (say over 10,000 rows) for which least
squares was an appopriate analysis (due to lack of homogeneity).  Also,
10,000 rows is sufficient to estimate the coefficients to very high
accuracy, to the point where non-random effects (e.g. lack of linearity,
missing predictors, omitted groupings etc) are probably larger.

lm is designed to provide the information statisticians use on reasonably
sized problems.  There are research projects (including one here) on doing
other calculations on very large datasets.

As an aside, stepAIC adds variables to the original fit, and that's not
possible within your proposed framework.

On Fri, 9 Aug 2002, Vadim Ogranovich wrote:

> I have always been wondering why S-Plus/R can not fit a linear model to an
> arbitrary large data set given that, I thought, it should be pretty
> straightforward. Sometime ago I came across a reference to LM package,
> http://www.econ.uiuc.edu/~anovo/LM.html, by Roger Koenker and Alvaro Novo.
> So I thought here it is at last, but to my surprise this project hasn't made
> to the recommended packages and its development seems to be stopped. I take
> it as a strong evidence that there is a conceptual problem in doing this
> sort of things and I thought it would be very educational for me to
> understand it.

Not in doing it, but in its being appropriate.

> Here is how I would structure lm object, please feel free to point mistakes
> out.
>
> Suppose we want to analyze lm(Y ~ X), where Y is a vector and X is a matrix
> 1. Under the classical assumptions of normality and independence of the
> residuals all information about the model is encapsulated in the covariance
> matrix of [Y,X] and the observation count, i.e. length(Y). These include
> variance of coefficients, their significance levels, ability to compute
> predictions, etc. Moreover, all sub-models, i.e. a regression on any subset
> of X columns are also readily computable, as well as ANOVA.
> Given this I'd store the covmatrix of [Y,X] and the count on an lm object
> and write summary.lm, anova.lm, step, stepAIC functions in terms of these
> two members only.
> I guess this is the idea behind the LM package.
>
> 2. There is whole lot of tests that are designed to check the classical
> assumptions of normality of the residuals, detect influential points, etc.
> Obviously these can not possibly be carried out without the residuals, etc.
> So the lm object should provide a slot for the residuals, but whether the
> residuals are in fact computed should not affect the functions mentioned in
> the previous paragraph.
>
>
> I will appreciate any comment on this "design".

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._