[R] lm handling of ill-conditioned systems

Thomas Lumley thomas at biostat.washington.edu
Mon Mar 20 18:15:52 CET 2000


On 20 Mar 2000, Tomislav Goles wrote:

> 
> The lm() function in R seems to handle the inversion of singular X'X matrices
> (where there is collinearity between regression inputs) in a way
> where one of the inputs is dropped, and this also seems to be the
> default behavior in SAS (please let me know if I'm wrong about this).
> In some other packages (e.g. Octave's ols() function) the pseudo-inverse
> is computed, where singular values less than some small threshold are
> not included in the computation of the inverse but the data columns
> themselves are not dropped.
> I am wondering if a few experts could comment on what is a better (best?)
> way to handle singular matrices in this context, or could perhaps point me
> to some literature on this.


When the design matrix is not of full rank there is no information in the
data about certain linear combinations of the variables.  This means that
there is no unique least squares or maximum likelihood estimate of the
parameters. This isn't a numerical analysis issue, it's a statistical one.

While it is possible to pick out a single estimate using some arbitrary
criterion, it is not possible to construct a useful covariance matrix for
these parameters. For example, if a regression has two identical
predictors, the variance of each parameter estimator and their covariance
will be (+/-) infinite, even though the sum of the estimators has a
perfectly sensible finite variance.
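
A quick numerical illustration, with made-up data rather than anything from
the original question: make the two predictors almost, but not exactly,
identical so that lm() keeps both, and the individual standard errors blow
up while the standard error of their sum stays perfectly reasonable.

  set.seed(1)
  x1  <- rnorm(50)
  x2  <- x1 + 1e-4 * rnorm(50)      # nearly collinear, but still full rank
  y   <- 1 + x1 + x2 + rnorm(50)
  fit <- lm(y ~ x1 + x2)
  sqrt(diag(vcov(fit)))             # huge standard errors for x1 and x2
  v   <- vcov(fit)
  sqrt(v["x1","x1"] + v["x2","x2"] + 2*v["x1","x2"])  # se of (b1 + b2): small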

For this reason you don't gain much, if anything, by allowing least
squares estimation of all the parameters.  It's also more work -- I
believe the algorithm involves the singular value decomposition. People
who want to estimate non-identifiable parameters can work it out
themselves.
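
For what it's worth, here is a rough sketch of the pseudo-inverse approach
the original question describes (the function name and tolerance are made up
for illustration): the minimum-norm least squares solution obtained by
thresholding the singular values.

  pinv.lstsq <- function(X, y, tol = 1e-7) {
    s    <- svd(X)
    keep <- s$d > tol * s$d[1]            # discard small singular values
    ## minimum-norm solution:  V_k diag(1/d_k) U_k' y
    s$v[, keep, drop = FALSE] %*%
      ((t(s$u[, keep, drop = FALSE]) %*% y) / s$d[keep])
  }

This returns a coefficient for every column, but for the reasons above the
corresponding covariance matrix is not of much use.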

In S/R and in Stata (and possibly other statistics packages) variables are
dropped to create a submodel that can be estimated properly.  In S/R the
coefficients of the dropped variables are set to NA to emphasize that they
are completely unknown.  I don't know whether this was originally a design
intention or whether it was the simplest way to implement the computations,
but it makes as much sense as anything else and forces the user to notice
that the model was unidentifiable.
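
Concretely, with some made-up data:

  x1  <- rnorm(20)
  x2  <- x1                  # an exact copy, so the model is not identifiable
  y   <- rnorm(20)
  fit <- lm(y ~ x1 + x2)
  coef(fit)                  # the x2 coefficient comes back NA
  alias(fit)                 # reports that x2 is a linear combination of x1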

As a technical note, R does not in fact invert X'X to fit a linear model.
We find the QR decomposition of X and solve R beta = Q'Y.  This results in
roughly half as many digits of rounding error as if we had inverted X'X, at
a relatively modest additional computing cost.  The tolerance we currently
use in lm means that a column is discarded if its diagonal element of R
would be less than 10^-7 times the largest one.  I think this translates
into a variance inflation factor of about 10^14, so when we say 'singular'
we mean it.
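
If you want to see this at the R level (a sketch of the idea, not of the
compiled code lm actually calls):

  x1 <- rnorm(20); x2 <- x1; y <- rnorm(20)
  X  <- cbind("(Intercept)" = 1, x1 = x1, x2 = x2)
  qx <- qr(X, tol = 1e-7)                 # the same default tolerance lm() uses
  qx$rank                                 # 2, not 3: one column flagged as redundant
  qr.coef(qx, y)                          # solves R beta = Q'Y; dropped column is NA
  lm.fit(X, y, tol = 1e-7)$coefficients   # lm() goes through lm.fit(), same result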



	-thomas

Thomas Lumley
Assistant Professor, Biostatistics
University of Washington, Seattle

