[R] Seemingly simple "lm" giving unexpected results

Berend Hasselman bhh at xs4all.nl
Sat Apr 14 21:51:18 CEST 2012


On 14-04-2012, at 21:45, peter dalgaard wrote:

> 
> On Apr 14, 2012, at 14:40 , Berend Hasselman wrote:
> 
>> 
>> On 13-04-2012, at 22:20, Gene Leynes wrote:
>> 
>>> I can't figure out why this is returning an NA for the slope in one case,
>>> but not in the other.
>>> 
>>> I can tell that R thinks the first case is singular, but why isn't the
>>> second?
>>> 
>>> ## Define X and Y
>>> ## There are two versions of x
>>> ##     1) "as is"
>>> ##     2) shifted to start at 0
>>> y  = c(58, 57, 57, 279, 252, 851, 45, 87, 47)
>>> x1 = c(1334009411.437, 1334009411.437, 1334009411.437, 1334009469.297,
>>>        1334009469.297, 1334009469.297, 1334009485.697, 1334009485.697,
>>>        1334009485.697)
>>> x2 = x1 - min(x1)
>>> 
>>> ## Why doesn't the LM model work for the "as is" x?
>>> lm(y~x1)
>>> lm(y~x2)
>> 
>> 
>> With your data the matrix t(X) %*% X is extremely ill-conditioned for the case with x1.
>> See
>> 
>> http://en.wikipedia.org/wiki/Condition_number
>> http://mathworld.wolfram.com/ConditionNumber.html
>> http://en.wikipedia.org/wiki/Numerical_methods_for_linear_least_squares
>> 
>> You can check it with
>> 
>> makeXX <- function(x) {
>>   ## design matrix: x in the first column, an intercept column of ones second
>>   matrix(data = c(x, rep(1, length(x))), nrow = length(x), byrow = FALSE)
>> }
>> 
>> X1 <- makeXX(x1)
>> (XTX1 <- t(X1) %*% X1)
>> svd(XTX1)
>> 
>> and similar for x2.
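>> 
>> The 2-norm condition number is the ratio of the largest to the smallest
>> singular value, so you can read it off the svd() output directly, e.g.
>> 
>> X2 <- makeXX(x2)
>> with(svd(XTX1), d[1] / d[2])          # astronomically large for x1
>> with(svd(t(X2) %*% X2), d[1] / d[2])  # far more modest for x2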
> 
> <lecture on numerical linear algebra>
> 
> lm() is actually a bit smarter than to use the textbook formula (X'X)^{-1}X'Y: it is internally based on a QR decomposition, which is numerically far more stable. What it effectively does is orthogonalize the columns of X successively. 
> 
>> x <- cbind(1, x1)
>> qr(x)
> $qr
>                            x1
> [1,] -3.0000000 -4.002028e+09
> [2,]  0.3333333  9.555777e+01
> [3,]  0.3333333  3.456548e-01
> [4,]  0.3333333 -2.598428e-01
> [5,]  0.3333333 -2.598428e-01
> [6,]  0.3333333 -2.598428e-01
> [7,]  0.3333333 -4.314668e-01
> [8,]  0.3333333 -4.314668e-01
> [9,]  0.3333333 -4.314668e-01
> 
> $rank
> [1] 1
> 
> $qraux
> [1] 1.333333 1.345655
> 
> $pivot
> [1] 1 2
> 
> attr(,"class")
> [1] "qr"
> 
> However, notice the rank of 1. That's because lm() wants to protect the user against unwittingly plugging in a singular design matrix, so when the algorithm encounters a column that looks like it has been reduced to zero by the orthogonalization, it basically throws it out.
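> 
> A minimal sketch of that orthogonalization step (the same idea as, though
> not literally, what the underlying modified-Linpack routine does):
> 
> one <- rep(1, length(x1))
> r <- x1 - one * sum(one * x1) / sum(one^2)  # x1 minus its projection onto
>                                             # the intercept column
> sqrt(sum(r^2))                              # 95.56, the 9.555777e+01 in the
>                                             # R factor above
> sqrt(sum(r^2)) / sqrt(sum(x1^2))            # about 2.4e-8, below qr()'s
>                                             # default tol of 1e-07, so the
>                                             # column gets flagged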
> 
> You can defeat this mechanism by setting the tol= argument sufficiently low. In fact, you can do the same with lm itself:
> 
>> lm(y ~ x1, tol = 0)
> ...
> (Intercept)           x1  
> -2.474e+09    1.854e+00  
> 
> Notice that the slope is consistent with the one from the shifted x (subtracting a constant from x changes only the intercept, never the slope):
> 
>> lm(y ~ x2)
> ...
> (Intercept)           x2  
>    110.884        1.854  
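> 
> For completeness, the same coefficients should come straight out of the QR
> decomposition, which is essentially what lm() uses internally:
> 
> qr.coef(qr(x, tol = 0), y)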
> 
> In contrast, if you try it the textbook way:
> 
>> solve(crossprod(x), crossprod(x,y))
> Error in solve.default(crossprod(x), crossprod(x, y)) : 
>  system is computationally singular: reciprocal condition number = 2.67813e-34
> 
> Ah, well, let's try and defeat that:
> 
>> solve(crossprod(x), crossprod(x,y), tol=0)
>            [,1]
>   -2.959385e+09
> x1  2.218414e+00
> 
> Which, as you'll notice, is, er, somewhat off the mark.
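> 
> The reason it is so far off: forming X'X squares the condition number. A
> rough diagnostic (kappa() with exact = TRUE computes the 2-norm condition
> number from the SVD):
> 
> kappa(x, exact = TRUE)             # around 1e16-1e17, already at the limit
>                                    # of double precision
> kappa(crossprod(x), exact = TRUE)  # its square, around 1e33; compare the
>                                    # reciprocal condition number in the
>                                    # error message above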
> 
> </lecture on numerical linear algebra>

Thank you. Indeed. I'm perfectly well aware of this.
I was trying to illustrate the problem.

Berend


