[R] Error in lm() with very small (close to zero) regressor

peter dalgaard pdalgd at gmail.com
Sun Mar 29 12:42:23 CEST 2015


> On 28 Mar 2015, at 18:28 , Ben Bolker <bbolker at gmail.com> wrote:
> 
> peter dalgaard <pdalgd <at> gmail.com> writes:
> 
>> 
>> 
>>> On 28 Mar 2015, at 00:32 , RiGui <raluca.gui <at> business.uzh.ch> wrote:
>>> 
>>> Hello everybody,
>>> 
>>> I have encountered the following problem with lm():
>>> 
>>> When running lm() with a regressor close to zero - of the order
>>> e-10 - the value of the estimate is of huge absolute value, of
>>> order millions.
>>> 
>>> However, if I write the formula of the OLS estimator in matrix
>>> notation, pseudoinverse(t(X)*X) * t(X) * y, the results are
>>> correct, meaning the estimate has value 0.
> 
>  How do you know this answer is "correct"?
> 
>>> here is the code:
>>> 
>>> y  <- rnorm(n_obs, 10,2.89)
>>> x1 <- rnorm(n_obs, 0.00000000000001235657,0.000000000000000045)
>>> x2 <- rnorm(n_obs, 10,3.21)
>>> X  <- cbind(x1,x2)
>>> 
>>> bFE <- lm(y ~ x1 + x2)
>>> bFE
>>> 
>>> bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
>>> bOLS
>>> 
>>> 
>>> Note: I am applying a deviation-from-the-mean projection to the
>>> data; that is why I have some regressors with such small values.
>>> 
>>> Thank you for any help!
>>> 
>>> Raluca
> 
>  Is there a reason you can't scale your regressors?
> 
>>> 
>> Example not reproducible:
>> 
> 
>  I agree that the OP's question was not reproducible, but it's
> not too hard to make it reproducible. I bothered to use
> library("sos"); findFn("pseudoinverse") to find pseudoinverse()
> in corpcor:

Well, it shouldn't be my work, nor yours... And I thought it particularly egregious to treat a function from an unspecified package as gospel.

> 
> It is true that we get estimates with very large magnitudes,
> but their standard errors are correspondingly large, so the
> t statistics are unremarkable:
> 
> set.seed(101)
> n_obs <- 1000
> y  <- rnorm(n_obs, 10,2.89)
> x1 <- rnorm(n_obs, mean=1.235657e-14,sd=4.5e-17)
> x2 <- rnorm(n_obs, 10,3.21)
> X  <- cbind(x1,x2)
> 
> bFE <- lm(y ~ x1 + x2)
> bFE
> coef(summary(bFE))
> 
>                 Estimate   Std. Error     t value  Pr(>|t|)
> (Intercept)  1.155959e+01 2.312956e+01  0.49977541 0.6173435
> x1          -1.658420e+14 1.872598e+15 -0.08856254 0.9294474
> x2           3.797342e-02 2.813000e-02  1.34992593 0.1773461
> 
> library("corpcor")
> bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
> bOLS
> 
>             [,1]
> [1,] 9.807664e-16
> [2,] 8.880273e-01
> 
> And if we scale the predictor:
> 
> bFE2 <- lm(y ~ I(1e14*x1) + x2)
> coef(summary(bFE2))
> 
>                 Estimate Std. Error     t value  Pr(>|t|)
> (Intercept)   11.55958731   23.12956  0.49977541 0.6173435
> I(1e+14 * x1) -1.65842037   18.72598 -0.08856254 0.9294474
> x2             0.03797342    0.02813  1.34992593 0.1773461
> 
> bOLS stays constant.
> 
> To be honest, I haven't thought about this enough to see
> which answer is actually correct, although I suspect the
> problem is in bOLS, since the numerical methods (unlike
> the brute-force pseudoinverse method given here) behind
> lm have been carefully considered for numerical stability.
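Right. To spell out the difference (this is a sketch of my own, not the OP's code): lm() goes through a QR decomposition of the model matrix, so it never forms t(X) %*% X and never squares the condition number. Reusing the simulated data from above, qr.solve() reproduces the lm() coefficients, while the normal-equations route falls over:

```r
## Sketch: QR-based least squares (essentially what lm() does via
## dqrls) versus the explicit normal equations, on the data above.
set.seed(101)
n_obs <- 1000
y  <- rnorm(n_obs, 10, 2.89)
x1 <- rnorm(n_obs, mean = 1.235657e-14, sd = 4.5e-17)
x2 <- rnorm(n_obs, 10, 3.21)
X  <- cbind(1, x1, x2)              # model matrix with intercept

## QR route: same coefficients as coef(lm(y ~ x1 + x2))
qr.solve(X, y)

## Normal-equations route: crossprod(X) squares the condition
## number, and solve() rejects it as computationally singular
try(solve(crossprod(X), crossprod(X, y)))
```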

In particular, the pseudoinverse() function has a tol= argument which allows it to zap small singular values. With the default tolerance, the tiny singular value associated with x1 gets dropped, which is what pulls the bOLS coefficient toward zero:

> pseudoinverse(crossprod(X))%*%crossprod(X,y)
             [,1]
[1,] 9.807664e-16
[2,] 8.880273e-01
> pseudoinverse(crossprod(X),tol=1e-40)%*%crossprod(X,y)
              [,1]
[1,]  1.286421e+15
[2,] -5.327384e-01

Also, notice that there is no intercept in the above, so it would be more reasonable to compare to 

> bFE <- lm(y ~ x1 + x2-1)
> summary(bFE)

Call:
lm(formula = y ~ x1 + x2 - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.1712 -1.9439 -0.0321  1.7637  9.4540 

Coefficients:
    Estimate Std. Error t value Pr(>|t|)    
x1 7.700e+14  2.435e+13   31.62   <2e-16 ***
x2 3.766e-02  2.811e-02    1.34    0.181    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.771 on 998 degrees of freedom
Multiple R-squared:  0.9275,	Adjusted R-squared:  0.9273 
F-statistic:  6382 on 2 and 998 DF,  p-value: < 2.2e-16

Or, maybe better, use cbind(1, X) in the pseudoinverse() method. It doesn't help, though.

What really surprises me is this:

> X  <- cbind(x1,x2)
> crossprod(X)
             x1           x2
x1 1.526698e-25 1.265085e-10
x2 1.265085e-10 1.145462e+05
> solve(crossprod(X))
Error in solve.default(crossprod(X)) : 
  system is computationally singular: reciprocal condition number = 1.13052e-31
> solve(crossprod(X), tol=1e-40)
              x1            x2
x1  7.722232e+25 -8.528686e+10
x2 -8.528686e+10  1.029237e-04
> pseudoinverse(crossprod(X), tol=1e-40)
              [,1]          [,2]
[1,]  6.222152e+25 -6.217178e+10
[2,] -6.871948e+10  7.739466e-05

How does pseudoinverse() go so badly wrong? Apart from the scaling, this really isn't very ill-conditioned, yet a lot of accuracy seems to be lost in its SVD step. After rescaling x1, solve() and pseudoinverse() agree:

> X  <- cbind(1e14*x1,x2)
> solve(crossprod(X))
                            x2
    0.0077222325 -0.0008528686
x2 -0.0008528686  0.0001029237
> pseudoinverse(crossprod(X))
              [,1]          [,2]
[1,]  0.0077222325 -0.0008528686
[2,] -0.0008528686  0.0001029237
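The scaling story can be put in numbers with kappa(): forming the cross-product squares the condition number, which takes it from large-but-representable to beyond what double precision (about 1e16) can resolve. A rough illustration on the same simulated data (the magnitudes below are approximate):

```r
## Condition number before and after forming the cross-product.
## Squaring roughly 1e15 gives roughly 1e30, far past double
## precision, which is why solve() and pseudoinverse() both
## struggle with crossprod(X) above.
set.seed(101)
n_obs <- 1000
y  <- rnorm(n_obs, 10, 2.89)   # drawn only to keep the RNG stream aligned
x1 <- rnorm(n_obs, mean = 1.235657e-14, sd = 4.5e-17)
x2 <- rnorm(n_obs, 10, 3.21)
X  <- cbind(x1, x2)

kappa(X, exact = TRUE)             # roughly 1e15
kappa(crossprod(X), exact = TRUE)  # roughly 1e30, i.e. the square
```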


-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com



More information about the R-help mailing list