[R] very fast OLS regression?

Wed Mar 25 22:43:16 CET 2009

Ivo,

I would be careful about dealing with situations where x may be numerically ill-conditioned.  Your code will do quite badly in such cases.  Here is an extreme illustration:

set.seed(123)
x <- matrix(rnorm(20), 10, 2)
x <- cbind(x, x[,1])   # x is clearly rank-deficient
y <- c(x %*% c(1,2,3))

> byhand(y, x)
Error in solve.default(t(x) %*% x) : 
  Lapack routine dgesv: system is exactly singular
>

A better approach would be to explicitly use "QR" decomposition in your byhand() function:

byhand.qr = function( y, x, tol=1.e-07 ) {
qr.x <- qr(x, tol=tol, LAPACK=TRUE)
b<-qr.coef(qr.x, y)
## I will need these later, too:
## res<- qr.res(qr.x, y)
## soa[i]<-b[2]
## sigmas[i]<-sd(res)
b;
}

> ans <- byhand.qr(y, x)
> ans
[1] 1.795725 2.000000 2.204275
>

You can verify that this a "valid" solution:

> y - c(x %*% ans)
 [1]  9.436896e-16  2.498002e-16 -8.881784e-16  0.000000e+00 -4.440892e-16  1.776357e-15  4.440892e-16  0.000000e+00  4.440892e-16 -4.440892e-16
>

However, this is not the "unique" solution, since there an infinite number of solutions.  We can get "the" solution that has the minimum length by using Moore-Penrose inverse:

> require(MASS)
> ans.min <- c(ginv(x) %*% y)
> ans.min
[1] 2 2 2
>

You can verify that ans.min has a smaller L2-norm than ans, and it is unique.

Hope this helps,
Ravi.

____________________________________________________________________

Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University

Ph. (410) 502-2619
email: rvaradhan at jhmi.edu

----- Original Message -----
From: ivo welch <ivowel at gmail.com>
Date: Wednesday, March 25, 2009 4:31 pm
Subject: [R] very fast OLS regression?
To: r-help <r-help at stat.math.ethz.ch>

> Dear R experts:
>  
>  I just tried some simple test that told me that hand computing the OLS
>  coefficients is about 3-10 times as fast as using the built-in lm()
>  function.   (code included below.)  Most of the time, I do not care,
>  because I like the convenience, and I presume some of the time goes
>  into saving a lot of stuff that I may or may not need.  But when I do
>  want to learn the properties of an estimator whose input contains a
>  regression, I do care about speed.
>  
>  What is the recommended fastest way to get regression coefficients in
>  R?  (Is Gentlemen's weighted-least-squares algorithm implemented in a
>  low-level C form somewhere?  that one was always lightning fast for
>  me.)
>  
>  regards,
>  
>  /ivo
>  
>  
>  
>  bybuiltin = function( y, x )   coef(lm( y ~ x -1 ));
>  
>  byhand = function( y, x ) {
>    xy<-t(x)%*%y;
>    xxi<- solve(t(x)%*%x)
>    b<-as.vector(xxi%*%xy)
>    ## I will need these later, too:
>    ## res<-y-as.vector(x%*%b)
>    ## soa[i]<-b[2]
>    ## sigmas[i]<-sd(res)
>    b;
>  }
>  
>  
>  MC=500;
>  N=10000;
>  
>  
>  set.seed(0);
>  x= matrix( rnorm(N*MC), nrow=N, ncol=MC );
>  y= matrix( rnorm(N*MC), nrow=N, ncol=MC );
>  
>  ptm = proc.time()
>  for (mc in 1:MC) byhand(y[,mc],x[,mc]);
>  cat("By hand took ", proc.time()-ptm, "\n");
>  
>  ptm = proc.time()
>  for (mc in 1:MC) bybuiltin(y[,mc],x[,mc]);
>  cat("By built-in took ", proc.time()-ptm, "\n");
>  
>  ______________________________________________
>  R-help at r-project.org mailing list
>  
>  PLEASE do read the posting guide 
>  and provide commented, minimal, self-contained, reproducible code.