[R] Efficient way to perform linear regressions

Tue Feb 26 17:41:25 CET 2013

Hi All,

I have millions of regression lines to fit. So I am looking for the
most efficient approach in R.

Details:
I have a large desing matrix X. The dimension is n by p.
Each time when fitting the model, select rows from this matrix X and
form a new design matrix, called  X_current.
There is another binary matrix M, with dim m by n, and each row is a
1*n vector. It helps to determin X_current.
For instance, if the first row of M is (1,0,1,0,1,..,1), then I remove
the 2nd and 4th row from X, and use this submatrix as the design
matrix to fit the model.
if the second row is (0,0,0,1,...,1), then I use X[4:n,] as input
matrix to do the regression,
etc.

I would like to extract a single element of estimated parameter beta,
and the SSE from the results.

So far what I know is to use these functions which performs better than lm():
1. lsfit()  # this works faster than lm(), but have to extract the
results without using ls.print(print=F), maybe I/O slows down. And not
better than the other two approaches
2. directly use solve(X'X) and the formula from score equation to get
the results  # not sure how well for almost singular or wired matrices
3. directly use qr() with LAPACK or qr.solve()  and the same formula
# qr decomposion is reliable but takes a little more time than solve()

Below is a piece of code for demonstration and time comparision. I do
not know the best way to generate random non-singular design matrix,
so I add try() to ignore errors.
I did not include apply/sapply/lapply, seems they can save the memory
but in my case do not improve. Thank you.

#----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

M = matrix( NA, nrow = 10^4, ncol = 200)
for (i in 1:10^4)
{
	M[i,] = sample(c(0,1), size=200, replace = T)
}

X = cbind( rnorm(200,1,10) ,1:200, sample(1:10^5,200,T))
Y = rnorm(200)

# loop version
resu1=matrix(NA,nrow=10^4, ncol=2)
resu2=matrix(NA,nrow=10^4, ncol=2)
resu3=matrix(NA,nrow=10^4, ncol=2)
resu4=matrix(NA,nrow=10^4, ncol=2)

t1 = proc.time()
for (i in 1:10^4)
{
	fit.ls = lsfit( x = X[M[i,],], y = Y[M[i,]], intercept=F)
	resu1[i,] = c( fit.ls$coeff[2], sum((fit.ls$res)^2))
}
t2 = proc.time()
for (i in 1:10^4)
{
	hatM = t(X[M[i,],])%*%X[M[i,],]
	try( beta <- ( solve( hatM ) %*% t( X[M[i,],] )%*%Y[M[i,]] ), silent = T)
	try( varr <- sum((Y[M[i,]]-X[M[i,],]%*%beta)^2), silent = T)
	try( resu2[i, ] <- c( beta[2], varr ), silent = T)
}
t3 = proc.time()

for (i in 1:10^4)
{
	hatM = t(X[M[i,],])%*%X[M[i,],]]
	try( beta <- ( qr.solve( hatM ) %*% t( X[M[i,],] )%*%Y[M[i,]] ), silent = T)
	try( varr <- sum(( Y[M[i,]]-X[M[i,],]%*%beta)^2 ), silent = T)
	try( resu2[i, ] <- c( beta[2], varr )), silent = T)
}
t4 = proc.time()

(t2-t1)[3]
(t3-t2)[3]
(t4-t3)[3]

#----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------