[R] R versus SAS: lm performance

roger koenker rkoenker at uiuc.edu
Tue May 11 14:42:41 CEST 2004


I would be curious to know how sparse the model.matrix for this problem
is... Unless it is quite dense, or, as Brian implies, quite singular,
I might suggest computing a Cholesky factorization in SparseM.
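
The Cholesky route can be sketched in base R (shown dense here so it runs without extra packages; SparseM follows the same pattern with its sparse classes and methods): factor X'X once, then solve two triangular systems. All data and dimensions below are illustrative.

```r
set.seed(42)
n <- 200; p <- 6
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))   # illustrative model matrix
y <- rnorm(n)                                  # illustrative response

R <- chol(crossprod(X))                    # X'X = R'R, R upper triangular
z <- forwardsolve(t(R), crossprod(X, y))   # solve R'z = X'y
beta <- backsolve(R, z)                    # solve R beta = z
```

For a genuinely sparse model matrix, the factorization step is where the savings appear, since the sparse factor can be computed and stored far more cheaply than a dense one.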


url:	www.econ.uiuc.edu/~roger        	Roger Koenker
email	rkoenker at uiuc.edu			Department of Economics
vox: 	217-333-4558				University of Illinois
fax:   	217-244-6678				Champaign, IL 61820

On May 11, 2004, at 7:07 AM, Douglas Bates wrote:

> <Arne.Muller at aventis.com> writes:
>
>> Hello,
>>
>> A colleague of mine has compared the runtime of a linear model + anova 
>> in SAS and S+. He got the same results, but SAS took a bit more than 
>> a minute whereas S+ took 17 minutes. I've tried it in R (1.9.0) and 
>> it took 15 min. Neither machine ran out of memory, and I assume that 
>> all machines have similar hardware, but the S+ and SAS machines are 
>> on Windows whereas the R machine is Red Hat Linux 7.2.
>>
>> My question is whether I'm doing something wrong (technically) in 
>> calling the lm routine, or (if not), how I can optimize the call to lm 
>> or even use an alternative to lm. I'd like to run about 12,000 of 
>> these models in R (for a gene expression experiment - one model per 
>> gene, which would take far too long).
>>
>> I've run the following code in R (and S+):
>
> ...
>
> As Brian Ripley mentioned, you could save the model matrix and use it
> with each of your responses.  Versions 0.8-1 and later of the Matrix
> package have a vignette that provides comparative timings of various
> ways of obtaining the least squares estimates.  If you use the classes
> from the Matrix package and create and save the crossproduct of the
> model matrix
>
> mm = as(model.matrix(Va ~ Ba+Ti..., df), "geMatrix")
> cprod = crossprod(mm)
>
> then successive calls to
>
> coef = solve(cprod, crossprod(mm, df$Va))
>
> will produce the coefficient estimates much faster than will calls to
> lm, which each do all the work of generating and decomposing the very
> large model matrix.
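
Because the crossproduct of the model matrix does not depend on the response, the per-gene step reduces to a single multi-column solve. A base-R sketch of the idea (all names and sizes illustrative, here with 1,000 responses sharing one design):

```r
set.seed(1)
n <- 100; p <- 5; G <- 1000                    # G responses, one shared design
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))   # shared model matrix
Y <- matrix(rnorm(n * G), n, G)                # one column per gene
XtX <- crossprod(X)                            # formed once
B <- solve(XtX, crossprod(X, Y))               # p x G coefficients in one call
```

Each column of B equals the least-squares coefficients for the corresponding column of Y, but the expensive decomposition work is shared across all responses.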
>
> Note that this method only produces the coefficient estimates, which
> may be enough for your purposes.  Also, this method will not handle
> missing data or rank-deficient model matrices in the elegant way that
> lm does.
>
> If you are doing this 12,000 times it may be worthwhile checking if
> the sparse matrix formulation
>
> mmS = as(mm, "cscMatrix")
> cprodS = crossprod(mmS)
>
> is faster.
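
The class names in the message reflect the Matrix API of 2004; in current versions of the Matrix package the same comparison might be sketched as follows (illustrative data, not the original problem):

```r
library(Matrix)
set.seed(7)
n <- 500; p <- 20
X <- cbind(1, matrix(rbinom(n * (p - 1), 1, 0.05), n))  # mostly-zero design
y <- rnorm(n)

mmS    <- Matrix(X, sparse = TRUE)   # sparse model matrix
cprodS <- crossprod(mmS)             # sparse symmetric X'X
beta   <- solve(cprodS, crossprod(mmS, y))
```

Whether the sparse path wins depends on the fill-in of the factorization; for designs dominated by indicator columns it often does.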
>
> The dense matrix formulation (but not the sparse) can benefit from
> installation of optimized BLAS routines such as Atlas or Goto's BLAS.
>
> -- 
> Douglas Bates                            bates at stat.wisc.edu
> Statistics Department                    608/262-2598
> University of Wisconsin - Madison        
> http://www.stat.wisc.edu/~bates/
>



