[R] R versus SAS: lm performance

Liaw, Andy andy_liaw at merck.com
Tue May 11 14:19:40 CEST 2004


I tried the following on an Opteron 248, R-1.9.0 w/Goto's BLAS:

> y <- matrix(rnorm(14000*1344), 1344)
> x <- matrix(runif(1344*503),1344)
> system.time(fit <- lm(y~x))
[1] 106.00  55.60 265.32   0.00   0.00

The resulting fit object is over 600 MB.  (The coefficient component is a
504 x 14000 matrix.)
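
Back of the envelope, lm also keeps residuals, fitted.values and effects,
each a 1344 x 14000 double matrix:

> 1344 * 14000 * 8 / 2^20   # MB per matrix-valued component
[1] 143.5547

so a handful of those plus the qr component accounts for the size.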

If I'm not mistaken, SAS sweeps on the extended cross-product matrix to fit
regression models.  That, I believe, is usually faster than doing a QR
decomposition on the model matrix itself, but there are trade-offs.  You
could try what Prof. Bates suggested.
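
For comparison, a minimal sketch of the cross-product route (not what lm
does, and numerically less stable than QR, but close in spirit to the
sweep), reusing x and y from the session above:

> X <- cbind(1, x)          # model matrix with intercept, 1344 x 504
> XtX <- crossprod(X)       # t(X) %*% X, 504 x 504
> Xty <- crossprod(X, y)    # t(X) %*% y, 504 x 14000
> beta <- solve(XtX, Xty)   # all 14000 coefficient vectors in one solve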

Andy

> From: Arne.Muller at aventis.com
> 
> Hello,
> 
> thanks for your reply. I've now done the profiling, and I read it as
> saying that most of the time is spent in the Fortran routine(s):
> 
> Each sample represents 0.02 seconds.
> Total run time: 920.219999999453 seconds.
> 
> Total seconds: time spent in function and callees.
> Self seconds: time spent in function alone.
> 
>    %       total       %       self
>  total    seconds     self    seconds    name
> 100.00    920.22      0.02      0.16     "lm"
>  99.96    919.88      0.10      0.88     "lm.fit"
>  99.74    917.84     99.74    917.84     ".Fortran"
>   0.07      0.66      0.02      0.14     "storage.mode<-"
>   0.06      0.52      0.00      0.00     "eval"
>   0.06      0.52      0.04      0.34     "as.double"
>   0.02      0.22      0.02      0.22     "colnames<-"
>   0.02      0.20      0.02      0.20     "structure"
>   0.02      0.18      0.02      0.18     "model.matrix.default"
>   0.02      0.18      0.02      0.18     "as.double.default"
>   0.02      0.18      0.00      0.00     "model.matrix"
>   0.01      0.08      0.01      0.08     "list"
> 
>    %       self        %       total
>  self     seconds    total    seconds    name
>  99.74    917.84     99.74    917.84     ".Fortran"
>   0.10      0.88     99.96    919.88     "lm.fit"
>   0.04      0.34      0.06      0.52     "as.double"
>   0.02      0.22      0.02      0.22     "colnames<-"
>   0.02      0.20      0.02      0.20     "structure"
>   0.02      0.18      0.02      0.18     "as.double.default"
>   0.02      0.18      0.02      0.18     "model.matrix.default"
>   0.02      0.16    100.00    920.22     "lm"
>   0.02      0.14      0.07      0.66     "storage.mode<-"
>   0.01      0.08      0.01      0.08     "list"
> 
> I guess this actually means I cannot do anything about it ... other
> than maybe splitting the problem into different independent parts
> (which I actually may be able to do).
> 
> Regarding using lm.fit instead of lm: this might be a good idea, since
> I am using the same model matrix for all fits! However, I'd need to
> recreate an lm object from the output, because I'd like to run the
> anova function on it. I'll first do some profiling on lm versus lm.fit
> for the 12,000 models ...
> 
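> Something like this sketch, I suppose (made-up data; note that lm.fit,
> like lm, accepts a matrix response, so all genes can share one QR):
> 
>   d    <- data.frame(f1 = gl(2, 50), f2 = gl(5, 20))
>   ymat <- matrix(rnorm(100 * 1000), 100)      # 1000 "genes", 100 samples
>   X    <- model.matrix(~ f1 * f2, data = d)   # built once
>   fit  <- lm.fit(X, ymat)                     # one QR for all columns
> 
> Most of what anova() works from (effects, assign, df.residual) is in
> the lm.fit result, though anova.lm itself expects a full lm object.
> 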
> 	kind regards + thanks again for your help,
> 
> 	Arne
> 
> --
> Arne Muller, Ph.D.
> Toxicogenomics, Aventis Pharma
> arne dot muller domain=aventis com
> 
> > -----Original Message-----
> > From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
> > Sent: 11 May 2004 09:08
> > To: Muller, Arne PH/FR
> > Cc: r-help at stat.math.ethz.ch
> > Subject: Re: [R] R versus SAS: lm performance
> > 
> > 
> > The way to time things in R is system.time().
> > 
> > Without knowing much more about your problem we can only 
> > guess where R is 
> > spending the time.  But you can find out by profiling -- see 
> > `Writing R 
> > Extensions'.
> > 
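> > For instance, a minimal sketch:
> > 
> >   Rprof("lm.prof")                 # start the sampling profiler
> >   fit <- lm(y ~ x, data = data.frame(y = rnorm(1e5), x = rnorm(1e5)))
> >   Rprof(NULL)                      # stop profiling
> >   summaryRprof("lm.prof")          # or `R CMD Rprof lm.prof` from a shell
> > 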
> > If you want multiple fits with the same design matrix (do you?) you 
> > could look at the code of lm and call lm.fit repeatedly yourself.
> > 
> > On Mon, 10 May 2004 Arne.Muller at aventis.com wrote:
> > 
> > > Hello,
> > > 
> > > A colleague of mine has compared the runtime of a linear model + anova
> > > in SAS and S+. He got the same results, but SAS took a bit more than a
> > > minute whereas S+ took 17 minutes. I've tried it in R (1.9.0) and it
> > > took 15 min. Neither machine ran out of memory, and I assume that all
> > > machines have similar hardware, but the S+ and SAS machines are on
> > > Windows whereas the R machine is on Red Hat Linux 7.2.
> > > 
> > > My question is whether I'm doing something wrong (technically) in
> > > calling the lm routine, or (if not) how I can optimize the call to lm,
> > > or even use an alternative to lm. I'd like to run about 12,000 of these
> > > models in R (for a gene expression experiment - one model per gene),
> > > which would take far too long.
> > > 
> > > I've run the following code in R (and S+):
> > > 
> > > > options(contrasts=c('contr.helmert', 'contr.poly'))
> > > 
> > > The first column is the value to be modeled, and the others are factors.
> > > 
> > > > names(df.gene1data) <- c("Va", "Ba", "Ti", "Do", "Ar", "Pr")
> > > > df[c(1:2,1343:1344),]
> > >            Va    Do  Ti  Ba Ar    Pr
> > > 1    2.317804 000mM 24h NEW  1     1
> > > 2    2.495390 000mM 24h NEW  2     1
> > > 8315 2.979641 025mM 04h PRG 83    16
> > > 8415 4.505787 000mM 04h PRG 84    16
> > > 
> > > This is a data frame with 1344 rows.
> > > 
> > > x <- Sys.time();
> > > wlm <- lm(Va ~ Ba+Ti+Do+Pr+Ba:Ti+Ba:Do+Ba:Pr+Ti:Do+Ti:Pr+Do:Pr+
> > >           Ba:Ti:Do+Ba:Ti:Pr+Ba:Do:Pr+Ti:Do:Pr+Ba:Ti:Do:Pr+
> > >           (Ba:Ti:Do)/Ar, data=df, singular=T);
> > > difftime(Sys.time(), x)
> > > 
> > > Time difference of 15.33333 mins
> > > 
> > > > anova(wlm)
> > > Analysis of Variance Table
> > > 
> > > Response: Va
> > >              Df Sum Sq Mean Sq   F value    Pr(>F)    
> > > Ba            2    0.1     0.1    0.4262  0.653133    
> > > Ti            1    2.6     2.6   16.5055 5.306e-05 ***
> > > Do            4    6.8     1.7   10.5468 2.431e-08 ***
> > > Pr           15 5007.4   333.8 2081.8439 < 2.2e-16 ***
> > > Ba:Ti         2    3.2     1.6    9.8510 5.904e-05 ***
> > > Ba:Do         7    2.8     0.4    2.5054  0.014943 *  
> > > Ba:Pr        30   80.6     2.7   16.7585 < 2.2e-16 ***
> > > Ti:Do         4    8.7     2.2   13.5982 9.537e-11 ***
> > > Ti:Pr        15    2.4     0.2    1.0017  0.450876    
> > > Do:Pr        60   10.2     0.2    1.0594  0.358551    
> > > Ba:Ti:Do      7    1.4     0.2    1.2064  0.296415    
> > > Ba:Ti:Pr     30    5.6     0.2    1.1563  0.259184    
> > > Ba:Do:Pr    105   14.2     0.1    0.8445  0.862262    
> > > Ti:Do:Pr     60   14.8     0.2    1.5367  0.006713 ** 
> > > Ba:Ti:Do:Pr 105   15.8     0.2    0.9382  0.653134    
> > > Ba:Ti:Do:Ar  56   26.4     0.5    2.9434 2.904e-11 ***
> > > Residuals   840  134.7     0.2                        
> > > 
> > > The corresponding SAS program from my colleague is:
> > > 
> > > proc glm data = "the name of the data set";
> > >   class B T D A P;
> > >   model V = B T D P B*T B*D B*P T*D T*P D*P B*T*D B*T*P B*D*P
> > >             T*D*P B*T*D*P A(B*T*D);
> > > run;
> > > 
> > > Note: V = Va, B = Ba, T = Ti, D = Do, P = Pr, A = Ar of the R example.
> > 
> > -- 
> > Brian D. Ripley,                  ripley at stats.ox.ac.uk
> > Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> > University of Oxford,             Tel:  +44 1865 272861 (self)
> > 1 South Parks Road,                     +44 1865 272866 (PA)
> > Oxford OX1 3TG, UK                Fax:  +44 1865 272595