[R] R versus SAS: lm performance

Prof Brian Ripley ripley at stats.ox.ac.uk
Tue May 11 09:07:41 CEST 2004


The way to time things in R is system.time().
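
As a minimal sketch (the data are simulated, and the names d, f, g and the sizes are made up for illustration, not taken from the original problem), timing a single fit might look like this:

```r
# Simulated data: 1344 rows, a 16-level and a 3-level factor (illustrative only).
set.seed(1)
n <- 1344
d <- data.frame(y = rnorm(n), f = gl(16, n / 16), g = gl(3, 1, n))

# system.time() evaluates the expression and reports user, system
# and elapsed time in seconds.
timing <- system.time(fit <- lm(y ~ f * g, data = d))
print(timing)
```

Unlike taking differences of Sys.time(), this separates CPU time from elapsed wall-clock time.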

Without knowing much more about your problem we can only guess where R is 
spending the time.  But you can find out by profiling -- see `Writing R 
Extensions'.
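
A rough sketch of profiling with Rprof(), again on simulated data (the model, the repetition count and the sampling interval are assumptions chosen only so that the profiler collects some samples):

```r
# Simulated data (illustrative only).
set.seed(1)
n <- 1344
d <- data.frame(y = rnorm(n), f = gl(16, n / 16), g = gl(3, 1, n))

# Write profiling samples to a temporary file while the fits run.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.005)
for (i in 1:200) fit <- lm(y ~ f * g, data = d)
Rprof(NULL)  # stop profiling

# Summarise where the time went, by function.
prof <- summaryRprof(prof_file)
print(head(prof$by.total))
```

The by.total table lists each function with the time spent in it and in everything it calls, which should indicate whether the time goes into building the model frame and design matrix or into the fit itself.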

If you want multiple fits with the same design matrix (do you?) you 
could look at the code of lm and call lm.fit repeatedly yourself.
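
One possible shape for that, assuming the responses are the columns of a matrix Y and the factors are the same for every response (all names and sizes here are hypothetical):

```r
# Simulated design and five hypothetical responses (e.g. one column per gene).
set.seed(1)
n <- 1344
design <- data.frame(f = gl(16, n / 16), g = gl(3, 1, n))
Y <- matrix(rnorm(n * 5), nrow = n)

# Build the design matrix once, outside the loop.
X <- model.matrix(~ f * g, data = design)

# Fit each response against the same X with lm.fit().
fits <- lapply(seq_len(ncol(Y)), function(j) lm.fit(X, Y[, j]))

# Residual sum of squares for each fit.
rss <- sapply(fits, function(fit) sum(fit$residuals^2))
print(rss)
```

This avoids re-parsing the formula and rebuilding the model frame for every response, though each lm.fit() call still does its own QR decomposition. lm.fit() returns a plain list rather than an "lm" object, so anova() cannot be applied to it directly.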

On Mon, 10 May 2004 Arne.Muller at aventis.com wrote:

> Hello,
> 
> A colleague of mine has compared the runtime of a linear model + anova in SAS and S+. He got the same results, but SAS took a bit more than a minute whereas S+ took 17 minutes. I've tried it in R (1.9.0) and it took 15 min. None of the machines ran out of memory, and I assume that all machines have similar hardware, but the S+ and SAS machines are on Windows whereas the R machine is Redhat Linux 7.2.
> 
> My question is whether I'm doing something wrong (technically) in calling the lm routine, or (if not), how I can optimize the call to lm or even use an alternative to lm. I'd like to run about 12,000 of these models in R (for a gene expression experiment - one model per gene, which would take far too long).
> 
> I've run the following code in R (and S+):
> 
> > options(contrasts=c('contr.helmert', 'contr.poly'))
> 
> The 1st column is the value to be modeled, and the others are factors.
> 
> > names(df.gene1data) <- c("Va", "Ba", "Ti", "Do", "Ar", "Pr")
> > df[c(1:2,1343:1344),]
>            Va    Do  Ti  Ba Ar    Pr
> 1    2.317804 000mM 24h NEW  1     1
> 2    2.495390 000mM 24h NEW  2     1
> 8315 2.979641 025mM 04h PRG 83    16
> 8415 4.505787 000mM 04h PRG 84    16
> 
> This is a data frame with 1344 rows.
> 
> x <- Sys.time();
> wlm <- lm(Va ~
> Ba+Ti+Do+Pr+Ba:Ti+Ba:Do+Ba:Pr+Ti:Do+Ti:Pr+Do:Pr+Ba:Ti:Do+Ba:Ti:Pr+Ba:Do:Pr+Ti:Do:Pr+Ba:Ti:Do:Pr+(Ba:Ti:Do)/Ar, data=df, singular=T);
> difftime(Sys.time(), x)
> 
> Time difference of 15.33333 mins
> 
> > anova(wlm)
> Analysis of Variance Table
> 
> Response: Va
>              Df Sum Sq Mean Sq   F value    Pr(>F)    
> Ba            2    0.1     0.1    0.4262  0.653133    
> Ti            1    2.6     2.6   16.5055 5.306e-05 ***
> Do            4    6.8     1.7   10.5468 2.431e-08 ***
> Pr           15 5007.4   333.8 2081.8439 < 2.2e-16 ***
> Ba:Ti         2    3.2     1.6    9.8510 5.904e-05 ***
> Ba:Do         7    2.8     0.4    2.5054  0.014943 *  
> Ba:Pr        30   80.6     2.7   16.7585 < 2.2e-16 ***
> Ti:Do         4    8.7     2.2   13.5982 9.537e-11 ***
> Ti:Pr        15    2.4     0.2    1.0017  0.450876    
> Do:Pr        60   10.2     0.2    1.0594  0.358551    
> Ba:Ti:Do      7    1.4     0.2    1.2064  0.296415    
> Ba:Ti:Pr     30    5.6     0.2    1.1563  0.259184    
> Ba:Do:Pr    105   14.2     0.1    0.8445  0.862262    
> Ti:Do:Pr     60   14.8     0.2    1.5367  0.006713 ** 
> Ba:Ti:Do:Pr 105   15.8     0.2    0.9382  0.653134    
> Ba:Ti:Do:Ar  56   26.4     0.5    2.9434 2.904e-11 ***
> Residuals   840  134.7     0.2                        
> 
> The corresponding SAS program from my colleague is:
> 
> proc glm data = "the name of the data set";
> 
> class B T D A P;
> 
> model V = B T D P B*T B*D B*P T*D T*P D*P B*T*D B*T*P B*D*P T*D*P B*T*D*P A(B*T*D);
> 
> run;
> 
> Note: V = Va, B = Ba, T = Ti, D = Do, P = Pr, A = Ar in the R example.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



