[BioC] programming problem: running many ANOVAs

Mon Apr 19 18:51:36 CEST 2004

Hello,

I've a R/bioconductor programming question. For each gene in a matrix (emat) generated by exprs() I'm running an anova based on a goven model. The p-values for the anova and the coefficients from the lm for each gene are stored in a list and are returned.

A typical emat contains 12,000 rows and 40 colums or so. The probelm is that it runs about 10 minutes (P4 2.8GHz under linux with 2GB RAM) to complete.

I wonder if there is a more effifient way to program this routine (running the 12,000 lm's and ANOVAs). Is there a way to get rid of the generation of the data.frame that I pass to the lm?

I call my routine like this (dose and batch are two factors):

> dif.04h <- arnova(emat.04h, list(dose, batch), Value ~ dose + batch)
[1] "1000"           "genes done ..."
[1] "2000"           "genes done ..."
[1] "3000"           "genes done ..."
[1] "4000"           "genes done ..."
[1] "5000"           "genes done ..."
[1] "6000"           "genes done ..."
[1] "7000"           "genes done ..."
[1] "8000"           "genes done ..."
[1] "9000"           "genes done ..."
[1] "10000"          "genes done ..."
[1] "11000"          "genes done ..."
[1] "12000"          "genes done ..."

here's the routine:

arnova <- function(emat, factors, model, contr=NULL)
{
      genes <- rownames(emat) # gene names
      l <- length(genes) # number of genes
      difflabels <- attr(terms(model),"term.labels") # factors from model
	diff <- list() # stores an anova(lm) result
      coef <- list() # stores an summary(lm)

      ### loop over all rows (genes) and calculate the lm & anova,
	### the p-values and coefficients for each factor are stored
	for ( i in 1:l ) {

          ### the input data frame for this gene
          d <- data.frame(Value = emat[i, ], factors)
     	    fit <- lm(model, data = d, contrasts=contr)
          genecoef <- summary(fit)$coefficients[-1,4] # coefficients
          genediff <- as.vector(na.omit(anova(fit)$'Pr(>F)')) # anova p-values
          names(genediff) <- difflabels # the factor names
          diff[[genes[i]]] <- genediff
          coef[[genes[i]]] <- genecoef

          ###print progress
          if ( !(i%%1000) ) { print(c(i, 'genes done ...')) }
	}

	list(anova=diff, coefs=coef)
}

A result look slike this:

> dif.04h$'anova'[1:3]
$"100001_at"
     dose     batch 
0.5245583 0.2333894 

$"100002_at"
        dose        batch 
9.635027e-01 3.193364e-05 

$"100003_at"
        dose        batch 
7.374520e-01 4.639076e-06 

> dif.04h$'coef'[1:3]
$"100001_at"
dose010mM dose025mM dose050mM dose100mM  batchOLD  batchPRG 
0.6895026 0.2869963 0.7480160 0.4776234 0.1204537 0.1837618 

$"100002_at"
   dose010mM    dose025mM    dose050mM    dose100mM     batchOLD     batchPRG 
6.192953e-01 8.952266e-01 8.113848e-01 9.412216e-01 1.515069e-05 4.904563e-04 

$"100003_at"
   dose010mM    dose025mM    dose050mM    dose100mM     batchOLD     batchPRG 
9.574872e-01 8.735150e-01 4.739319e-01 2.870358e-01 9.635836e-06 1.549058e-05 

Any suggestions to speed this up?

	thanks a lot for your help,
	+regards,

	Arne

--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com