# [BioC] programming problem: running many ANOVAs

Arne.Muller at aventis.com Arne.Muller at aventis.com
Mon Apr 19 18:51:36 CEST 2004

```Hello,

I've a R/bioconductor programming question. For each gene in a matrix (emat) generated by exprs() I'm running an anova based on a goven model. The p-values for the anova and the coefficients from the lm for each gene are stored in a list and are returned.

A typical emat contains 12,000 rows and 40 colums or so. The probelm is that it runs about 10 minutes (P4 2.8GHz under linux with 2GB RAM) to complete.

I wonder if there is a more effifient way to program this routine (running the 12,000 lm's and ANOVAs). Is there a way to get rid of the generation of the data.frame that I pass to the lm?

I call my routine like this (dose and batch are two factors):

> dif.04h <- arnova(emat.04h, list(dose, batch), Value ~ dose + batch)
[1] "1000"           "genes done ..."
[1] "2000"           "genes done ..."
[1] "3000"           "genes done ..."
[1] "4000"           "genes done ..."
[1] "5000"           "genes done ..."
[1] "6000"           "genes done ..."
[1] "7000"           "genes done ..."
[1] "8000"           "genes done ..."
[1] "9000"           "genes done ..."
[1] "10000"          "genes done ..."
[1] "11000"          "genes done ..."
[1] "12000"          "genes done ..."

here's the routine:

arnova <- function(emat, factors, model, contr=NULL)
{
genes <- rownames(emat) # gene names
l <- length(genes) # number of genes
difflabels <- attr(terms(model),"term.labels") # factors from model
diff <- list() # stores an anova(lm) result
coef <- list() # stores an summary(lm)

### loop over all rows (genes) and calculate the lm & anova,
### the p-values and coefficients for each factor are stored
for ( i in 1:l ) {

### the input data frame for this gene
d <- data.frame(Value = emat[i, ], factors)
fit <- lm(model, data = d, contrasts=contr)
genecoef <- summary(fit)\$coefficients[-1,4] # coefficients
genediff <- as.vector(na.omit(anova(fit)\$'Pr(>F)')) # anova p-values
names(genediff) <- difflabels # the factor names
diff[[genes[i]]] <- genediff
coef[[genes[i]]] <- genecoef

###print progress
if ( !(i%%1000) ) { print(c(i, 'genes done ...')) }
}

list(anova=diff, coefs=coef)
}

A result look slike this:

> dif.04h\$'anova'[1:3]
\$"100001_at"
dose     batch
0.5245583 0.2333894

\$"100002_at"
dose        batch
9.635027e-01 3.193364e-05

\$"100003_at"
dose        batch
7.374520e-01 4.639076e-06

> dif.04h\$'coef'[1:3]
\$"100001_at"
dose010mM dose025mM dose050mM dose100mM  batchOLD  batchPRG
0.6895026 0.2869963 0.7480160 0.4776234 0.1204537 0.1837618

\$"100002_at"
dose010mM    dose025mM    dose050mM    dose100mM     batchOLD     batchPRG
6.192953e-01 8.952266e-01 8.113848e-01 9.412216e-01 1.515069e-05 4.904563e-04

\$"100003_at"
dose010mM    dose025mM    dose050mM    dose100mM     batchOLD     batchPRG
9.574872e-01 8.735150e-01 4.739319e-01 2.870358e-01 9.635836e-06 1.549058e-05

Any suggestions to speed this up?

thanks a lot for your help,
+regards,

Arne

--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com

```