[R] help with memory greedy storage

Arne.Muller@aventis.com Arne.Muller at aventis.com
Sat May 15 01:44:55 CEST 2004


Hello,

I have a problem with a self-written routine that takes a lot of memory (>1.2 GB). Maybe you can suggest some improvements; I'm pretty sure that my implementation is not optimal ...

I'm fitting many linear models and storing coefficients, anova p-values and everything else I need in separate lists, which are finally returned together in one list (a list of lists).

The input is a matrix with 84 columns and >100,000 rows. The routine probeDf below creates a data frame that assigns the 84 columns to the different factors, not just for one row but for several rows, depending on what emat[which(rows == g),] returns, and a new factor ('probe') is generated. For 16 rows this results in a 1344 by 6 data frame.
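To make the row selection concrete, here is a small sketch on made-up data. The gene names, sample names and dimensions are hypothetical and much smaller than the real matrix; only the indexing expression is taken from the routine below.

```r
# Toy stand-in for emat: 5 probes (rows) x 4 samples (columns).
# Several rows share the same row name, one name per gene.
emat <- matrix(seq_len(20), nrow = 5,
               dimnames = list(c("geneA", "geneA", "geneB", "geneA", "geneB"),
                               paste("s", 1:4, sep = "")))
rows <- rownames(emat)

# All rows (probes) belonging to one gene
sub <- emat[which(rows == "geneA"), ]
dim(sub)
```

Note that if a gene matched only a single row, this indexing would return a plain vector rather than a matrix unless drop = FALSE is added.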

Example data frame returned by probeDf:

       Value batch time  dose array probe
1   2.317804   NEW  24h 000mM     1     1
2   2.495390   NEW  24h 000mM     2     1
3   2.412247   NEW  24h 000mM     3     1
...
144 8.851469   OLD  04h 100mM    60     2
145 8.801430   PRG  24h 000mM    61     2
146 8.308224   PRG  24h 000mM    62     2
...

This data frame is not the problem, since it gets generated on the fly per gene and is discarded afterwards (it just takes some time to generate).

Here comes the problematic routine:

### emat: matrix, facts: factors passed on to probeDf, model: formula for lm,
### contr: optional contrasts
probe.fit <- function(emat, facts, model, contr=NULL)
{
        rows <- rownames(emat)
        genes <- unique(rows)
        l <- length(genes) 
        ### generate proper labels (names) for the anova p-values
        difflabels <- attr(terms(model),"term.labels")
        aov    <- list() # anova p-values for factors + interactions
        coef   <- list() # lm coefficients
        coefp  <- list() # p-values for coefficients
        rsq    <- list() # R-squared of fit
        fitted <- list() # fitted values
        value  <- list() # orig. values (used with fitted to get residuals)

        for ( g in genes ) { # loop over >12,000 genes
          ### g is the name that identifies 14 to 16 rows in emat
          ### d is the data frame for the lm
          d <- probeDf(emat[which(rows == g),], facts)
          fit <- lm(model, data = d, contrasts=contr)
          fit.sum <- summary(fit)
          aov[[g]]   <- as.vector(na.omit(anova(fit)$'Pr(>F)'))
          names(aov[[g]]) <- difflabels
          coef[[g]]   <- coef(fit)[-1]
          coefp[[g]]  <- coef(fit.sum)[-1,'Pr(>|t|)']
          rsq[[g]]    <- fit.sum$'r.squared'
          value[[g]] <- d$Value
          fitted[[g]] <- fitted(fit)
        }
        list(aov=aov, coefs=coef, coefp=coefp, rsq=rsq,
             fitted=fitted, values=value)
}

### create a data frame from a matrix (usually 16 rows and 84 columns)
### and a list of factors. Basically this repeats the factors 16 times
### (once for each row in the matrix). This results in a data frame with
### 84*16 rows and as many columns as there are factors + 2 (probe factor
### + value to be modeled later)
probeDf <- function(emat, facts) {
    df <- NULL
    n <- 1
    nsamp <- ncol(emat)
    for ( i in 1:nrow(emat) ) {
        values <- c(t(emat[i,]))
        df.new <- data.frame(Value = values, facts, probe = rep(n, nsamp))
        n <- n + 1
        if ( !is.null(df) ) {
           df <- rbind(df, df.new)
        } else {
           df <- df.new
        }
    }
    df$probe <- as.factor(df$probe)
    df
}
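
Since probeDf grows the data frame with rbind once per row, a loop-free variant might cut the time spent building it. The following is only a sketch, not a tested replacement: the name probeDf2 is hypothetical, and it assumes facts can be coerced to a data frame with exactly one row per column of emat.

```r
probeDf2 <- function(emat, facts) {
    facts  <- as.data.frame(facts)  # accept a list of factors or a data frame
    nprobe <- nrow(emat)
    nsamp  <- ncol(emat)
    stopifnot(nrow(facts) == nsamp)
    df <- data.frame(
        # t(emat) is nsamp x nprobe; as.vector() reads it column by column,
        # i.e. all samples of probe 1, then all samples of probe 2, ...
        Value = as.vector(t(emat)),
        # repeat the whole block of factor rows once per probe
        facts[rep(seq_len(nsamp), times = nprobe), , drop = FALSE],
        probe = factor(rep(seq_len(nprobe), each = nsamp))
    )
    rownames(df) <- NULL            # drop the "1.1", "2.1", ... row names
    df
}
```

The row order is meant to match probeDf: all samples of the first probe, then all samples of the second, and so on.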

If I remove coef, coefp, value and fitted from the loop in probe.fit, the memory usage is moderate.

The problem is that each of the 12,000 genes contributes 148 coefficients (the model contains quite a few factors) and as many p-values, and the fitted and value vectors are >1300 elements long. I couldn't find a more compact form of storage that is still easy to explore afterwards.
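
As a rough back-of-the-envelope check, the raw numeric payload of these four lists can be estimated from the figures above. This sketch assumes 8-byte doubles and the upper bound of 1344 values per gene (84 samples x 16 probes). It ignores per-vector overhead and the names attributes that coef(), fitted() and the p-value vector carry, so the real footprint will be larger.

```r
n.genes  <- 12000  # genes in the loop
n.coef   <- 148    # coefficients (and p-values) per gene
n.fitted <- 1344   # fitted/original values per gene, upper bound
bytes    <- 8      # size of one double

# Two lists of each kind: coef + coefp, fitted + value
coef.mb   <- 2 * n.genes * n.coef   * bytes / 2^20
fitted.mb <- 2 * n.genes * n.fitted * bytes / 2^20

round(c(coef = coef.mb, fitted = fitted.mb, total = coef.mb + fitted.mb), 1)
```

On these assumptions the fitted and value vectors dominate by roughly a factor of nine over the coefficients and their p-values.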

Suggestions on how to get this done more efficiently (in terms of memory) are gratefully received.

     kind regards,

     Arne

--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com



