[R] help with memory greedy storage

Sat May 15 02:56:20 CEST 2004

A very rough estimate: it looks like you're trying to store about 38 million
numbers in the returned lists.  Do you need all of the model results in memory
at the end, or are you just trying to generate the output and look at it
later?
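
As a rough cross-check of that order of magnitude, here is the arithmetic
using only the figures quoted in the original message below (12,000 genes,
148 coefficients, up to 84 * 16 observations per gene). It ignores the anova
p-values, R-squared values and all list and names overhead, so it is a lower
bound rather than an exact count.

```r
# Figures taken from the original message; 16 probes per gene is the
# upper bound mentioned there, so this is only an approximation.
n_genes  <- 12000
n_coef   <- 148         # coefficients per gene, plus as many p-values
n_obs    <- 84 * 16     # fitted values per gene, plus as many original values

per_gene <- 2 * n_coef + 2 * n_obs   # numbers stored per gene
total    <- n_genes * per_gene       # numbers stored overall

total              # count of doubles
total * 8 / 2^20   # MiB needed for the raw doubles alone (8 bytes each)
```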

Perhaps you could save intermediate results to file, that is, create a
separate file for each gene model, or after each set of n gene models.
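
A minimal sketch of that idea, assuming a recent R where saveRDS() and
readRDS() are available. The names fit_in_chunks, fit_one, chunk_size and
out_dir are made up for illustration; fit_one stands in for whatever is done
per gene inside the loop.

```r
# Hypothetical helper: process genes in chunks and write each chunk's
# results to its own file, so only one chunk is held in memory at a time.
fit_in_chunks <- function(genes, fit_one, chunk_size = 500, out_dir = tempdir()) {
  chunks <- split(genes, ceiling(seq_along(genes) / chunk_size))
  files  <- character(length(chunks))
  for (i in seq_along(chunks)) {
    res <- lapply(chunks[[i]], fit_one)    # per-gene results for this chunk
    names(res) <- chunks[[i]]
    files[i] <- file.path(out_dir, sprintf("genes_chunk_%04d.rds", i))
    saveRDS(res, files[i])
    rm(res)                                # drop the chunk before the next one
  }
  files
}

# Toy usage with a dummy per-gene function
genes <- paste0("gene", 1:25)
files <- fit_in_chunks(genes, function(g) list(rsq = runif(1)), chunk_size = 10)
first <- readRDS(files[1])                 # reload one chunk for inspection
```

Each file can then be read back on its own when you want to look at a
particular set of genes.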

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch]On Behalf Of
Arne.Muller at aventis.com
Sent: Friday, May 14, 2004 19:45
To: r-help at stat.math.ethz.ch
Subject: [R] help with memory greedy storage

Hello,

I have a problem with a self-written routine that takes a lot of memory
(>1.2 GB). Maybe you can suggest some enhancements; I'm pretty sure that my
implementation is not optimal ...

I'm fitting many linear models and storing the coefficients, anova p-values
and everything else I need in separate lists, which are finally returned
together in one list (a list of lists).

The input is a matrix with 84 columns and >100,000 rows. The routine probeDf
below creates a data frame that assigns the 84 columns to the different
factors, not just for one row but for all rows selected by
emat[which(rows == g),], and a new factor ('probe') is generated to tell the
rows apart. With 16 rows per gene this results in a 1344 by 6 data frame.

Example data frame returned by probeDf:

       Value batch time  dose array probe
1   2.317804   NEW  24h 000mM     1     1
2   2.495390   NEW  24h 000mM     2     1
3   2.412247   NEW  24h 000mM     3     1
...
144 8.851469   OLD  04h 100mM    60     2
145 8.801430   PRG  24h 000mM    61     2
146 8.308224   PRG  24h 000mM    62     2
...

This data frame is not the problem, since it gets generated on the fly per
gene and is discarded afterwards (it just takes some time to generate).

Here comes the problematic routine:

### emat: matrix, factors: factors passed on to probeDf,
### model: formula for lm, contr: optional contrasts
probe.fit <- function(emat, factors, model, contr=NULL)
{
        rows <- rownames(emat)
        genes <- unique(rows)
        l <- length(genes)
        ### generate proper labels (names) for the anova p-values
        difflabels <- attr(terms(model),"term.labels")
        aov    <- list() # anova p-values for factors + interactions
        coef   <- list() # lm coefficients
        coefp  <- list() # p-values for coefficients
        rsq    <- list() # R-squared of fit
        fitted <- list() # fitted values
        value  <- list() # orig. values (used with fitted to get residuals)

        for ( g in genes ) { # loop over >12,000 genes
          ### g is the name that identifies 14 to 16 rows in emat
          ### d is the data frame for the lm
          d <- probeDf(emat[which(rows == g),], factors)
          fit <- lm(model, data = d, contrasts=contr)
          fit.sum <- summary(fit)
          aov[[g]]   <- as.vector(na.omit(anova(fit)$'Pr(>F)'))
          names(aov[[g]]) <- difflabels
          coef[[g]]   <- coef(fit)[-1]
          coefp[[g]]  <- coef(fit.sum)[-1,'Pr(>|t|)']
          rsq[[g]]    <- fit.sum$'r.squared'
          value[[g]] <- d$Value
          fitted[[g]] <- fitted(fit)
        }
        ### return all per-gene results as one list of lists
        list(aov=aov, coefs=coef, coefp=coefp, rsq=rsq,
             fitted=fitted, values=value)
}

### create a data frame from a matrix (usually 16 rows and 84 columns)
### and a list of factors. Basically this repeats the factors 16 times
### (once for each row in the matrix). This results in a data frame with
### 84*16 rows and as many columns as there are factors + 2 (probe factor +
### value to be modeled later)
probeDf <- function(emat, facts) {
    df <- NULL
    n <- 1
    nsamp <- ncol(emat)
    for ( i in 1:nrow(emat) ) {
        values <- c(t(emat[i,]))
        df.new <- data.frame(Value = values, facts, probe = rep(n, nsamp))
        n <- n + 1
        if ( !is.null(df) ) {
           df <- rbind(df, df.new)
        } else {
           df <- df.new
        }
    }
    df$probe <- as.factor(df$probe)
    df
}
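
Since generating this data frame is described as slow, here is a sketch of a
variant that avoids growing the data frame with rbind inside a loop. The name
probeDf2 is made up, and the sketch assumes that facts is a data frame with
one row per column of emat and that emat is a matrix with at least two rows.

```r
# Hypothetical loop-free variant of probeDf (assumes facts is a data frame
# with ncol(emat) rows).
probeDf2 <- function(emat, facts) {
  nprobe <- nrow(emat)
  nsamp  <- ncol(emat)
  out <- data.frame(
    # t() puts each matrix row into a column, so as.vector() yields
    # row 1's values, then row 2's, and so on
    Value = as.vector(t(emat)),
    # repeat the whole block of factor rows once per probe
    facts[rep(seq_len(nsamp), times = nprobe), , drop = FALSE],
    probe = factor(rep(seq_len(nprobe), each = nsamp))
  )
  rownames(out) <- NULL   # discard the row names created by the repeated index
  out
}

# Toy usage: 3 probes measured on 4 samples
emat  <- matrix(seq_len(12), nrow = 3, byrow = TRUE)
facts <- data.frame(dose = c("a", "b", "c", "d"))
d <- probeDf2(emat, facts)
```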

If I remove coef, coefp, value and fitted from the loop in probe.fit the
memory usage is moderate.

The problem is that each of the 12,000 genes contributes 148 coefficients
(the model contains quite a few factors) and as many p-values, and the fitted
and value vectors are >1300 elements long. I couldn't find a more compact form
of storage that is still easy to explore afterwards.
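
One possible more compact layout, sketched under assumptions: keep one
preallocated numeric matrix per quantity, with one row per gene, instead of
one named vector per gene in a list. The names are then stored once in the
dimnames rather than on every vector. Everything below (toy sizes, simulated
data, object names such as coef_mat) is made up for illustration; with 14 to
16 probes per gene, the fitted-value matrix would need as many columns as the
longest gene and NA padding for the shorter ones.

```r
set.seed(1)
n_genes    <- 5                      # toy sizes, not the real data
n_obs      <- 8
genes      <- paste0("g", seq_len(n_genes))
coef_names <- c("x1", "x2", "x3")    # known up front from the model formula

# Preallocate one matrix per quantity; rows are genes
coef_mat   <- matrix(NA_real_, n_genes, length(coef_names),
                     dimnames = list(genes, coef_names))
coefp_mat  <- coef_mat
fitted_mat <- matrix(NA_real_, n_genes, n_obs, dimnames = list(genes, NULL))
rsq        <- setNames(rep(NA_real_, n_genes), genes)

for (g in genes) {
  # simulated data standing in for probeDf(...)
  d <- data.frame(y = rnorm(n_obs), x1 = rnorm(n_obs),
                  x2 = rnorm(n_obs), x3 = rnorm(n_obs))
  fit     <- lm(y ~ x1 + x2 + x3, data = d)
  fit.sum <- summary(fit)
  # index by name so that a dropped (aliased) coefficient is left as NA
  coef_mat[g, ]   <- coef(fit)[coef_names]
  coefp_mat[g, ]  <- coef(fit.sum)[, "Pr(>|t|)"][coef_names]
  rsq[g]          <- fit.sum$r.squared
  fitted_mat[g, ] <- fitted(fit)
}
```

The matrices remain easy to explore, e.g. coef_mat["g2", ] for one gene or
coefp_mat[, "x1"] for one term across all genes. The original values are
already in emat, so they may not need to be stored a second time.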

Suggestions on how to get this done more efficiently (in terms of memory)
are gratefully received.

     kind regards,

     Arne

--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com
