[R] Memory leak with character arrays?

Wed Jan 17 22:54:31 CET 2007

Hi -

When I try to read a text file into a labeled (named) character array, 
the memory footprint of R grows to 4 gigs or more.  I've seen this 
behavior on Mac OS X, Linux for AMD_64 and X86_64, with R versions 
2.4, 2.4 and 2.2, respectively.  So it would seem that this is 
independent of platform and R version.

The file that I'm reading contains the upstream regions of the yeast 
genome, with each upstream region labeled using a FASTA header, i.e.:

    FASTA header for gene 1
    upstream region.....
    .....
    ....
    FASTA header for gene 2
    upstream....
    ....

The script I use (code below) opens the file, looks for a FASTA 
header, and parses the header for the gene name.  It then reads the 
following lines, which contain the upstream region, and adds that region 
as an item to the character array, using the gene name as the name of 
the item.  It then continues with the next gene.

Each upstream region (the text to be added) is 550 bases (characters) 
long.  With ~6000 genes in the file I'm reading in, this would be 550 * 
6000 = 3.3 million characters, i.e. ~3.3 Megs at one byte per character 
(if we're using ascii chars), or ~26 megabits.
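
Spelling the estimate out, as a rough check of the arithmetic.  The 
factor of 8 converts bytes to bits, so the two are kept separate here; 
one byte per character is an assumption that holds for plain ascii text:

```r
# Back-of-the-envelope size of the raw sequence text.
n.genes    <- 6000                  # approximate number of genes in the file
region.len <- 550                   # bases (characters) per upstream region

n.chars <- n.genes * region.len     # total number of characters
n.bytes <- n.chars * 1              # ASSUMPTION: one byte per ascii character
n.bits  <- n.bytes * 8              # eight bits per byte

cat( "characters:", n.chars, "\n" )
cat( "megabytes: ", n.bytes / 1e6, "\n" )
cat( "megabits:  ", n.bits / 1e6, "\n" )
```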

I realize that character arrays/vectors will have a larger memory 
footprint b/c this is a named array and the text most likely isn't stored 
as plain ascii, but 4 gigs and up seems a bit excessive.  Or is it?
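
One way to check the overhead guess above is to build a synthetic named 
character vector of the same shape and ask R for its size.  This is only 
a minimal sketch: random A/C/G/T text and made-up gene names (GENE0001, 
...) stand in for the real data, and upstream.test and random.region are 
illustrative names that are not part of my script:

```r
# Build a synthetic named character vector shaped like the real data.
set.seed( 1 )
n.genes    <- 6000
region.len <- 550

# ASSUMPTION: random bases are a fair stand-in for real upstream regions.
random.region <- function( len ) {
  paste( sample( c( "A", "C", "G", "T" ), len, replace=TRUE ), collapse="" )
}

upstream.test <- sapply( seq_len( n.genes ),
                         function( i ) random.region( region.len ) )
names( upstream.test ) <- sprintf( "GENE%04d", seq_len( n.genes ) )  # hypothetical names

# Size of the finished vector, in megabytes (2^20 bytes).
size.mb <- as.numeric( object.size( upstream.test ) ) / 2^20
cat( "object.size:", round( size.mb, 1 ), "MB\n" )
```

Note that object.size() only reports the size of the finished vector; it 
says nothing about temporary copies made while a vector is being grown 
inside a loop.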

For example, this is the output of top at the point where R has 
processed around 5000 genes:

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
     4969 waltman   18   0 6746m 3.4g  920 D  2.7 88.2  19:09.19 R

Is this expected behavior?  Can anyone recommend a less memory-intensive 
way to store this data?  The relevant code that reads in the file follows:

     ....code....
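          # Read the whole (gzipped) file into memory, one element per line.
          # valid.gene.regexp and cat.new() are not shown in this excerpt.
          lines <- readLines( gzfile( seqs.fname ) )

          n.seqs <- 0

          upstream <- gene.names <- character()
          syn <- character( 0 )
          gene.start <- gene.end <- integer()
          gene <- seq <- ""

          for ( i in seq_along( lines ) ) {
            line <- lines[ i ]
            if ( line == "" ) next
            if ( substr( line, 1, 1 ) == ">" ) {
              # FASTA header: first store the sequence of the previous gene
              if ( seq != "" && gene != "" ) upstream[ gene ] <- toupper( seq )
              splitted <- strsplit( line, "\t" )[[ 1 ]]
              splitted <- strsplit( splitted[ 1 ], ";\\ " )[[ 1 ]]
              # gene name = first field without the leading ">"
              gene <- toupper( substr( splitted[ 1 ], 2, nchar( splitted[ 1 ] ) ) )
              syn <- splitted[ 2 ]   # NA (never NULL) if the header has no synonym
              if ( ! is.na( syn ) &&
                  length( grep( valid.gene.regexp, gene, perl=TRUE ) ) == 0 &&
                  length( grep( valid.gene.regexp, syn, perl=TRUE ) ) == 1 ) {
                gene <- syn
              } else if ( length( grep( valid.gene.regexp, gene, perl=TRUE, ignore.case=TRUE ) ) == 0 &&
                          length( grep( valid.gene.regexp, syn, perl=TRUE, ignore.case=TRUE ) ) == 0 ) {
                # Neither name is valid: skip this record.  Reset gene and seq so
                # that its sequence lines are not appended to the previous gene's
                # sequence and stored under an invalid name.
                gene <- seq <- ""
                next
              }
              gene.start[ gene ] <- as.integer( splitted[ 9 ] )
              gene.end[ gene ] <- as.integer( splitted[ 10 ] )
              # progress message every 100 sequences
              if ( n.seqs %% 100 == 0 ) cat.new( n.seqs, gene, "|", syn, "| length=", nchar( seq ),
                                                 gene.end[ gene ] - gene.start[ gene ] + 1, "\n" )
              if ( ! is.na( syn ) && syn != "" ) gene.names[ gene ] <- syn
              else gene.names[ gene ] <- toupper( gene )
              n.seqs <- n.seqs + 1
              seq <- ""
            } else {
              # sequence line: append to the current gene's sequence
              seq <- paste( seq, line, sep="" )
            }
          }
          # store the sequence of the last gene in the file
          if ( seq != "" && gene != "" ) upstream[ gene ] <- toupper( seq )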

     ....code....

Thanks,

Peter Waltman