[BioC] segfault ReadAffy cause 'memory not mapped'

Thu Aug 22 00:18:45 CEST 2013

On 8/1/13 5:33 PM, Loraine, Ann wrote:
> Hello,
> 
> I am trying to process several thousand CEL files using the ReadAffy command.
> 
> The machine has 96 Gb RAM.
> 
> However I get this error:
> 
> > expr=ReadAffy(filenames=d.uniq$cel,celfile.path='CEL',sampleNames=d.uniq$gsm,compress=T)
> 
>  *** caught segfault ***
> address 0x7fc79b4b1048, cause 'memory not mapped'
> 

I also have a problem loading many (3750) Affy hgu133plus2 arrays into
an AffyBatch. I was able to do this with ~2900 arrays, but it has failed
since I added ~800 more. At right around 16 GiB allocated, I get a
segfault like:

 *** caught segfault ***
address 0x2aa6b6067048, cause 'memory not mapped'

Traceback:
 1: .Call("read_abatch", filenames, rm.mask, rm.outliers, rm.extra,     ref.cdfName, dim.intensity[c(1, 2)], verbose, PACKAGE = "affyio")
 2: read.affybatch(filenames = as.character(pdata$Filename))

I noticed this when trying to run justGCRMA() or justRMA(), which both
threw the same error. The traceback pointed to read.affybatch() so I
tried just doing that directly.

I first checked, in a loop, that each file could be read on its own, and
they all came in OK individually. However, if I try to read them all at
once, I keep getting errors at right around 16 GiB allocated (to R).
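
A per-file check of this kind could look roughly like the sketch below. The
names check_files and reader are hypothetical, not part of affy; with affy
loaded, reader could be something like function(f) read.affybatch(filenames = f).
Note that tryCatch() only traps ordinary R errors, so a segfault inside
compiled code would still take down the session.

```r
# Try each file on its own and record which ones load without an R error.
# `reader` is a placeholder for whatever function reads a single file.
check_files <- function(files, reader) {
  ok <- vapply(files, function(f) {
    tryCatch({
      reader(f)   # result is discarded; only success or failure is kept
      TRUE
    }, error = function(e) FALSE)
  }, logical(1))
  names(ok) <- files
  ok
}
```

Calling names(ok)[!ok] on the result would then list any files that failed.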

My laptop runs Ubuntu Linux 12.04 with 32 GiB RAM, and I also tried this
on a 256 GiB RAM machine running RHEL5. Both had R version 3.0.1. On the
Ubuntu machine I was using affy v1.39.2, and on the RHEL5 machine it was
affy v1.38.1.

In both cases the segfault came at about 16 GiB allocated (PBS epilogue
shows 15.41 GiB memory used when running on the 256 GiB machine via
batch submission). I also ran via an interactive PBS session on the 256
GiB server and the same error happened.

I had considered that it could be a limit of the signed int indices for R
vectors/arrays, but I thought that had changed as of R v3.0. Also, I
thought that would give the error 'too many elements specified' rather
than a 'memory not mapped' segfault. I've certainly allocated close to
64 GiB to R doing other things with these data; I'm just not sure whether
any individual vectors were that large.
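
A back-of-the-envelope check of the index-limit idea is sketched below. It
assumes 1164 x 1164 cells per hgu133plus2 CEL file and intensities stored
as 8-byte doubles; neither figure is stated above, so treat both as
assumptions. Under them, 2^31 doubles occupy exactly 16 GiB, which lines up
with where the segfault appears. That is only suggestive: the same limit
would be crossed well before ~2900 arrays, so the arithmetic alone does not
explain why the smaller set loaded.

```r
# Assumed chip layout for hgu133plus2: 1164 x 1164 cells per CEL file.
cells_per_array <- 1164 * 1164
n_arrays <- 3750

# Work in doubles so the arithmetic itself cannot overflow.
n_elements <- as.numeric(cells_per_array) * n_arrays
int_max <- .Machine$integer.max        # 2^31 - 1

n_elements > int_max                   # TRUE: more elements than a signed 32-bit index can address
floor(int_max / cells_per_array)       # 1584 whole arrays fit below that limit
(int_max + 1) * 8 / 2^30               # 16: GiB occupied by 2^31 doubles
```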

I know there are ways to get around this. For example, I ran fRMA on
subsets (split it into 8 subsets) and then combined the expression sets.
Of course trying to run fRMA on the whole set at once failed as well.
The fRMA-summarized data just 'feel' a bit different though, and I've
been working with many of these arrays for a while now. (I know
'feelings' aren't statistics, so please don't scorch me on that!) Also,
I've seen the suggestions like aroma.* for large datasets.
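
The subset-and-combine workaround can be sketched generically as follows.
The helpers split_chunks and process_in_chunks, and the summarize_chunk
argument, are hypothetical names for illustration only. The sketch assumes
summarize_chunk returns a matrix with one column per file and the same rows
in the same order for every chunk, which is what makes a plain cbind() safe.

```r
# Split a vector of file names into k roughly equal, contiguous chunks.
split_chunks <- function(files, k) {
  grp <- ceiling(seq_along(files) * k / length(files))
  unname(split(files, grp))
}

# Apply `summarize_chunk` (a placeholder) to each chunk, then column-bind
# the per-chunk matrices into one matrix covering all files.
process_in_chunks <- function(files, k, summarize_chunk) {
  parts <- lapply(split_chunks(files, k), summarize_chunk)
  do.call(cbind, parts)
}
```

This only gives the same answer as a single run when each array is
summarized independently of the others in its batch; methods that normalize
across the whole batch would give chunk-dependent results.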

However, this seems like something that should be possible using the
affy package, given how cheap large-memory systems are these days. I'm
expecting a 0.5 TiB RAM workstation this fall! Also, if there is some
kind of limitation in the implementation, I think it's worth finding and
helping to get fixed. Any thoughts on whether there is a limitation in the
affy package, in my gcc compiler, or something else? I would love for this
to be able to use all my RAM.

Below I have included the R output from one of my attempts.

Thanks!

Brian Peyser

$ R --vanilla
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(affy)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:stats’:

    xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
    duplicated, eval, Filter, Find, get, intersect, lapply, Map,
    mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rep.int, rownames, sapply, setdiff,
    sort, table, tapply, union, unique, unlist

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

> data <- read.affybatch(filenames=list.files(pattern=".CEL$", ignore.case=TRUE))
 *** caught segfault ***
address 0x7f60734e7048, cause 'memory not mapped'

Traceback:
 1: .Call("read_abatch", filenames, rm.mask, rm.outliers, rm.extra,     ref.cdfName, dim.intensity[c(1, 2)], verbose, PACKAGE = "affyio")
 2: read.affybatch(filenames = as.character(pdata$Filename))

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 

-- 
Brian D. Peyser PhD
Special Assistant to the Associate Director
Office of the Associate Director
Developmental Therapeutics Program
Division of Cancer Treatment and Diagnosis
National Cancer Institute
National Institutes of Health
301-524-5587 (mobile)