[BioC] Memory problem with rma()

cstrato cstrato at aon.at
Mon Feb 17 15:23:29 CET 2014


Dear Damian,

In principle you should not have a memory problem; however, 5500 exon 
arrays is quite a lot, so let me propose the following:

1. Do not run function rma() directly, but do it stepwise, i.e.:

data.bg.rma <- bgcorrect.rma(data.exon, ...)
data.qu.rma <- normalize.quantiles(data.bg.rma, ...)
data.mp.rma <- summarize.rma(data.qu.rma, ...)

You can find an example in script examples/script4exon.R (at line 750).
In this way you will not lose all your computation if anything goes 
wrong at one step.
You may also need to set 'add.data=FALSE' in summarize.rma(); 
otherwise all expression data will be imported, which can cause a memory 
problem, too.

Another way to run rma() stepwise is to use function express(), see the 
example in script examples/script4exon.R (at line 785). When using 
express() you could set parameter 'bufsize=4000', which reduces the 
basket size for each tree and thus consumes less RAM.
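
The benefit of running the steps separately can be illustrated with a minimal sketch. The helper run_step() and the stand-in step bodies below are hypothetical and not part of xps; in a real session each step body would be the corresponding xps call with your own arguments. If one step fails, the results of the earlier steps are still assigned and can be reused.

```r
# Hypothetical helper (not part of xps): run one named step, report its
# duration, and stop with a clear message if the step fails.
run_step <- function(name, step) {
  message("Starting step: ", name)
  started <- Sys.time()
  result <- tryCatch(step(), error = function(e) {
    stop("Step '", name, "' failed: ", conditionMessage(e), call. = FALSE)
  })
  elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
  message("Finished step: ", name, " (", round(elapsed, 1), " s)")
  result
}

# Stand-in steps so that the sketch runs on its own. In a real session the
# function bodies would be the xps calls, e.g.
#   function() bgcorrect.rma(data.exon, ...)
bg <- run_step("background", function() c(1, 2, 3))
qu <- run_step("normalize",  function() bg / sum(bg))
```

Because each result is assigned as soon as its step finishes, a failure in a later step leaves the earlier objects available in the session.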

2. I would suggest first using only 6 exon arrays to see if everything 
works fine. Then I would run 50 exon arrays in order to:
- see if there is an initial memory problem
- estimate how long each step will need if you run all 5500 arrays
   (approximately time x 110)
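
The factor of 110 is simply 5500 / 50. A small sketch of the extrapolation, assuming that run time grows linearly with the number of arrays (which may not hold exactly); the pilot timings are placeholder values, not measurements, and should be replaced with your own:

```r
# Extrapolate per-step run time from a pilot run, assuming time grows
# linearly with the number of arrays (an assumption, not a guarantee).
n_pilot <- 50
n_full  <- 5500
scale_factor <- n_full / n_pilot   # 5500 / 50 = 110

# Placeholder pilot timings in minutes; replace with measured values.
pilot_minutes <- c(background = 10, normalize = 5, summarize = 20)

estimated_hours <- pilot_minutes * scale_factor / 60
print(round(estimated_hours, 1))
```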

3. Please run everything with 'verbose=TRUE' so that you can see the 
output interactively. Maybe you could pipe the output to a text file.

4. Since you assume that there may be a memory problem: maybe you can 
run top (or something else) and check RSIZE/VSIZE from time to time. 
Maybe you can create a script which exports the memory consumption, e.g. 
every 10 min.
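
Such a script could look like the following sketch. It assumes a Unix-like system where the 'ps' utility is available; the function names log_memory() and monitor_memory() and the log file name are made up for illustration. Since an R session is busy while the analysis runs, monitor_memory() would be started in a second R session and given the process ID of the session doing the analysis.

```r
# Append one line with resident (RSS) and virtual (VSZ) memory, in KB, of
# the given process to a log file. Assumes 'ps' is available.
log_memory <- function(pid = Sys.getpid(), logfile = "memory_log.txt") {
  out <- system2("ps", c("-o", "rss=", "-o", "vsz=", "-p", pid),
                 stdout = TRUE)
  kb <- as.numeric(strsplit(trimws(out[1]), "[[:space:]]+")[[1]])
  line <- sprintf("%s pid=%s rss_kb=%.0f vsz_kb=%.0f",
                  format(Sys.time(), "%Y-%m-%d %H:%M:%S"), pid, kb[1], kb[2])
  cat(line, "\n", sep = "", file = logfile, append = TRUE)
  invisible(kb)
}

# Sample every 'interval_sec' seconds; 1008 samples at 600 s is one week.
monitor_memory <- function(pid, interval_sec = 600, n_samples = 1008,
                           logfile = "memory_log.txt") {
  for (i in seq_len(n_samples)) {
    log_memory(pid, logfile)
    Sys.sleep(interval_sec)
  }
}

# Single sample of the current R session, written to a temporary file.
kb <- log_memory(logfile = tempfile())
```

In the analysis session, Sys.getpid() gives the process ID to pass to monitor_memory() in the second session.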

5. I am not sure if running the code on a cluster is a good idea.
Do you run your code on a node which is used exclusively for this 
purpose?
My suggestion would be to run your code on a machine where nothing else 
is running, since I assume that for 5500 exon arrays you will need at 
least one week (but see point 2).

(Note: In 2009 a customer was running 23000 HG-U133_Plus_2 arrays on a 
machine, and with his help I could eliminate (hopefully) all memory 
problems, some of which appeared only after 2000 arrays. In his case 
memory consumption initially increased to 7.8 GB, but after solving the 
memory problems it remained at 3.0 GB.)

Best regards,
Christian
_._._._._._._._._._._._._._._._._._
C.h.r.i.s.t.i.a.n   S.t.r.a.t.o.w.a
V.i.e.n.n.a           A.u.s.t.r.i.a
e.m.a.i.l:        cstrato at aon.at
_._._._._._._._._._._._._._._._._._



On 2/16/14 8:07 PM, Damian Plichta [guest] wrote:
> Hi,
>
> I am running rma() to correct, normalize and summarize a batch of ca. 5500 arrays. I have currently a memory limit of 8gb and the procedures exceeds that. I am guessing that it breaks at the background correction step. I investigated the temporary directory and it's only file called tmp_310151_rbg.root that was modified (size of that file is 16gb).  I attached the code below.
>
> I tried the latest ROOT version and the one recommended at bioconductor (root_v5.34.14,root_v5.34.05).
>
> Any idea why is there the memory issue?
>
> scheme.HuEx <- import.exon.scheme(
> 		filename = "Scheme_HuEx-1_0v2r2_hg19",
> 		layoutfile = "affyHuExome_design/HuEx-1_0-st-v2.r2.clf",
> 		schemefile = "affyHuExome_design/HuEx-1_0-st-v2.r2.pgf",
> 		probeset = "affyHuExome_design/HuEx-1_0-st-v2.na33.1.hg19.probeset.csv",
> 		transcript = "affyHuExome_design/HuEx-1_0-st-v2.na33.1.hg19.transcript.csv")
>
> scheme.HuEx <- root.scheme("Scheme_HuEx-1_0v2r2_hg19.root")
>
> data.HuEx <- import.data(
> 		scheme.HuEx,
> 		filename = "fhsCEL",
> 		filedir = "normalizationXPS/",
> 		celdir = "expression_CEL_raw/"
> 		)
>
> data.HuEx <- root.data(scheme.HuEx, rootfile="fhsCEL_cel.root") 		
> 		
> rma.HuEx.transcript <- rma(data.HuEx, filename="HuEx_RMAquantile",
> 		filedir="normalizationXPS",
> 		tmpdir = "normalizationXPS/tmpDir",
> 		add.data=FALSE, background="antigenomic", normalize=TRUE,
> 		option="transcript", exonlevel="core")
>
>
>   -- output of sessionInfo():
>
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=C                 LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] xps_1.22.2
>
> loaded via a namespace (and not attached):
> [1] tools_3.0.2
>
> --
> Sent via the guest posting facility at bioconductor.org.
>


