[BioC] Maximum number of CEL files for ReadAffy() in Affy package.

Ben Bolstad bmb at bmbolstad.com
Wed Jul 23 02:46:14 CEST 2008


I am going to answer this question only with regard to RMA. For dChip I
refer you to the dChip software; any BioC implementation is likely to be
inefficient, potentially inaccurate and almost certainly unsynchronised
with the current algorithm.  Furthermore, I'm only going to speak with
respect to those solutions for which I am fully or partly responsible
(with all due respect to the authors of aroma.affymetrix, xps etc. who
have their own fine large scale data solutions).

Any solution directly involving AffyBatch objects will be the most
memory hungry.  This is the ReadAffy()/rma() route. All intensities, PM,
MM or otherwise are read into RAM.
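
As a rough back-of-envelope check on why this route fails with the
"allocMatrix: too many elements specified" error even on a machine with
plenty of RAM: an R 2.x matrix cannot hold more than 2^31 - 1 elements,
and an AffyBatch stores every cell of every array in one matrix. The
sketch below assumes HG-U133 Plus 2.0 arrays (1164 x 1164 cells); the
original question does not state the chip type, so treat the numbers as
illustrative. The names cells_per_array, max_elements and
elements_needed are made up for this example.

```r
# Assumption: HG-U133 Plus 2.0 layout, 1164 x 1164 cells per CEL file.
cells_per_array <- 1164 * 1164            # intensities stored per array
max_elements    <- .Machine$integer.max   # 2^31 - 1, matrix element cap in R 2.x

# Total matrix elements an AffyBatch would need for n_arrays arrays.
elements_needed <- function(n_arrays) cells_per_array * n_arrays

elements_needed(241)  < max_elements      # TRUE: under the cap
elements_needed(2035) < max_elements      # FALSE: over the cap
floor(max_elements / cells_per_array)     # largest array count under the cap
```

Under that assumption the limit is hit by the element count, not by the
amount of installed memory.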

The next most memory efficient route is justRMA(). This reads only the
PM intensities directly into RAM, forms no AffyBatch, but does the
correct processing to get RMA expression values.

BufferedMatrixMethods offers BufferedMatrix.justRMA() which will keep
only a minimal amount of probe intensity data in active memory.
Otherwise it acts pretty much like the normal justRMA().
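
A minimal sketch of how the two justRMA() routes might be called on a
directory of CEL files. The wrapper name rma_from_dir and the cel_dir
argument are hypothetical, and the sketch assumes both functions accept
a filenames= argument and that affy's list.celfiles() is available;
check the help pages of your installed versions before relying on it.

```r
# Hypothetical helper: run RMA on all CEL files in `cel_dir` without
# building an AffyBatch. Set buffered = TRUE for the BufferedMatrix route.
rma_from_dir <- function(cel_dir, buffered = FALSE) {
  library(affy)                                   # justRMA(), list.celfiles()
  cel_files <- list.celfiles(cel_dir, full.names = TRUE)
  if (length(cel_files) == 0) {
    stop("no CEL files found in ", cel_dir)
  }
  if (buffered) {
    library(BufferedMatrixMethods)                # BufferedMatrix.justRMA()
    BufferedMatrix.justRMA(filenames = cel_files)
  } else {
    justRMA(filenames = cel_files)
  }
}
```

For example, eset <- rma_from_dir(".", buffered = TRUE) would process
the CEL files in the current directory; exprs(eset) should then give the
matrix of expression values.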

RMAExpress offers a point and click GUI application which also keeps a
minimal amount of probe intensity data in memory. But it is not BioC or
R based so I don't go out of my way to advertise it to this mailing list
(apologies). I have had a user report processing over 10,000 arrays
using it.

Some runtime testing (up to 2500 HGU-133 Plus 2.0 arrays) of
BufferedMatrix.justRMA and RMAExpress is here:
http://bmbolstad.com/software/BufferedMatrixMethodsTests/index.html

Multiple processors/cores will not help you very much with RAM usage,
though they could help runtime performance for rma()/justRMA().
This will only be true if you've built the package from source on a
system with pthreads support and the environment variable R_THREADS is
set. See http://bmbolstad.com/software/preprocessCoreTests/index.html
for simulations of the quantile normalization part of the code using
multiple threads on a dual core machine. I think on the current release
versions justRMA() has threaded parsing, background correction and
normalization; threaded summarization may only be in the devel branch.
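
A small sketch of setting and checking R_THREADS from within R. The
value "2" is only an illustration for a dual core machine, and whether
a value set inside an already running session is picked up by the
threaded code is an assumption here; exporting the variable in the
shell before starting R is the more conservative choice.

```r
# Illustrative value only: request two threads.
Sys.setenv(R_THREADS = "2")

# Confirm what the current R session sees.
n_threads <- Sys.getenv("R_THREADS")
n_threads
```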

Best,

Ben


On Tue, 2008-07-22 at 16:04 -0700, Hailong Cui wrote:
> Dear all,
> 
> First, I apologize for the mass email. I've been reading manuals, googling,
> searching the archive of the mailing list, but still cannot find an exact
> answer to my problem.
> 
> (1) Question: Can a large number of CEL files cause an overflow for the
> function ReadAffy() in the affy packages? Is there any way to fix this?
> Other options seem to be other software RMAExpress and dChip in WindowsXP.
> Any suggestions?
> 
> (2) Background: What I am trying to do is to read in all the CEL files in
> the directory to create an AffyBatch object, so that I can use functions in
> the affy package. To be more specific, I want to do RMA, dChip normalization
> and get MAplots. In my workstation (48 64-bit CPUs, 500Gb memory),
> ReadAffy() worked fine for 241 CEL files, but when I moved on to 2,035 CEL
> files, it failed and kept showing the error message below. The number of
> rows for the CEL file is roughly 50k. On the bright side, I tried justRMA()
> and got the expression values in the text format.
> 
> > R
> > library(affy)
> > Data <- ReadAffy()
> Error in read.affybatch(filenames = l$filenames, phenoData
> = l$phenoData,  :
>   allocMatrix: too many elements specified
> 
> 
> FYI, below is the session information on my workstation.
> 
> > sessionInfo()
> R version 2.7.1 (2008-06-23)
> ia64-unknown-linux-gnu
> 
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] tools     stats     graphics  grDevices utils     datasets  methods
> [8] base
> 
> other attached packages:
>  [1] geneplotter_1.18.0          annotate_1.18.0
>  [3] xtable_1.5-2                AnnotationDbi_1.2.2
>  [5] RSQLite_0.6-9               DBI_0.2-4
>  [7] lattice_0.17-8              BufferedMatrixMethods_1.4.0
>  [9] BufferedMatrix_1.4.0        affy_1.18.2
> [11] preprocessCore_1.2.0        affyio_1.8.0
> [13] Biobase_2.0.1
> 
> loaded via a namespace (and not attached):
> [1] grid_2.7.1         KernSmooth_2.22-22 RColorBrewer_1.0-2
> 
> 
> 
> 
> Thank you so much for reading this and I would appreciate your reply.
> 
> Hailong
> 
>


