[BioC] beadarray: Running BASH for 120 sections

Gavin Koh gavin.koh at gmail.com
Thu Apr 7 13:42:20 CEST 2011


Dear Mike,

I think it is a memory problem. There is no problem with arrays 53 or
54: it is just that at that point, BASH has eaten up all the RAM.

Here's what I've done:

BASHoutput_1 <- BASH(beadData, array=1:12)
save(BASHoutput_1, file="BASHoutput_1.RData")
rm(BASHoutput_1)
BASHoutput_2 <- BASH(beadData, array=13:24)
save(BASHoutput_2, file="BASHoutput_2.RData")
rm(BASHoutput_2)
BASHoutput_3 <- BASH(beadData, array=25:36)
save(BASHoutput_3, file="BASHoutput_3.RData")
...etc.

load(file="BASHoutput_1.RData")
beadData <- setWeights(beadData, BASHoutput_1$wts, array=1:12)
rm(BASHoutput_1)
load(file="BASHoutput_2.RData")
beadData <- setWeights(beadData, BASHoutput_2$wts, array=13:24)
rm(BASHoutput_2)
load(file="BASHoutput_3.RData")
beadData <- setWeights(beadData, BASHoutput_3$wts, array=25:36)
rm(BASHoutput_3)
...etc.
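The repeated blocks above could be collapsed into two loops, one to run BASH in batches and one to apply the saved weights. A sketch only, assuming 120 sections processed in chips of 12 (the batch size and file names are illustrative):

```r
# Sketch: run BASH one 12-section chip at a time, saving each result to
# disk so that only one batch's output is held in memory at once.
n.batches <- 10  # 120 sections / 12 sections per chip
for (i in seq_len(n.batches)) {
  sections <- ((i - 1) * 12 + 1):(i * 12)
  BASHoutput <- BASH(beadData, array = sections)
  save(BASHoutput, file = paste0("BASHoutput_", i, ".RData"))
  rm(BASHoutput); gc()
}

# Second pass: reload each batch and apply its weights to beadData.
for (i in seq_len(n.batches)) {
  sections <- ((i - 1) * 12 + 1):(i * 12)
  load(paste0("BASHoutput_", i, ".RData"))  # restores the BASHoutput object
  beadData <- setWeights(beadData, BASHoutput$wts, array = sections)
  rm(BASHoutput)
}
```

Note that `load()` restores the object under the name it was saved with, which is why the same variable name (`BASHoutput`) is reused in each iteration.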

And this works fine.

I now have a different problem.
The summarize function has read in the .sdf files for each of the chips,
but it seems to misunderstand what is going on: I have
120 sections (60 samples), but after summarizing I end up with only 6
sets of data, and the array IDs assigned are for the first chip only.

bead.mean <- function(x) mean(x, na.rm=TRUE)
bead.sd <- function(x) sd(x, na.rm=TRUE)
greenchannel <- new("illuminaChannel",
  logGreenChannelTransform,
  illuminaOutlierMethod, bead.mean, bead.sd, "G"
)
melioid <- summarize(beadData,
  list(greenchannel),
  useSampleFac=TRUE
)

When I check the melioid object, I find that it only has 6 samples in it:
> dim(melioid)
Features  Samples Channels
   49576        6        1

Does this mean that summarize has to be run on individual chips as well?
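If the sample-sheet information is the problem, one workaround may be to supply the section-to-sample mapping explicitly via the `sampleFac` argument of summarize(), rather than letting it be derived from the .sdf files. A sketch only; the argument name is as listed on the beadarray summarize man page, so check it against your installed version:

```r
# Sketch: on a HumanWG-6 chip each sample spans 2 sections, so the 120
# sections should collapse to 60 samples. Supply that mapping explicitly
# instead of relying on the .sdf-derived sample factor.
sampleFac <- rep(seq_len(60), each = 2)  # sections 1:2 -> sample 1, 3:4 -> sample 2, ...
melioid <- summarize(beadData,
  list(greenchannel),
  useSampleFac = TRUE,
  sampleFac = sampleFac
)
```

After this, dim(melioid) should report 60 samples rather than 6, if the collapse really is caused by an incomplete sample factor.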

Gavin.

On 6 April 2011 13:13, Gavin Koh <gavin.koh at gmail.com> wrote:
> Thanks. Is it possible to setWeights() for parts of my beadData object
> instead? That would involve a lot less rewriting... G.
>
> On 6 April 2011 12:30, Mike Smith <grimbough at gmail.com> wrote:
>> Hi Gavin,
>>
>> I'm sending this to the BioC list as well so there's a record for anyone
>> else looking for advice.
>>
>> The code you've suggested isn't quite what I had in mind.  In this case, I
>> was suggesting the combine() function should be used on the beadData objects
>> after the weights have been set.  Hopefully the pseudo code below will be
>> understandable (anything in quotes needs to be specified for your data).
>>
>> So on our cluster I would run multiple R sessions.  Then I would read a
>> subset of the arrays in each session, like so:
>>
>> For session 1:
>> beadData1 <- readIllumina("sections 1 : N")
>> For session 2:
>> beadData2 <- readIllumina("sections N+1 : 2N")
>> etc...
>>
>> I find sticking to a BeadChip per R session is convenient, mostly due to the
>> folder structure produced by BeadScan.
>>
>> Then in each R session you can run something like the following:
>>
>> ..."Any pre-BASH processing you like"....
>> BASHoutput <- BASH(beadData, array = 1:N)
>> beadData <- setWeights(beadData, BASHoutput$wts, array = 1:N)
>> save(beadData, file = "beadData session N")
>> quit...
>>
>> I'd then open a new R session and load the various beadData objects.
>> You can then combine them with:
>> beadData <- combine(beadData1, beadData2)
>> If you have more than two you'll probably need a loop; I don't think our
>> combine function takes more than two at a time, but it's worth checking the
>> man page for that.
>>
>> You should then have one large beadData object with all the arrays and BASH
>> weights.  As an alternative, you could skip the combining step, keep the
>> separate R sessions open, and do any further processing right up to the
>> summarization step.  I think I'm right in saying that none of the QC, QA etc.
>> requires information between chips, so each can be processed independently.
>>
>> That's probably all a bit messy, but feel free to ask any more questions.
>>
>> Mike
>>
>>
>> On Tue, Apr 5, 2011 at 11:25 PM, Gavin Koh <gavin.koh at gmail.com> wrote:
>>>
>>> Dear Mike,
>>> Thanks for replying so quickly.
>>> R exits and throws me back to the system prompt.
>>>
>>> I'll try running array 53 alone first to see if that is the problem.
>>> If that is not the problem, then I would like to try breaking it up
>>> into batches as you suggest.
>>> I've not used the bioBase combine() function before, but looking at
>>> the help file, I would think that I could do
>>>
>>> bashOutput1 <- BASH(beadData, array=1:12)
>>> bashOutput2 <- BASH(beadData, array=13:24)
>>> .
>>> .
>>> .
>>> bashOutput <- combine(bashOutput1, bashOutput2,...bashOutputn)
>>> beadData <- setWeights(beadData, bashOutput$wts, array=1:n)
>>>
>>> Am I right?
>>>
>>> Thanks,
>>>
>>> Gavin.
>>>
>>> On 5 April 2011 13:30, Mike Smith <grimbough at gmail.com> wrote:
>>> > Hi Gavin,
>>> >
>>> > I'm afraid that particular error means nothing to me.  Does R exit too,
>>> > or
>>> > does BASH stop and return you to an interactive session?
>>> >
>>> > I found this very old post on R exit codes
>>> > (http://tolstoy.newcastle.edu.au/R/help/02b/3168.html), which may be
>>> > relevant but I'm speculating at the moment.
>>> >
>>> > Is there anything particularly unusual with the 53rd array?  If you try
>>> > to
>>> > BASH that array in isolation e.g. BASHoutput <- BASH(beadData, array=53)
>>> > does it proceed ok?
>>> >
>>> > If it is a memory problem then it may be worth waiting for the next
>>> > Bioconductor release in about a week.  I recently discovered a memory
>>> > leak
>>> > and a small bug that could cause a segfault in the BASH C code, which
>>> > I've
>>> > patched in the developmental version.  I conducted a test this morning
>>> > with
>>> > 4 HumanV3 sections and the memory leak was about 100MB, which isn't
>>> > ideal,
>>> > but with a 16GB limit I'd have thought you'd have enough head room not
>>> > to be
>>> > affected by it.
>>> >
>>> > Personally I've never tried to BASH so many sections in one go, but
>>> > there's no reason it shouldn't work (memory and time permitting).  What
>>> > we tend to do
>>> > is read a smaller number of sections (say a single BeadChip) into an R
>>> > session and process each one separately.  We then save each separate
>>> > object
>>> > once it's been processed, load them all into a new R session and use the
>>> > combine() function to create a single beadLevelData object.  That way it
>>> > can
>>> > be done in a sort of coarse-grained parallel fashion.
>>> >
>>> > As far as making it parallel in a more friendly way, that's something
>>> > we're
>>> > working on, but it's not an imminent release.
>>> >
>>> > I hope that's some help,
>>> >
>>> >
>>> > On Mon, Apr 4, 2011 at 9:51 PM, Gavin Koh <gavin.koh at gmail.com> wrote:
>>> >>
>>> >> I have 60 samples which were run on an Illumina HumanWG-6 v3.0
>>> >> Expression BeadChip (so 120 sections) and I am doing the
>>> >> pre-processing using beadarray.
>>> >>
>>> >> I am trying to generate spatial masks using BASH(). I have
>>> >> successfully run a smaller analysis (one slide of 12 sections) on my
>>> >> MacBook OSX Snow Leopard with 4Gb RAM using beadarray 2.7.
>>> >>
>>> >> The command I used to call BASH was:
>>> >> BASHoutput <- BASH(beadData, array=1:n)
>>> >>
>>> >> I am running the full analysis (120 sections) on a computing cluster
>>> >> (lustre). I have only requested a single core with 16Gb RAM, because I
>>> >> don't know how to get BASH() to multithread (although in theory it
>>> >> ought to be possible; it is a repetitive process, after all). I cannot
>>> >> get the script past 53 sections without BASH() terminating with exit
>>> >> code "user code 2". It doesn't matter whether I run it interactively or
>>> >> calling R CMD BATCH. I don't know what the exit code means, so I don't
>>> >> know how to fix it. I don't think it is out of memory, because lustre
>>> >> has other codes for reporting out-of-memory and R usually reports
>>> >> out-of-memory errors as "cannot allocate vector of size..."? Also, the
>>> >> previous time it ran out of memory (when I tried 6 Gb RAM), it was
>>> >> lustre that terminated the process.
>>> >>
>>> >> I don't know if the problem is that BASH() cannot handle so many
>>> >> sections. If that is in fact the problem, then there are two solutions
>>> >> I can think of: 1. get BASH() to run multithreaded, or 2. run BASH()
>>> >> on selected sections only.
>>> >>
>>> >> On inspection of the pseudoimages, I can see there are only two
>>> >> sections of the 120 with obvious spatial defects (they look like
>>> >> scratches). Is it possible to find outliers on the other sections
>>> >> using the usual (faster) method (>3 MADs) and then just use BASH() for
>>> >> the two defective sections only? Or is there a tool to just
>>> >> draw the masks myself?
>>> >>
>>> >> Thanks in advance,
>>> >>
>>> >> Gavin
>>> >>
>>> >> sessionInfo() reports:
>>> >> R version 2.12.0 (2010-10-15)
>>> >> Platform: x86_64-unknown-linux-gnu (64-bit)
>>> >>
>>> >> locale:
>>> >>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>>> >>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>>> >>  [5] LC_MONETARY=C              LC_MESSAGES=C
>>> >>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
>>> >>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> >> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>> >>
>>> >> attached base packages:
>>> >> [1] stats     graphics  grDevices utils     datasets  methods   base
>>> >>
>>> >> other attached packages:
>>> >> [1] beadarray_2.0.6 Biobase_2.10.0
>>> >>
>>> >> loaded via a namespace (and not attached):
>>> >> [1] limma_3.6.6
>>> >>
>>> >> --
>>> >> Hofstadter's Law: It always takes longer than you expect, even when
>>> >> you take into account Hofstadter's Law.
>>> >> —Douglas Hofstadter (in Gödel, Escher, Bach, 1979)
>>> >>
>>> >> _______________________________________________
>>> >> Bioconductor mailing list
>>> >> Bioconductor at r-project.org
>>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> >> Search the archives:
>>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> >
>>> >
>>> > --
>>> > Mike Smith
>>> > PhD Student
>>> > Computational Biology Group
>>> > Cambridge University
>>> >
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>



