[BioC] beadarray: Running BASH for 120 sections

Gavin Koh gavin.koh at gmail.com
Thu Apr 7 20:02:56 CEST 2011


Dear Mike,

That does indeed work! Thank you.

Gavin.

On 7 April 2011 17:32, Mike Smith <grimbough at gmail.com> wrote:
> Hi Gavin,
> Thanks for testing those arrays.  I'm surprised BASH uses quite so much
> memory.  It does a lot of processing, but the only significant output is a
> single vector of weights per section.
> Your current problem is due to us not foreseeing how people would read the
> data in.  In our lab we always read a chip at a time and beadarray tries to
> infer the SampleGroup from the .sdf file and the section names.  We didn't
> anticipate having multiple chips being read in one go and I suspect if you
> print beadData@sectionData$SampleGroup it will look something like:
> "A" "A" "B" "B" "C" "C" "D" "D" "E" "E" "F" "F" "A" "A" "B".....
> When you come to summarise, everything with the same letter will be
> summarised together, giving you 6 sets of results, each drawn from 20
> sections.  For now you can supply an alternative sampleFac to the
> summarize() function like this:
> melioid <- summarize(beadData,
>  list(greenchannel),
>  useSampleFac=TRUE,
>  sampleFac = rep(1:60, each = 2)
> )
> This should summarize each sample's two sections together and give you back
> an ExpressionSetIllumina with 60 samples.
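>
> (A quick sanity check afterwards: dim(melioid) should report 60 samples
> rather than 6.)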
> I've modified the devel version of the package (2.1.18) to build a more
> robust SampleGroup using the ChipID as well, so this shouldn't be a problem
> in the future.
> Mike
> On Thu, Apr 7, 2011 at 12:42 PM, Gavin Koh <gavin.koh at gmail.com> wrote:
>>
>> Dear Mike
>>
>> I think it is a memory problem. There is no problem with arrays 53 or
>> 54: it is just that at that point, BASH has eaten up all the RAM.
>>
>> Here's what I've done:
>>
>> BASHoutput_1 <- BASH(beadData, array=1:12)
>> save(BASHoutput_1, file="BASHoutput_1.RData")
>> rm(BASHoutput_1)
>> BASHoutput_2 <- BASH(beadData, array=13:24)
>> save(BASHoutput_2, file="BASHoutput_2.RData")
>> rm(BASHoutput_2)
>> BASHoutput_3 <- BASH(beadData, array=25:36)
>> save(BASHoutput_3, file="BASHoutput_3.RData")
>> ...etc.
>>
>> load(file="BASHoutput_1.RData")
>> beadData <- setWeights(beadData, BASHoutput_1$wts, array=1:12)
>> rm(BASHoutput_1)
>> load(file="BASHoutput_2.RData")
>> beadData <- setWeights(beadData, BASHoutput_2$wts, array=13:24)
>> rm(BASHoutput_2)
>> load(file="BASHoutput_3.RData")
>> beadData <- setWeights(beadData, BASHoutput_3$wts, array=25:36)
>> rm(BASHoutput_3)
>> ...etc.
>>
>> And this works fine.
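>>
>> For the record, the same batching can be written as a loop. A sketch,
>> assuming ten batches of 12 sections:
>>
>> for (i in 1:10) {
>>   arrays <- ((i - 1) * 12 + 1):(i * 12)
>>   BASHoutput <- BASH(beadData, array = arrays)
>>   save(BASHoutput, file = sprintf("BASHoutput_%d.RData", i))
>>   rm(BASHoutput)
>>   gc()  # encourage R to release the memory just freed
>> }
>> for (i in 1:10) {
>>   arrays <- ((i - 1) * 12 + 1):(i * 12)
>>   load(sprintf("BASHoutput_%d.RData", i))  # restores BASHoutput
>>   beadData <- setWeights(beadData, BASHoutput$wts, array = arrays)
>>   rm(BASHoutput)
>> }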
>>
>> I now have a different problem.
>> The summarize function has read in the .sdf files for each of the chips,
>> but it seems to misunderstand what is going on: I have 120 sections (60
>> samples), but after summarizing, I end up with only 6 sets of data, and
>> the array IDs assigned are those of the first chip only.
>>
>> bead.mean <- function(x) mean(x, na.rm=TRUE)
>> bead.sd <- function(x) sd(x, na.rm=TRUE)
>> greenchannel <- new("illuminaChannel",
>>  logGreenChannelTransform,  # per-bead transformation (log2 green channel)
>>  illuminaOutlierMethod,     # outlier-removal method
>>  bead.mean, bead.sd,        # per-probe summary statistics
>>  "G"                        # channel name
>> )
>> melioid <- summarize(beadData,
>>  list(greenchannel),
>>  useSampleFac=TRUE
>> )
>>
>> When I check the melioid object, I find that it only has 6 samples in it:
>> > dim(melioid)
>> Features  Samples Channels
>>   49576        6        1
>>
>> Does this mean that summarize has to be run on individual chips as well?
>>
>> Gavin.
>>
>> On 6 April 2011 13:13, Gavin Koh <gavin.koh at gmail.com> wrote:
>> > Thanks. Is it possible to setWeights() for parts of my beadData object
>> > instead? That would involve a lot less rewriting... G.
>> >
>> > On 6 April 2011 12:30, Mike Smith <grimbough at gmail.com> wrote:
>> >> Hi Gavin,
>> >>
>> >> I'm sending this to the BioC list as well so there's a record for
>> >> anyone else looking for advice.
>> >>
>> >> The code you've suggested isn't quite what I had in mind.  In this case,
>> >> I was suggesting the combine() function should be used on the beadData
>> >> objects after the weights have been set.  Hopefully the pseudo code
>> >> below will be understandable (anything in quotes needs to be specified
>> >> for your data).
>> >>
>> >> So on our cluster I would run multiple R sessions.  Then I would read a
>> >> subset of the arrays in each session, so:
>> >>
>> >> For session 1:
>> >> beadData1 <- readIllumina("sections 1 : N")
>> >> For session 2:
>> >> beadData2 <- readIllumina("sections N+1 : 2N")
>> >> etc...
>> >>
>> >> I find sticking to a BeadChip per R session convenient, mostly due to
>> >> the folder structure produced by BeadScan.
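>> >>
>> >> For instance, something along these lines (a sketch; "chip1" is a
>> >> placeholder for one chip's BeadScan output directory, and ?readIllumina
>> >> lists the full set of arguments):
>> >>
>> >> beadData1 <- readIllumina(dir = "chip1", useImages = FALSE)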
>> >>
>> >> Then in each R session you can run something like the following:
>> >>
>> >> ..."Any pre-BASH processing you like"....
>> >> BASHoutput <- BASH(beadData, array = 1:N)
>> >> beadData <- setWeights(beadData, BASHoutput$wts, array = 1:N)
>> >> save(beadData, file = "beadData session N")
>> >> quit...
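>> >>
>> >> (Concretely, for the first chip that might be, for example:
>> >>
>> >> BASHoutput <- BASH(beadData, array = 1:12)
>> >> beadData <- setWeights(beadData, BASHoutput$wts, array = 1:12)
>> >> save(beadData, file = "beadData_chip1.RData")
>> >>
>> >> where "beadData_chip1.RData" is just an example file name.)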
>> >>
>> >> I'd then open a new R session and load the various beadData objects.
>> >> You can then combine them with:
>> >> beadData <- combine(beadData1, beadData2).
>> >> If you have more than two you'll probably need a loop; I don't think
>> >> our combine function takes more than two at a time, but it's worth
>> >> checking the man page for that.
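>> >>
>> >> Something like the following should work (an untested sketch, assuming
>> >> one object per chip, named beadData1 ... beadData10):
>> >>
>> >> beadData <- beadData1
>> >> for (i in 2:10) {
>> >>   beadData <- combine(beadData, get(sprintf("beadData%d", i)))
>> >> }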
>> >>
>> >> You should then have one large beadData object with all the arrays and
>> >> BASH weights.  As an alternative, you could skip the combining step,
>> >> keep the separate R sessions open, and do any further processing right
>> >> up to the summarization step.  I think I'm right in saying that none of
>> >> the QC, QA etc. requires information between chips, so each chip can be
>> >> processed independently.
>> >>
>> >> That's probably all a bit messy, but feel free to ask any more questions.
>> >>
>> >> Mike
>> >>
>> >>
>> >> On Tue, Apr 5, 2011 at 11:25 PM, Gavin Koh <gavin.koh at gmail.com> wrote:
>> >>>
>> >>> Dear Mike,
>> >>> Thanks for replying so quickly.
>> >>> R exits and throws me back to the system prompt.
>> >>>
>> >>> I'll try running array 53 alone first to see if that is the problem.
>> >>> If that is not the problem, then I would like to try breaking it up
>> >>> into batches as you suggest.
>> >>> I've not used the Biobase combine() function before, but looking at
>> >>> the help file, I would think that I could do
>> >>>
>> >>> bashOutput1 <- BASH(beadData, array=1:12)
>> >>> bashOutput2 <- BASH(beadData, array=13:24)
>> >>> .
>> >>> .
>> >>> .
>> >>> bashOutput <- combine(bashOutput1, bashOutput2,...bashOutputn)
>> >>> beadData <- setWeights(beadData, bashOutput$wts, array=1:n)
>> >>>
>> >>> Am I right?
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Gavin.
>> >>>
>> >>> On 5 April 2011 13:30, Mike Smith <grimbough at gmail.com> wrote:
>> >>> > Hi Gavin,
>> >>> >
>> >>> > I'm afraid that particular error means nothing to me.  Does R exit
>> >>> > too, or does BASH stop and return you to an interactive session?
>> >>> >
>> >>> > I found this very old post on R exit codes
>> >>> > (http://tolstoy.newcastle.edu.au/R/help/02b/3168.html), which may be
>> >>> > relevant but I'm speculating at the moment.
>> >>> >
>> >>> > Is there anything particularly unusual with the 53rd array?  If you
>> >>> > try to BASH that array in isolation, e.g. BASHoutput <- BASH(beadData,
>> >>> > array=53), does it proceed ok?
>> >>> >
>> >>> > If it is a memory problem then it may be worth waiting for the next
>> >>> > Bioconductor release in about a week.  I recently discovered a memory
>> >>> > leak and a small bug that could cause a segfault in the BASH C code,
>> >>> > which I've patched in the development version.  I conducted a test
>> >>> > this morning with 4 HumanV3 sections and the memory leak was about
>> >>> > 100MB, which isn't ideal, but with a 16GB limit I'd have thought
>> >>> > you'd have enough headroom not to be affected by it.
>> >>> >
>> >>> > Personally I've never tried to BASH so many sections in one go, but
>> >>> > there's no reason it shouldn't work (memory and time permitting).
>> >>> > What we tend to do is read a smaller number of sections (say a single
>> >>> > BeadChip) into an R session and process each separately.  We then
>> >>> > save each separate object once it's been processed, load them all
>> >>> > into a new R session, and use the combine() function to create a
>> >>> > single beadLevelData object.  That way it can be done in a sort of
>> >>> > coarse-grained parallel fashion.
>> >>> >
>> >>> > As far as making it parallel in a more user-friendly way, that's
>> >>> > something we're working on, but it won't be in an imminent release.
>> >>> >
>> >>> > I hope that's some help,
>> >>> >
>> >>> >
>> >>> > On Mon, Apr 4, 2011 at 9:51 PM, Gavin Koh <gavin.koh at gmail.com> wrote:
>> >>> >>
>> >>> >> I have 60 samples which were run on Illumina HumanWG-6 v3.0
>> >>> >> Expression BeadChips (so 120 sections) and I am doing the
>> >>> >> pre-processing using beadarray.
>> >>> >>
>> >>> >> I am trying to generate spatial masks using BASH().  I have
>> >>> >> successfully run a smaller analysis (one slide of 12 sections) on my
>> >>> >> MacBook (OS X Snow Leopard, 4 GB RAM) using beadarray 2.7.
>> >>> >>
>> >>> >> The command I used to call BASH was:
>> >>> >> BASHoutput <- BASH(beadData, array=1:n)
>> >>> >>
>> >>> >> I am running the full analysis (120 sections) on a computing cluster
>> >>> >> (lustre).  I have only requested a single core with 16 GB RAM,
>> >>> >> because I don't know how to get BASH() to multithread (although in
>> >>> >> theory it ought to be possible? it is a repetitive process after
>> >>> >> all).  I cannot get the script past 53 sections without BASH()
>> >>> >> terminating with exit code "user code 2".  It doesn't matter whether
>> >>> >> I am running it interactively or calling R CMD BATCH.  I don't know
>> >>> >> what the exit code means, so I don't know how to fix it.  I don't
>> >>> >> think it is out of memory, because lustre has other codes for
>> >>> >> reporting out-of-memory, and R usually reports out-of-memory errors
>> >>> >> as "cannot allocate vector of size..."?  Also, the previous time it
>> >>> >> ran out of memory (when I tried 6 GB RAM), it was lustre that
>> >>> >> terminated the process.
>> >>> >>
>> >>> >> I don't know if the problem is that BASH() cannot handle so many
>> >>> >> sections.  If that is in fact the problem, then there are two
>> >>> >> solutions I can think of: 1. get BASH() to run multithreaded, or
>> >>> >> 2. run BASH() on selected sections only.
>> >>> >>
>> >>> >> On inspection of the pseudoimages, I can see that only two sections
>> >>> >> of the 120 have obvious spatial defects (they look like scratches).
>> >>> >> Is it possible to find outliers on the other sections using the
>> >>> >> usual (faster) method (>3 MAD) and then use BASH() only for the two
>> >>> >> sections that are defective? Or is there a tool to let me just draw
>> >>> >> the masks myself?
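>> >>> >>
>> >>> >> (I.e. something along these lines, where 17 and 42 are hypothetical
>> >>> >> stand-ins for whichever two sections are defective:
>> >>> >>
>> >>> >> bad <- c(17, 42)  # the two defective sections (placeholders)
>> >>> >> BASHoutput <- BASH(beadData, array = bad)
>> >>> >> beadData <- setWeights(beadData, BASHoutput$wts, array = bad)
>> >>> >>
>> >>> >> leaving the remaining sections to the outlier removal that
>> >>> >> summarize() applies anyway?)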
>> >>> >>
>> >>> >> Thanks in advance,
>> >>> >>
>> >>> >> Gavin
>> >>> >>
>> >>> >> sessionInfo() reports:
>> >>> >> R version 2.12.0 (2010-10-15)
>> >>> >> Platform: x86_64-unknown-linux-gnu (64-bit)
>> >>> >>
>> >>> >> locale:
>> >>> >>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>> >>> >>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>> >>> >>  [5] LC_MONETARY=C              LC_MESSAGES=C
>> >>> >>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
>> >>> >>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> >>> >> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>> >>> >>
>> >>> >> attached base packages:
>> >>> >> [1] stats     graphics  grDevices utils     datasets  methods
>> >>> >> base
>> >>> >>
>> >>> >> other attached packages:
>> >>> >> [1] beadarray_2.0.6 Biobase_2.10.0
>> >>> >>
>> >>> >> loaded via a namespace (and not attached):
>> >>> >> [1] limma_3.6.6
> --
> Mike Smith
> PhD Student
> Computational Biology Group
> Cambridge University
>



-- 
Hofstadter's Law: It always takes longer than you expect, even when
you take into account Hofstadter's Law.
—Douglas Hofstadter (in Gödel, Escher, Bach, 1979)


