[BioC] Problem running summarizeOverlaps()

Jessica Perry Hekman hekman2 at illinois.edu
Wed May 21 04:08:27 CEST 2014


I used

bamfls <- BamFileList(fls, yieldSize=100000)
options(mc.cores=8)

somewhat arbitrarily. With these settings, on my server, 
summarizeOverlaps() takes less than ten minutes to run, so it seems 
efficient enough for my purposes, but I'd be curious to hear Martin's 
response to Ryan's question.
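
For concreteness, Ryan's suggestion could be sketched as below. This is only an illustration under stated assumptions: per_worker_yield and its total_reads argument are hypothetical names, not part of Rsamtools or GenomicAlignments, and the 800000-read budget is an arbitrary number chosen so that the result matches the yieldSize used above. Whether dividing by the number of processes is the right policy is exactly the open question.

```r
# Hypothetical helper (not part of Rsamtools or GenomicAlignments):
# treat yieldSize as a total read budget and split it across workers.
per_worker_yield <- function(total_reads, workers) {
    stopifnot(total_reads >= 1, workers >= 1)
    # Integer division, but never fewer than one read per chunk.
    max(1L, as.integer(total_reads %/% workers))
}

workers <- 8L                               # same value as options(mc.cores=8)
yield <- per_worker_yield(800000, workers)  # assumed total budget of 800000 reads

# With Rsamtools loaded and `fls` holding the BAM file paths, the
# result would be passed on as in the thread:
# bamfls <- BamFileList(fls, yieldSize=yield)
```

With eight workers this gives 100000 reads per process, so roughly 800000 reads would be in memory across all processes at any given time.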

Martin: thanks so much for your help. I'm moving forward again. I 
imagine I'll be back as I proceed along the process, but your help has 
been much appreciated, as has your sense of humor.

Jessica


Jessica P. Hekman, DVM, MS
PhD student, University of Illinois, Urbana-Champaign
Animal Sciences / Genetics, Genomics, and Bioinformatics


On 05/20/2014 05:48 PM, Ryan Thompson wrote:
> Hi Martin, do you think there is a reasonable case for dividing the
> yield size by the number of simultaneous processes? That way the yield
> size would represent the total number of reads yielded across all
> processes at any given time.
>
> -Ryan
>
> On May 20, 2014 2:26 PM, "Martin Morgan" <mtmorgan at fhcrc.org> wrote:
>
>     On 05/20/2014 01:37 PM, Jessica Perry Hekman wrote:
>
>         On 05/20/2014 02:20 PM, Jessica Perry Hekman wrote:
>
>                     Error: C stack usage is too close to the limit
>
>
>                 You might then try adding a 'yieldSize' argument to the
>                 following line, starting small (e.g., 100000) and moving
>                 toward the default (1000000) if the small size works when
>                 calling summarizeOverlaps, or perhaps smaller if it fails.
>
>                     bamfls <- BamFileList(fls, yieldSize=100000)
>
>
>         So, this is perplexing. Is 1000000 really the default? Because I
>         can set yieldSize to much larger OR smaller than that and the
>         command will succeed (or at least complete without errors). But
>         when I do not specify yieldSize at all, there is an error!
>
>
>     Ok, I guess I did not remember correctly. If the function is passed
>     a BamFile / BamFileList, it respects the yieldSize in the File /
>     List. If yieldSize is not specified, then it'll try to read the
>     entire file into memory. And hilarity ensues. If passed a character
>     vector of file paths (I think this is supported in your version)
>     then summarizeOverlaps will set the default yieldSize to 1000000.
>
>     So yes, create the BamFileList with an appropriate yieldSize. From
>     your earlier email, yieldSize refers to the number of reads read in
>     at one time.
>
>     In terms of appropriate yieldSize, summarizeOverlaps will iterate
>     through individual bam files in chunks of yieldSize reads, and
>     simultaneously use parallel evaluation to process several bam files
>     at once. For you, the number of parallel processes is mc.cores,
>     which by default is just 2 but can be set using options(mc.cores=8)
>     or whatever; more recent versions use BiocParallel and
>     register(MulticoreParam()). So for optimal performance you want to
>     choose a yieldSize such that all cores (or as many as being
>     neighbourly dictates) are in use but not too much memory is being
>     consumed.
>
>     If you do decide to update your R, summarizeOverlaps has moved to
>     GenomicAlignments.
>
>     Martin
>
>
>           > bamfls <- BamFileList(fls, yieldSize=100000)
>           > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union",
>         +          ignore.strand=TRUE, single.end=TRUE, param=param)
>           > bamfls <- BamFileList(fls, yieldSize=500000)
>           > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union",
>         +          ignore.strand=TRUE, single.end=TRUE, param=param)
>           > bamfls <- BamFileList(fls, yieldSize=1000000)
>           > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union",
>         +          ignore.strand=TRUE, single.end=TRUE, param=param)
>           > bamfls <- BamFileList(fls, yieldSize=10000000)
>           > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union",
>         +          ignore.strand=TRUE, single.end=TRUE, param=param)
>           > bamfls <- BamFileList(fls, yieldSize=1000000000)
>           > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union",
>         +          ignore.strand=TRUE, single.end=TRUE, param=param)
>
>
>         BUT:
>
>           > bamfls <- BamFileList(fls)
>           > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union",
>         +          ignore.strand=TRUE, single.end=TRUE, param=param)
>         Error: C stack usage is too close to the limit
>
>         ?!
>
>         Jessica
>
>
>
>     --
>     Computational Biology / Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N.
>     PO Box 19024 Seattle, WA 98109
>
>     Location: Arnold Building M1 B861
>     Phone: (206) 667-2793
>
>     _________________________________________________
>     Bioconductor mailing list
>     Bioconductor at r-project.org
>     https://stat.ethz.ch/mailman/listinfo/bioconductor
>     Search the archives:
>     http://news.gmane.org/gmane.science.biology.informatics.conductor
>


