[BioC] yieldSize and paired end data in SummarizeOverlaps error?

Fri Mar 15 17:43:40 CET 2013

Hi Michael and Ryan,

Yes, all QNAME records fall within one chunk. When yieldSize is too 
small, scanBam() will return more than the yieldSize up to the 
break/difference in QNAME.

library(pasillaBamSubset)
fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
bf <- BamFile(fl, index=character(0), yieldSize=3, obeyQname=TRUE)
scn <- scanBam(bf, obeyQname=TRUE)
 > scn[[1]][c("qname", "seq")]
$qname
[1] "SRR031714.65"  "SRR031714.65"  "SRR031714.185" "SRR031714.185"

$seq
   A DNAStringSet instance of length 4
     width seq
[1]    37 CTGAACCGTCGAAGTATTAGCTTGCGGAACAGATCCA
[2]    37 GTTTGGGCTCGAGGGATTGGGCGATTGAGTGGCTATT
[3]    37 GGTGGATACTCGGTTTTCGCGGGAGTTGGTGAACGCA
[4]    37 AGGTGCTCGTGCCCGTGTTGCTCTAACTGGACTGACC

Combining this with position indexing is the next step. Interesting idea 
to create a 'last QNAME' flag. I'll look into that and also the 
'non-overlapping loci' approach. Thanks Ryan for sending the code link.

Valerie

On 03/14/2013 03:57 PM, Ryan C. Thompson wrote:
> A while back I was struggling with how I could process large paired bam
> files in managable chunks. I wrote a Python script called to take a BAM
> file and splits in into non-overlapping "loci", or sets of read and read
> pairs that don't overlap reads in other loci. The result is a bunch of
> small bam files, one per locus.
>
> https://raw.github.com/DarwinAwardWinner/splitloci/master/splitloci.py
>
> The important point is that read pairs only need to be matched with
> other reads in the same locus, which allows reading paired alignments
> from a position-sorted BAM file without having to read the whole file at
> once. The disadvantage of this approach is that the minimum required
> memory is dependent on the size of the largest locus in the file, and
> cannot be controlled easily. However the memory usage is almost certain
> to be significantly less than reading the entire file at once.
>
> If you want to, you could adapt the logic for reading alignment pairs.
>
> Of course, the concept of "loci" breaks down in the presence of fusion
> reads, and I'm not certain how it works with spliced reads from Tophat,
> since I believe the splice is encoded into the CIGAR string, which might
> throw off the calculation of the end position.
>
> -Ryan
>
> On 03/14/2013 03:30 PM, Michael Lawrence wrote:
>> This sounds interesting. Just want to make sure: does this ensure that
>> all
>> alignments for a given QNAME fall within one chunk? What happens if the
>> yieldsize is too small to achieve that?
>>
>> And of course the drawback here is that a QNAME-sorted BAM is not
>> indexable. So in practice one would need two BAMs, one sorted by POS, the
>> other by QNAME.
>>
>> What about this: preprocess a POS-sorted (and thus indexable) BAM and
>> add a
>> flag indicating whether an alignment is the last alignment for a given
>> QNAME. The scanner then yields all complete QNAMEs after it has scanned
>> 'yieldSize' such flags. I realize this is more work on your part, but
>> maybe
>> it saves the hassle of multiple BAMs? Just an idea.
>>
>> Michael
>>
>>
>>
>> On Thu, Mar 14, 2013 at 2:49 PM, Valerie Obenchain
>> <vobencha at fhcrc.org>wrote:
>>
>>> Hi Tom,
>>>
>>> Currently readBamGappedAlignmentPairs() reads all data from a Bam file
>>> into memory to sort and determine proper pairs. This is a problem for
>>> large
>>> files and is the reason for the error you see below.
>>>
>>> We've been working on alternative approach that involves first
>>> sorting the
>>> Bam file by qname. Once sorted, the file can be read in chunks
>>> specified by
>>> 'yieldSize' in the BamFile object. This functionality is available in
>>> GenomicRanges 1.11.38 and Rsamtools 1.11.25. It is a work in progress
>>> and I
>>> have not yet implemented a check to ensure the file is sorted by
>>> qname. If
>>> you try this out, please be sure to sort your Bam file by qname first.
>>>
>>> Examples of this approach are on the summarizeOverlaps man page for the
>>> BamFile method,
>>>
>>> ?'summarizeOverlaps,GRanges,**BamFileList-method'
>>>
>>> Here is a piece of the example:
>>>
>>> library(pasillaBamSubset)
>>> library("TxDb.Dmelanogaster.**UCSC.dm3.ensGene")
>>> exbygene <- exonsBy(TxDb.Dmelanogaster.**UCSC.dm3.ensGene, "gene")
>>>
>>> ## paired-end sorted by qname:
>>> ## Set 'singleEnd' to 'FALSE'. A BAM file sorted by qname
>>> ## can be read in chunks with 'yieldSize'.
>>> fl <- untreated3_chr4()
>>> sortfl <- sortBam(fl, tempfile(), byQname=TRUE)
>>> bf2 <- BamFileList(sortfl, index=character(0),
>>>                     yieldSize=50000, obeyQname=TRUE)
>>> se2 <- summarizeOverlaps(exbygene, bf2, singleEnd=FALSE)
>>> counts2 <- assays(se2)$counts
>>>
>>> ## paired-end not sorted:
>>> ## If the file is not sorted by qname, all records are read
>>> ## into memory for sorting and to determine proper pairs.
>>> ## Any 'yieldSize' set on the BamFile will be ignored.
>>> fl <- untreated3_chr4()
>>> bf3 <- BamFileList(fl)
>>> se3 <- summarizeOverlaps(exbygene, bf3, singleEnd=FALSE)
>>> counts3 <- assays(se3)$counts
>>>
>>>
>>> Another recent feature addition is the GALignmentsList class. The goal
>>> here is to provide a container that holds any type of read, single-end,
>>> paired-end, singletons, multiple fragments etc. Again, the methods are
>>> based on the user providing a Bam file sorted by qname. Once read in,
>>> the
>>> records are split on read id (QNAME in SAM/BAM). The associated read
>>> functions are in GenomicRanges and Rsamtools.
>>>
>>> ?readGAlignmentsList (GenomicRanges)
>>> ?readBamGAlignmentsList (Rsamtools)
>>>
>>> Let me know if you run into problems.
>>>
>>> Valerie
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 03/13/2013 11:07 AM, Thomas Whisenant wrote:
>>>
>>>> Hello all,
>>>> I am attempting to use summarizeOverlaps to assign counts to exons
>>>> using
>>>> paired end Tophat alignments. As an example, I have created a
>>>> bamfilelist
>>>> containing one bamfile and a feature list of exons by gene. This is
>>>> followed by using summarizeOverlaps with the single end parameter
>>>> set to
>>>> false.
>>>>
>>>> bfl <- BamFileList("EP04.bam", index="EP04.bam", yieldSize=1000000)
>>>> exons.by.gene <- exonsBy(TxDb.Hsapiens.UCSC.**hg19.knownGene, "gene")
>>>> all.exons <- unlist(exons.by.gene)
>>>> sumexp <- summarizeOverlaps(
>>>>     features=all.exons, reads=bfl,
>>>>    mode=IntersectionNotEmpty,
>>>>     singleEnd=FALSE,
>>>>     ignore.strand=TRUE, parallel=TRUE)
>>>> Error: cannot allocate vector of size 277.9 Mb
>>>> In addition: Warning message:
>>>> In readBamGappedAlignmentPairs(**bf, param = param) : 'yieldSize'
>>>> set to
>>>> 'NA'
>>>>
>>>> Even with the parallel option, this process uses all available memory
>>>> which presumably leads to the vector allocation error. While I have
>>>> set the
>>>> yieldSize in the BamFileList, this parameter does not appear to be
>>>> used.
>>>> Running summarizeOverlaps  with "singleEnd=TRUE" does not issue a
>>>> yieldSize
>>>> warning. Any ideas on how to get set the yieldSize with
>>>> "singleEnd=FALSE"?
>>>> Thanks,
>>>> Tom
>>>>
>>>>
>>>> R version 2.15.2 (2012-10-26)
>>>>
>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>
>>>>
>>>>
>>>> locale:
>>>>
>>>> [1] LC_COLLATE=English_United States.1252
>>>>
>>>> [2] LC_CTYPE=English_United States.1252
>>>>
>>>> [3] LC_MONETARY=English_United States.1252
>>>>
>>>> [4] LC_NUMERIC=C
>>>>
>>>> [5] LC_TIME=English_United States.1252
>>>>
>>>>
>>>>
>>>> attached base packages:
>>>>
>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>>
>>>> [8] base
>>>>
>>>>
>>>>
>>>> other attached packages:
>>>>
>>>>    [1] DEXSeq_1.4.0
>>>>
>>>>    [2] Rsamtools_1.10.2
>>>>
>>>>    [3] org.Hs.eg.db_2.8.0
>>>>
>>>>    [4] RSQLite_0.11.2
>>>>
>>>>    [5] DBI_0.2-5
>>>>
>>>>    [6] TxDb.Hsapiens.UCSC.hg19.**knownGene_2.8.0
>>>>
>>>>    [7] GenomicFeatures_1.10.1
>>>>
>>>>    [8] BSgenome.Hsapiens.UCSC.hg19_1.**3.19
>>>>
>>>>    [9] BSgenome_1.26.1
>>>>
>>>> [10] Biostrings_2.26.3
>>>>
>>>> [11] GenomicRanges_1.10.6
>>>>
>>>> [12] IRanges_1.16.5
>>>>
>>>> [13] annotate_1.36.0
>>>>
>>>> [14] AnnotationDbi_1.20.3
>>>>
>>>> [15] Biobase_2.18.0
>>>>
>>>> [16] BiocGenerics_0.4.0
>>>>
>>>>
>>>>
>>>> loaded via a namespace (and not attached):
>>>>
>>>>    [1] biomaRt_2.14.0     bitops_1.0-5       hwriter_1.3
>>>>
>>>>    [4] RCurl_1.95-3       rtracklayer_1.18.2 statmod_1.4.17
>>>>
>>>>    [7] stats4_2.15.2      stringr_0.6.2      tools_2.15.2
>>>>
>>>> [10] XML_3.95-0.1       xtable_1.7-0       zlibbioc_1.4.0
>>>>
>>>>
>>>>
>>>> Thomas Whisenant, Ph.D.
>>>> Salomon Lab, MEM-241
>>>> Department of Molecular and Experimental Medicine
>>>> The Scripps Research Institute
>>>> 10550 North Torrey Pines Rd.
>>>> La Jolla, CA 92037
>>>> thomasw at scripps.edu<mailto:tho**masw at scripps.edu <thomasw at scripps.edu>>
>>>>
>>>>
>>>>          [[alternative HTML version deleted]]
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>>
>>>> Search the archives: http://news.gmane.org/gmane.**
>>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>>>
>>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>>>
>>     [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>