[BioC] yieldSize and paired end data in SummarizeOverlaps error?

Fri Mar 15 18:21:38 CET 2013

No problem. Feel free to contact me if you have any questions about the 
code.

On Fri 15 Mar 2013 09:43:40 AM PDT, Valerie Obenchain wrote:
> Hi Michael and Ryan,
>
> Yes, all QNAME records fall within one chunk. When yieldSize is too
> small, scanBam() will return more than the yieldSize up to the
> break/difference in QNAME.
>
> library(pasillaBamSubset)
> fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
> bf <- BamFile(fl, index=character(0), yieldSize=3, obeyQname=TRUE)
> scn <- scanBam(bf, obeyQname=TRUE)
> > scn[[1]][c("qname", "seq")]
> $qname
> [1] "SRR031714.65"  "SRR031714.65"  "SRR031714.185" "SRR031714.185"
>
> $seq
>   A DNAStringSet instance of length 4
>     width seq
> [1]    37 CTGAACCGTCGAAGTATTAGCTTGCGGAACAGATCCA
> [2]    37 GTTTGGGCTCGAGGGATTGGGCGATTGAGTGGCTATT
> [3]    37 GGTGGATACTCGGTTTTCGCGGGAGTTGGTGAACGCA
> [4]    37 AGGTGCTCGTGCCCGTGTTGCTCTAACTGGACTGACC
>
>
> Combining this with position indexing is the next step. Interesting
> idea to create a 'last QNAME' flag. I'll look into that and also the
> 'non-overlapping loci' approach. Thanks Ryan for sending the code link.
>
>
> Valerie
>
> On 03/14/2013 03:57 PM, Ryan C. Thompson wrote:
>> A while back I was struggling with how I could process large paired bam
>> files in managable chunks. I wrote a Python script called to take a BAM
>> file and splits in into non-overlapping "loci", or sets of read and read
>> pairs that don't overlap reads in other loci. The result is a bunch of
>> small bam files, one per locus.
>>
>> https://raw.github.com/DarwinAwardWinner/splitloci/master/splitloci.py
>>
>> The important point is that read pairs only need to be matched with
>> other reads in the same locus, which allows reading paired alignments
>> from a position-sorted BAM file without having to read the whole file at
>> once. The disadvantage of this approach is that the minimum required
>> memory is dependent on the size of the largest locus in the file, and
>> cannot be controlled easily. However the memory usage is almost certain
>> to be significantly less than reading the entire file at once.
>>
>> If you want to, you could adapt the logic for reading alignment pairs.
>>
>> Of course, the concept of "loci" breaks down in the presence of fusion
>> reads, and I'm not certain how it works with spliced reads from Tophat,
>> since I believe the splice is encoded into the CIGAR string, which might
>> throw off the calculation of the end position.
>>
>> -Ryan
>>
>> On 03/14/2013 03:30 PM, Michael Lawrence wrote:
>>> This sounds interesting. Just want to make sure: does this ensure that
>>> all
>>> alignments for a given QNAME fall within one chunk? What happens if the
>>> yieldsize is too small to achieve that?
>>>
>>> And of course the drawback here is that a QNAME-sorted BAM is not
>>> indexable. So in practice one would need two BAMs, one sorted by
>>> POS, the
>>> other by QNAME.
>>>
>>> What about this: preprocess a POS-sorted (and thus indexable) BAM and
>>> add a
>>> flag indicating whether an alignment is the last alignment for a given
>>> QNAME. The scanner then yields all complete QNAMEs after it has scanned
>>> 'yieldSize' such flags. I realize this is more work on your part, but
>>> maybe
>>> it saves the hassle of multiple BAMs? Just an idea.
>>>
>>> Michael
>>>
>>>
>>>
>>> On Thu, Mar 14, 2013 at 2:49 PM, Valerie Obenchain
>>> <vobencha at fhcrc.org>wrote:
>>>
>>>> Hi Tom,
>>>>
>>>> Currently readBamGappedAlignmentPairs() reads all data from a Bam file
>>>> into memory to sort and determine proper pairs. This is a problem for
>>>> large
>>>> files and is the reason for the error you see below.
>>>>
>>>> We've been working on alternative approach that involves first
>>>> sorting the
>>>> Bam file by qname. Once sorted, the file can be read in chunks
>>>> specified by
>>>> 'yieldSize' in the BamFile object. This functionality is available in
>>>> GenomicRanges 1.11.38 and Rsamtools 1.11.25. It is a work in progress
>>>> and I
>>>> have not yet implemented a check to ensure the file is sorted by
>>>> qname. If
>>>> you try this out, please be sure to sort your Bam file by qname first.
>>>>
>>>> Examples of this approach are on the summarizeOverlaps man page for
>>>> the
>>>> BamFile method,
>>>>
>>>> ?'summarizeOverlaps,GRanges,**BamFileList-method'
>>>>
>>>> Here is a piece of the example:
>>>>
>>>> library(pasillaBamSubset)
>>>> library("TxDb.Dmelanogaster.**UCSC.dm3.ensGene")
>>>> exbygene <- exonsBy(TxDb.Dmelanogaster.**UCSC.dm3.ensGene, "gene")
>>>>
>>>> ## paired-end sorted by qname:
>>>> ## Set 'singleEnd' to 'FALSE'. A BAM file sorted by qname
>>>> ## can be read in chunks with 'yieldSize'.
>>>> fl <- untreated3_chr4()
>>>> sortfl <- sortBam(fl, tempfile(), byQname=TRUE)
>>>> bf2 <- BamFileList(sortfl, index=character(0),
>>>>                     yieldSize=50000, obeyQname=TRUE)
>>>> se2 <- summarizeOverlaps(exbygene, bf2, singleEnd=FALSE)
>>>> counts2 <- assays(se2)$counts
>>>>
>>>> ## paired-end not sorted:
>>>> ## If the file is not sorted by qname, all records are read
>>>> ## into memory for sorting and to determine proper pairs.
>>>> ## Any 'yieldSize' set on the BamFile will be ignored.
>>>> fl <- untreated3_chr4()
>>>> bf3 <- BamFileList(fl)
>>>> se3 <- summarizeOverlaps(exbygene, bf3, singleEnd=FALSE)
>>>> counts3 <- assays(se3)$counts
>>>>
>>>>
>>>> Another recent feature addition is the GALignmentsList class. The goal
>>>> here is to provide a container that holds any type of read,
>>>> single-end,
>>>> paired-end, singletons, multiple fragments etc. Again, the methods are
>>>> based on the user providing a Bam file sorted by qname. Once read in,
>>>> the
>>>> records are split on read id (QNAME in SAM/BAM). The associated read
>>>> functions are in GenomicRanges and Rsamtools.
>>>>
>>>> ?readGAlignmentsList (GenomicRanges)
>>>> ?readBamGAlignmentsList (Rsamtools)
>>>>
>>>> Let me know if you run into problems.
>>>>
>>>> Valerie
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 03/13/2013 11:07 AM, Thomas Whisenant wrote:
>>>>
>>>>> Hello all,
>>>>> I am attempting to use summarizeOverlaps to assign counts to exons
>>>>> using
>>>>> paired end Tophat alignments. As an example, I have created a
>>>>> bamfilelist
>>>>> containing one bamfile and a feature list of exons by gene. This is
>>>>> followed by using summarizeOverlaps with the single end parameter
>>>>> set to
>>>>> false.
>>>>>
>>>>> bfl <- BamFileList("EP04.bam", index="EP04.bam", yieldSize=1000000)
>>>>> exons.by.gene <- exonsBy(TxDb.Hsapiens.UCSC.**hg19.knownGene, "gene")
>>>>> all.exons <- unlist(exons.by.gene)
>>>>> sumexp <- summarizeOverlaps(
>>>>>     features=all.exons, reads=bfl,
>>>>>    mode=IntersectionNotEmpty,
>>>>>     singleEnd=FALSE,
>>>>>     ignore.strand=TRUE, parallel=TRUE)
>>>>> Error: cannot allocate vector of size 277.9 Mb
>>>>> In addition: Warning message:
>>>>> In readBamGappedAlignmentPairs(**bf, param = param) : 'yieldSize'
>>>>> set to
>>>>> 'NA'
>>>>>
>>>>> Even with the parallel option, this process uses all available memory
>>>>> which presumably leads to the vector allocation error. While I have
>>>>> set the
>>>>> yieldSize in the BamFileList, this parameter does not appear to be
>>>>> used.
>>>>> Running summarizeOverlaps  with "singleEnd=TRUE" does not issue a
>>>>> yieldSize
>>>>> warning. Any ideas on how to get set the yieldSize with
>>>>> "singleEnd=FALSE"?
>>>>> Thanks,
>>>>> Tom
>>>>>
>>>>>
>>>>> R version 2.15.2 (2012-10-26)
>>>>>
>>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>
>>>>>
>>>>>
>>>>> locale:
>>>>>
>>>>> [1] LC_COLLATE=English_United States.1252
>>>>>
>>>>> [2] LC_CTYPE=English_United States.1252
>>>>>
>>>>> [3] LC_MONETARY=English_United States.1252
>>>>>
>>>>> [4] LC_NUMERIC=C
>>>>>
>>>>> [5] LC_TIME=English_United States.1252
>>>>>
>>>>>
>>>>>
>>>>> attached base packages:
>>>>>
>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>>> methods
>>>>>
>>>>> [8] base
>>>>>
>>>>>
>>>>>
>>>>> other attached packages:
>>>>>
>>>>>    [1] DEXSeq_1.4.0
>>>>>
>>>>>    [2] Rsamtools_1.10.2
>>>>>
>>>>>    [3] org.Hs.eg.db_2.8.0
>>>>>
>>>>>    [4] RSQLite_0.11.2
>>>>>
>>>>>    [5] DBI_0.2-5
>>>>>
>>>>>    [6] TxDb.Hsapiens.UCSC.hg19.**knownGene_2.8.0
>>>>>
>>>>>    [7] GenomicFeatures_1.10.1
>>>>>
>>>>>    [8] BSgenome.Hsapiens.UCSC.hg19_1.**3.19
>>>>>
>>>>>    [9] BSgenome_1.26.1
>>>>>
>>>>> [10] Biostrings_2.26.3
>>>>>
>>>>> [11] GenomicRanges_1.10.6
>>>>>
>>>>> [12] IRanges_1.16.5
>>>>>
>>>>> [13] annotate_1.36.0
>>>>>
>>>>> [14] AnnotationDbi_1.20.3
>>>>>
>>>>> [15] Biobase_2.18.0
>>>>>
>>>>> [16] BiocGenerics_0.4.0
>>>>>
>>>>>
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>>
>>>>>    [1] biomaRt_2.14.0     bitops_1.0-5       hwriter_1.3
>>>>>
>>>>>    [4] RCurl_1.95-3       rtracklayer_1.18.2 statmod_1.4.17
>>>>>
>>>>>    [7] stats4_2.15.2      stringr_0.6.2      tools_2.15.2
>>>>>
>>>>> [10] XML_3.95-0.1       xtable_1.7-0       zlibbioc_1.4.0
>>>>>
>>>>>
>>>>>
>>>>> Thomas Whisenant, Ph.D.
>>>>> Salomon Lab, MEM-241
>>>>> Department of Molecular and Experimental Medicine
>>>>> The Scripps Research Institute
>>>>> 10550 North Torrey Pines Rd.
>>>>> La Jolla, CA 92037
>>>>> thomasw at scripps.edu<mailto:tho**masw at scripps.edu
>>>>> <thomasw at scripps.edu>>
>>>>>
>>>>>
>>>>>          [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________**_________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>>>
>>>>>
>>>>> Search the archives: http://news.gmane.org/gmane.**
>>>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>>
>>>>>
>>>>>
>>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>>
>>>>
>>>> Search the archives: http://news.gmane.org/gmane.**
>>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>>>
>>>>
>>>     [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>