[BioC] readGapppedAlignmentpairs questions

Martin Morgan mtmorgan at fhcrc.org
Sat Jun 23 13:44:13 CEST 2012


On 06/23/2012 04:21 AM, Martin Morgan wrote:
> On 06/22/2012 07:22 AM, Wei Shi wrote:
>> Dear Lakshmanan,
>>
>> If the purpose of your analysis is to count reads falling within each
>> feature, you may consider using the featureCounts() function in
>> Rsubread package. It takes only about 2 minutes to summarize 10
>> million reads into a count table. But it only accept SAM files (you
>> can use samtools to convert your BAM files to SAM files) and it only
>> works on unix. See ?featureCounts() for more info.
>
> Hi Wei -- can you clarify how you are counting reads? From a quick scan
> of your man page / C source code it looks like you're counting each pair
> of a paired end separately, and looking for a read whose start position
> is in an exon / gene? This elementary counting scheme (on a bam file) is
> just
>
> ## what features? any GRanges or GRangesList, e.g.,
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> exByGn = exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
>
> ## what reads? GRanges from 'rname' and 'pos'
> param = ScanBamParam(what=c("rname", "qual"),
> flag=scanBamFlag(isUnmappedRead=FALSE))
> with(scanBam(aBamFile, param=p2)[[1]], {
> GRanges(rname, IRanges(pos, width=1))
> })

Oops, should have assigned that GRanges to 'reads'

   reads = with(scanBam(aBamFile, param=p2)[[1]], {
       GRanges(rname, IRanges(pos, width=1))
   })

The scheme has obvious limitations, counting each end of a paired-end 
read separately, counting reads that overhang gene or exon boundaries 
(though sometimes that might be ok), counting overhanging reads that 
start in a range but not those that end, etc.

Martin
>
> ## count, 'top level' of GRangesList, so counts per gene
> countOverlaps(exByGn, reads)
>
> This will be fast and memory friendly. ?countBam is another alternative,
> also memory efficient and taking this simple approach.
>
> ?summarizeOverlaps gives better counting schemes for single-end reads,
> and is also reasonably fast (and in devel space efficient, iterating
> over the bam file, and with some support for paired-end reads).
>
> ?readGappedAlignmentPairs, from the original post, tries to make sense
> of paired end reads, and is less memory / speed friendly (but the OP has
> a lot of memory).
>
> Martin
>
>>
>> For your second question, if the pair of reads is indeed mapped as a
>> pair, then the region between them will be covered as well if the two
>> reads are on the same exon. But the reality is that not every read
>> pair can be successfully mapped as pairs. You may get only one end
>> mapped, or the two ends are mapped to two locations which have a
>> distance much bigger than the average fragment lengths. In these
>> cases, you don't even know what are the exons which lie between the
>> two reads.
>>
>> Hope this helps.
>>
>> Cheers,
>> Wei
>>
>> On Jun 22, 2012, at 11:56 PM, Lakshmanan Iyer wrote:
>>
>>> Hi
>>> My apologies for multiple posting if it happens-- I sent the last mails
>>> from other accounts which may not be registered with Bioc-list
>>> -Lax
>>> Two questions:
>>>
>>> 1. Is readGapppedAlignmentPairs - the most efficient way to read a
>>> paired-end bam file with mulit-mapped reads?
>>> I am asking as it takes an enormous amount of time to process and load.
>>>
>>> 2. How does one work with coverage on GappedAlignmentPairs in the
>>> context
>>> of RNASeq?
>>> The simplest way is to consider each left and right read as separate -
>>> essentially loose the "paired" information and calculate coverage.
>>> However, if both the left and right pair reads fall within a feature of
>>> interest - say an exon, does it imply coverage of the region of the exon
>>> between the reads too
>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>> LLLLLLLLLL---------------------------RRRRRRRRRR
>>> ^^^^^^^^^^^^^^^^^
>>>
>>> In the figure above, the exon is represented by ">" and L and R
>>> represents
>>> the left and right reads aligned to the exon.
>>> I am talking about the region represented by "^". Do we assume coverage
>>> for this region too?
>>> Does Coverage on GappedAlignmentPairs do this?
>>>
>>> -best
>>> -Lax
>>> Center for Neuroscience Research
>>> Tufts Univeristy School of Medicine
>>> Boston, MA
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>> ______________________________________________________________________
>> The information in this email is confidential and inte...{{dropped:17}}
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list