[BioC] readGapppedAlignmentpairs questions

Martin Morgan mtmorgan at fhcrc.org
Sat Jun 23 13:21:59 CEST 2012


On 06/22/2012 07:22 AM, Wei Shi wrote:
> Dear Lakshmanan,
>
> If the purpose of your analysis is to count reads falling within each feature, you may consider using the featureCounts() function in Rsubread package. It takes only about 2 minutes to summarize 10 million reads into a count table. But it only accept SAM files (you can use samtools to convert your BAM files to SAM files) and it only works on unix. See ?featureCounts() for more info.

Hi Wei -- can you clarify how you are counting reads? From a quick scan 
of your man page / C source code it looks like you're counting each pair 
of a paired end separately, and looking for a read whose start position 
is in an exon / gene? This elementary counting scheme (on a bam file) is 
just

   ## what features? any GRanges or GRangesList, e.g.,
   library(TxDb.Hsapiens.UCSC.hg19.knownGene)
   exByGn = exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")

   ## what reads? GRanges from 'rname' and 'pos'
   param = ScanBamParam(what=c("rname", "qual"),
                        flag=scanBamFlag(isUnmappedRead=FALSE))
   with(scanBam(aBamFile, param=p2)[[1]], {
       GRanges(rname, IRanges(pos, width=1))
   })

   ## count, 'top level' of GRangesList, so counts per gene
   countOverlaps(exByGn, reads)

This will be fast and memory friendly. ?countBam is another alternative, 
also memory efficient and taking this simple approach.

?summarizeOverlaps gives better counting schemes for single-end reads, 
and is also reasonably fast (and in devel space efficient, iterating 
over the bam file, and with some support for paired-end reads).

?readGappedAlignmentPairs, from the original post, tries to make sense 
of paired end reads, and is less memory / speed friendly (but the OP has 
a lot of memory).

Martin

>
> For your second question, if the pair of reads is indeed mapped as a pair, then the region between them will be covered as well if the two reads are on the same exon. But the reality is that not every read pair can be successfully mapped as pairs. You may get only one end mapped, or the two ends are mapped to two locations which have a distance much bigger than the average fragment lengths. In these cases, you don't even know what are the exons which lie between the two reads.
>
> Hope this helps.
>
> Cheers,
> Wei
>
> On Jun 22, 2012, at 11:56 PM, Lakshmanan Iyer wrote:
>
>> Hi
>> My apologies for multiple posting if it happens-- I sent the last mails
>> from other accounts which may not be registered with Bioc-list
>> -Lax
>> Two questions:
>>
>> 1. Is readGapppedAlignmentPairs - the most efficient way to read a
>> paired-end bam file with mulit-mapped reads?
>> I am asking as it takes an enormous amount of time to process and load.
>>
>> 2. How does one work with coverage on GappedAlignmentPairs  in the context
>> of RNASeq?
>> The simplest way is to consider each left and right read as separate -
>> essentially loose the "paired" information and calculate coverage.
>> However, if both the left and right pair reads fall within a feature of
>> interest - say an exon, does it imply coverage of the region of the exon
>> between the reads too
>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>   LLLLLLLLLL---------------------------RRRRRRRRRR
>>                     ^^^^^^^^^^^^^^^^^
>>
>> In the figure above, the exon is represented by ">" and L and R represents
>> the left and right reads aligned to the exon.
>> I am talking about the region represented by "^".  Do we assume coverage
>> for this region too?
>> Does Coverage on GappedAlignmentPairs do this?
>>
>> -best
>> -Lax
>> Center for Neuroscience Research
>> Tufts Univeristy School of Medicine
>> Boston, MA
>>
>> 	[[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:17}}



More information about the Bioconductor mailing list