[BioC] readGapppedAlignmentpairs questions

Hervé Pagès hpages at fhcrc.org
Fri Jun 22 20:11:22 CEST 2012


Hi Iyer,

On 06/22/2012 06:56 AM, Lakshmanan Iyer wrote:
> Hi
> My apologies for multiple posting if it happens-- I sent the last mails
> from other accounts which may not be registered with Bioc-list
> -Lax
> Two questions:
>
> 1. Is readGapppedAlignmentPairs - the most efficient way to read a
> paired-end bam file with mulit-mapped reads?
> I am asking as it takes an enormous amount of time to process and load.

With a recent version of Rsamtools (>= 1.9.10), last time I tried it
took between 30 and 40 min to call readGapppedAlignmentPairs() on a BAM
file with 70 million records (35 million pairs). Even for a fixed nb of
records and using the same machine, timing could vary a lot depending
on how "hard" it is to pair the records which mostly depends on the
average number of records sharing the same QNAME (the lowest, the
easiest). You can get that number using the new quickBamCounts() utility
from Rsamtools:

 > quickBamCounts("AML_330-0_gsnap_filter.bam")
                                 group |    nb of |    nb of | mean / max
                                    of |  records |   unique | records per
                               records | in group |   QNAMEs | unique QNAME
All records........................ A | 70446309 | 35407913 | 1.99 / 2
   o template has single segment.... S |        0 |        0 |   NA / NA
   o template has multiple segments. M | 70446309 | 35407913 | 1.99 / 2
       - first segment.............. F | 35313360 | 35313360 |    1 / 1
       - last segment............... L | 35132949 | 35132949 |    1 / 1
       - other segment.............. O |        0 |        0 |   NA / NA

Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning 
of M.
Indentation reflects this.

Here the average number of records per unique QNAME is 1.99 (and the max
is 2), which is ideal.

I don't remember the exact amount of memory needed to load that file
with 70M records though. All I remember is that you need a lot :-/
(probably > 20GB). Make sure you have enough memory and that your system
is not swapping.

readGapppedAlignmentPairs() is basically calling readGapppedAlignments()
followed by findMateAlignment(). Each of the 2 operations are expensive
and IIRC I'm not sure there are obvious/easy optimizations that could
be done on readGapppedAlignments() itself. However, for
findMateAlignment(), there are a few easy ones that have been
on my list for a couple of months now and that I will implement ASAP.
They should improve speed and reduce memory usage.

>
> 2. How does one work with coverage on GappedAlignmentPairs  in the context
> of RNASeq?
> The simplest way is to consider each left and right read as separate -
> essentially loose the "paired" information and calculate coverage.
> However, if both the left and right pair reads fall within a feature of
> interest - say an exon, does it imply coverage of the region of the exon
> between the reads too
>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>    LLLLLLLLLL---------------------------RRRRRRRRRR
>                      ^^^^^^^^^^^^^^^^^
>
> In the figure above, the exon is represented by ">" and L and R represents
> the left and right reads aligned to the exon.
> I am talking about the region represented by "^".  Do we assume coverage
> for this region too?
> Does Coverage on GappedAlignmentPairs do this?

No, coverage on a GappedAlignmentPairs does not do this, but
'coverage(range(grglist(galp)))' will do this.

Cheers,
H.

>
> -best
> -Lax
> Center for Neuroscience Research
> Tufts Univeristy School of Medicine
> Boston, MA
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list