[BioC] Varied performance of readGappedAlignmentPairs function

Hervé Pagès hpages at fhcrc.org
Fri Jun 21 20:39:24 CEST 2013


Hi Michael, Pan,

On 06/21/2013 05:13 AM, Michael Lawrence wrote:
> The main problem with the performance of this function is that it
> appears to have quadratic complexity. The yield-size approach would
> mitigate that, but I'm left wondering why it has those characteristics.
> Just R memory management?


>     Anyway, I just spent some time optimizing findMateAlignment() a little
>     bit and the timings are slightly better now (2x to 5x faster or more,
>     depending on the size of the file):
>
>               nb of alignments |         time | required memory
>               -----------------+------------__--+----------------
>                     8 millions |       32 sec |          1.7 GB
>                    16 millions |       65 sec |          3.3 GB
>                    32 millions | 2 min 25 sec |          6.5 GB
>                    64 millions | 5 min 40 sec |         13.0 GB
>

Doesn't look quadratic with respect to the size of the file.

But it is quadratic with respect to the "average nb of records per
unique QNAME". The above timings are for an GAlignments object coming
from a file where this value is 2.04 (close to the optimal value of 2).
If this value is 3, expect those timings to be multiplied by 3. If it's
4, by 6x. If it's 5, by 10. If it's 6, by 15.

I don't think it has much to do with R memory management. It's more
the way the algo works: alignments are first grouped by QNAME, then
within each group, all possible pairs are examined. So if there are
6 alignments in a group, there are 15 pairs to examine. It's a very
bold algo and it could certainly be improved.

But as I said earlier, all this can be avoided by setting the
isNotPrimaryRead flag (flag 0x100) to FALSE in ScanBamParam().
That should *in theory* bring the "average nb of records per unique
QNAME" down to 2. Then the complexity is almost linear with respect
to the size of the BAM file (actually it's more something like
n x log(n) because of the grouping by QNAME that is based on qsort()).

However I've seen some aligners not setting properly the 0x100 flag
for secondary alignments so using the isNotPrimaryRead=FALSE does not
always bring the "average nb of records per unique QNAME" down to 2.

If you guys have BAM files where this value is 2 (or is > 2 but you
use isNotPrimaryRead=FALSE) and you still observe abnormal timings
(after you've upgraded to Rsamtools 1.13.19), please send them to
me (or make them available somewhere) and I'll have a look at it.

Thanks!
H.


>     This is with Rsamtools 1.13.19 (BioC devel) which should become
>     available via biocLite() in the next 36 hours or so.
>
>     I should also mention that Martin and Val have plans to make
>     readGappedAlignmentPairs() work on a BamFile object with a
>     user-specified yieldSize. This will allow processing the file
>     by chunk (like it's already possible with readGappedAlignments()).
>
>     Cheers,
>     H.
>
>
>         Thanks!
>
>         Pan
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>     _________________________________________________
>     Bioconductor mailing list
>     Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>     https://stat.ethz.ch/mailman/__listinfo/bioconductor
>     <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>     Search the archives:
>     http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list