[BioC] Varied performance of readGappedAlignmentPairs function

Hervé Pagès hpages at fhcrc.org
Fri Jun 21 11:09:52 CEST 2013


Hi Pan,

Sorry for the delay in answering this.

On 06/05/2013 11:46 PM, Pan Du wrote:
> Hi Herve
>
> I am using readGappedAlignmentPairs to retrieve the paired read information from BAM files. I found the performance of the function changed a lot from sample to sample, and it seems not related with the size of BAM files.
>   Also the increase of running time is not linear with the increase of transcript number, from which we want to retrieve information. Michael suggests it may be related with how readGappedAlignmentPairs function responds to weird alignments. I want to know whether you also observed similar problems and whether there is a solution for it.

The time of the pairing algorithm (implemented in findMateAlignment())
depends of course on the number of records in the BAM file, but also
on the "average nb of records per unique QNAME". You can quickly get
this value with the quickCountBam() utility:

 > library(RNAseqData.HNRNPC.bam.chr14)
 > bamfile <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L]
 > quickCountBam(bamfile, mainGroupsOnly=TRUE)
                                 group |    nb of |    nb of | mean / max
                                    of |  records |   unique | records per
                               records | in group |   QNAMEs | unique QNAME
All records........................ A |   800484 |   393300 | 2.04 / 10
   o template has single segment.... S |        0 |        0 |   NA / NA
   o template has multiple segments. M |   800484 |   393300 | 2.04 / 10
       - first segment.............. F |   400242 |   393300 | 1.02 / 5
       - last segment............... L |   400242 |   393300 | 1.02 / 5
       - other segment.............. O |        0 |        0 |   NA / NA

The value I'm talking about is the first value in the last column on
the 3rd line (M group): 2.04. With paired-end reads, this value is
always >= 2. If the aligner only reported primary alignments, it should
be exactly 2. It's > 2 only if secondary alignments were also reported
(up to 5 alignments per read for this BAM file).

A value of 2 (i.e. only primary reads) is optimal for the pairing
algorithm. A greater value, say > 3 will significantly degrade its
performance. An easy way to avoid this degradation is to load only
primary alignments by setting the appropriate flag in ScanBamParam().

Anyway, I just spent some time optimizing findMateAlignment() a little
bit and the timings are slightly better now (2x to 5x faster or more,
depending on the size of the file):

          nb of alignments |         time | required memory
          -----------------+--------------+----------------
                8 millions |       32 sec |          1.7 GB
               16 millions |       65 sec |          3.3 GB
               32 millions | 2 min 25 sec |          6.5 GB
               64 millions | 5 min 40 sec |         13.0 GB

This is with Rsamtools 1.13.19 (BioC devel) which should become
available via biocLite() in the next 36 hours or so.

I should also mention that Martin and Val have plans to make
readGappedAlignmentPairs() work on a BamFile object with a
user-specified yieldSize. This will allow processing the file
by chunk (like it's already possible with readGappedAlignments()).

Cheers,
H.


> Thanks!
>
> Pan
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list