[BioC] Computing the coverage on a bamfile with a huge amount of seqlevels

Hervé Pagès hpages at fhcrc.org
Fri Oct 25 12:44:03 CEST 2013


Hi Stephanie,

Just to let you know that in GenomicRanges 1.14.2, coverage() should
now be much faster on objects with many seqlevels. Should be about
1000x faster or more if you have 50 000 seqlevels.

The only case where it's still slow is if many seqlevels are marked as
circular (and have a non-NA seqlength) in seqinfo(x), because for
these seqlevels the coverage vector is still post processed by a
loop in R. Could be a problem if for example people have BAM files
with reads aligned to thousands of bacterial genomes.

GenomicRanges 1.14.2 and IRanges 1.20.1 should become available thru
biocLite() in the next 24 hours or so. The man pages for the "coverage"
methods have been improved, especially in IRanges.

Cheers,
H.


On 10/18/2013 12:19 AM, Stefanie Tauber wrote:
> Thanks for the answer!
> Just wanted to make sure that I don't miss a relevant feature….
>
> Best,
> Stefanie
>
> Am 17.10.2013 um 22:07 schrieb Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>>:
>
>> Hi Stephanie,
>>
>> On 10/17/2013 07:37 AM, Stefanie Tauber wrote:
>>> What is the best way to compute the coverage on a ReadGAlignments object
>>> when I have many, many (about 50 000) seqlevels?
>>>
>>> Just computing coverage(object) is terribly slow due to the huge
>>> amount of seqlevels.
>>>
>>> At the moment, I am subsetting the ReadGAlignments object for each
>>> seqlevel,
>>>  I omit the redundant seqlevels and calculate the coverage.
>>> This is fast if you do it for a few seqlevels.
>>> But if one really would like to compute the coverage for each
>>> seqlevel individually,
>>> this really takes a long time..
>>
>> Yes this is a known weakness in the current implementation of
>> coverage(). It loops over the seqlevels and this loop is currently
>> performed in R, which is slow. Unfortunately this is not the first
>> time someone complains about this.
>>
>> Time for us to bite the bullet I guess. I've just started to work
>> on moving the loop to the C-level. That should speed up things
>> significantly. I'll let you know when this is ready.
>>
>> Thanks for the feedback.
>> H.
>>
>>>
>>>
>>> I would appreciate any comments!
>>>
>>> Best,
>>> Stefanie
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list