[BioC] Computing the coverage on a bamfile with a huge amount of seqlevels

Stefanie Tauber stefanie.tauber at univie.ac.at
Fri Oct 25 14:32:36 CEST 2013


Thanks a lot!
Can’t wait to try the new version!

Best,
Stefanie

Am 25.10.2013 um 12:44 schrieb Hervé Pagès <hpages at fhcrc.org>:

> Hi Stephanie,
> 
> Just to let you know that in GenomicRanges 1.14.2, coverage() should
> now be much faster on objects with many seqlevels. Should be about
> 1000x faster or more if you have 50 000 seqlevels.
> 
> The only case where it's still slow is if many seqlevels are marked as
> circular (and have a non-NA seqlength) in seqinfo(x), because for
> these seqlevels the coverage vector is still post processed by a
> loop in R. Could be a problem if for example people have BAM files
> with reads aligned to thousands of bacterial genomes.
> 
> GenomicRanges 1.14.2 and IRanges 1.20.1 should become available thru
> biocLite() in the next 24 hours or so. The man pages for the "coverage"
> methods have been improved, especially in IRanges.
> 
> Cheers,
> H.
> 
> 
> On 10/18/2013 12:19 AM, Stefanie Tauber wrote:
>> Thanks for the answer!
>> Just wanted to make sure that I don't miss a relevant feature….
>> 
>> Best,
>> Stefanie
>> 
>> Am 17.10.2013 um 22:07 schrieb Hervé Pagès <hpages at fhcrc.org
>> <mailto:hpages at fhcrc.org>>:
>> 
>>> Hi Stephanie,
>>> 
>>> On 10/17/2013 07:37 AM, Stefanie Tauber wrote:
>>>> What is the best way to compute the coverage on a ReadGAlignments object
>>>> when I have many, many (about 50 000) seqlevels?
>>>> 
>>>> Just computing coverage(object) is terribly slow due to the huge
>>>> amount of seqlevels.
>>>> 
>>>> At the moment, I am subsetting the ReadGAlignments object for each
>>>> seqlevel,
>>>> I omit the redundant seqlevels and calculate the coverage.
>>>> This is fast if you do it for a few seqlevels.
>>>> But if one really would like to compute the coverage for each
>>>> seqlevel individually,
>>>> this really takes a long time..
>>> 
>>> Yes this is a known weakness in the current implementation of
>>> coverage(). It loops over the seqlevels and this loop is currently
>>> performed in R, which is slow. Unfortunately this is not the first
>>> time someone complains about this.
>>> 
>>> Time for us to bite the bullet I guess. I've just started to work
>>> on moving the loop to the C-level. That should speed up things
>>> significantly. I'll let you know when this is ready.
>>> 
>>> Thanks for the feedback.
>>> H.
>>> 
>>>> 
>>>> 
>>>> I would appreciate any comments!
>>>> 
>>>> Best,
>>>> Stefanie
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> 
>>> 
>>> --
>>> Hervé Pagès
>>> 
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>> 
>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>> 
>> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319



More information about the Bioconductor mailing list