[BioC] Computing the coverage on a bamfile with a huge amount of seqlevels

Mon Oct 28 13:11:51 CET 2013

Hi Hervé,

I just wanted to try the new coverage function:

# I read in a bam file, human data mapped against the transcriptome
x = readGAlignments(human_trans_bam)

# This is my GAligments object
> x
GAlignments with 46765032 alignments and 0 metadata columns:
                                 seqnames strand       cigar    qwidth
                                    <Rle>  <Rle> <character> <integer>
         [1] hg19_ensGene_ENST00000237247      +         75M        75
         [2] hg19_ensGene_ENST00000371036      -         75M        75
         [3] hg19_ensGene_ENST00000371037      -         75M        75
         [4] hg19_ensGene_ENST00000471889      -         75M        75
         [5] hg19_ensGene_ENST00000377479      +         75M        75
         ...                          ...    ...         ...       ...
  [46765028] hg19_ensGene_ENST00000344867      -         75M        75
  [46765029] hg19_ensGene_ENST00000400848      -         75M        75
  [46765030] hg19_ensGene_ENST00000400848      -         75M        75
  [46765031] hg19_ensGene_ENST00000400848      -         75M        75
  [46765032] hg19_ensGene_ENST00000400829      -         75M        75
                 start       end     width      ngap
             <integer> <integer> <integer> <integer>
         [1]      1783      1857        75         0
         [2]       690       764        75         0
         [3]      1632      1706        75         0
         [4]      2121      2195        75         0
         [5]       391       465        75         0
         ...       ...       ...       ...       ...
  [46765028]       280       354        75         0
  [46765029]       509       583        75         0
  [46765030]       509       583        75         0
  [46765031]       509       583        75         0
  [46765032]       179       253        75         0
  ---
  seqlengths:
   hg19_ensGene_ENST00000237247 ... hg19_ensGene_ENST00000400829
                           2580 ...                          999

When I want to compute the coverage, I get the following:
> cov_x = coverage(x)

Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
  object '.Ranges.coverage' not found

Do you know what I am doing wrong?

Thanks,
Stefanie

On Fr, 25.10.2013, 11:44, Hervé Pagès wrote:
> Hi Stephanie,
>
> Just to let you know that in GenomicRanges 1.14.2, coverage() should
> now be much faster on objects with many seqlevels. Should be about
> 1000x faster or more if you have 50 000 seqlevels.
>
> The only case where it's still slow is if many seqlevels are marked as
> circular (and have a non-NA seqlength) in seqinfo(x), because for
> these seqlevels the coverage vector is still post processed by a
> loop in R. Could be a problem if for example people have BAM files
> with reads aligned to thousands of bacterial genomes.
>
> GenomicRanges 1.14.2 and IRanges 1.20.1 should become available thru
> biocLite() in the next 24 hours or so. The man pages for the "coverage"
> methods have been improved, especially in IRanges.
>
> Cheers,
> H.
>
>
> On 10/18/2013 12:19 AM, Stefanie Tauber wrote:
>> Thanks for the answer!
>> Just wanted to make sure that I don't miss a relevant feature….
>>
>> Best,
>> Stefanie
>>
>> Am 17.10.2013 um 22:07 schrieb Hervé Pagès <hpages at fhcrc.org
>> <mailto:hpages at fhcrc.org>>:
>>
>>> Hi Stephanie,
>>>
>>> On 10/17/2013 07:37 AM, Stefanie Tauber wrote:
>>>> What is the best way to compute the coverage on a ReadGAlignments
>>>> object
>>>> when I have many, many (about 50 000) seqlevels?
>>>>
>>>> Just computing coverage(object) is terribly slow due to the huge
>>>> amount of seqlevels.
>>>>
>>>> At the moment, I am subsetting the ReadGAlignments object for each
>>>> seqlevel,
>>>>  I omit the redundant seqlevels and calculate the coverage.
>>>> This is fast if you do it for a few seqlevels.
>>>> But if one really would like to compute the coverage for each
>>>> seqlevel individually,
>>>> this really takes a long time..
>>>
>>> Yes this is a known weakness in the current implementation of
>>> coverage(). It loops over the seqlevels and this loop is currently
>>> performed in R, which is slow. Unfortunately this is not the first
>>> time someone complains about this.
>>>
>>> Time for us to bite the bullet I guess. I've just started to work
>>> on moving the loop to the C-level. That should speed up things
>>> significantly. I'll let you know when this is ready.
>>>
>>> Thanks for the feedback.
>>> H.
>>>
>>>>
>>>>
>>>> I would appreciate any comments!
>>>>
>>>> Best,
>>>> Stefanie
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319