[BioC] Computing the coverage on a bamfile with a huge amount of seqlevels

Hervé Pagès hpages at fhcrc.org
Mon Oct 28 19:12:28 CET 2013


Hi Stephanie,

GenomicRanges 1.14.2 (in BioC release) is available since last Friday.

In BioC devel though, other problems got in the way but GenomicRanges
1.15.5 is finally out there (since a couple of hours).

Cheers,
H.


On 10/28/2013 09:24 AM, Michael Lawrence wrote:
> Looks like there's a problem preventing newer GenomicRanges from being
> published. Doesn't look like it's my fault this time ;) Only option is
> to grab from svn right now.
>
> Michael
>
>
> On Mon, Oct 28, 2013 at 7:00 AM, Stefanie Tauber
> <stefanie.tauber at univie.ac.at <mailto:stefanie.tauber at univie.ac.at>> wrote:
>
>     Here is my SessionInfo,
>     GenomicRanges as well as IRanges should be up-to-date.
>
>     R Under development (unstable) (2013-10-15 r64056)
>     Platform: x86_64-unknown-linux-gnu (64-bit)
>
>     locale:
>       [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>       [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>       [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>       [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>       [9] LC_ADDRESS=C               LC_TELEPHONE=C
>     [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
>     attached base packages:
>     [1] parallel  stats     graphics  grDevices utils     datasets  methods
>     [8] base
>
>     other attached packages:
>     [1] Rsamtools_1.15.2       Biostrings_2.31.0      GenomicFeatures_1.15.2
>     [4] AnnotationDbi_1.25.2   Biobase_2.23.1         GenomicRanges_1.15.1
>     [7] XVector_0.3.0          IRanges_1.21.5         BiocGenerics_0.9.0
>
>     loaded via a namespace (and not attached):
>       [1] biomaRt_2.19.0     bitops_1.0-6       BSgenome_1.31.0    DBI_0.2-7
>       [5] RCurl_1.95-4.1     RSQLite_0.11.4     rtracklayer_1.23.1
>     stats4_3.1.0
>       [9] tools_3.1.0        XML_3.98-1.1       zlibbioc_1.9.0
>
>
>     Best,
>     Stefanie
>
>     On Mo, 28.10.2013, 14:56, Michael Lawrence wrote:
>      > Probably need to update GenomicRanges.
>      >
>      >
>      > On Mon, Oct 28, 2013 at 5:11 AM, Stefanie Tauber <
>      > stefanie.tauber at univie.ac.at
>     <mailto:stefanie.tauber at univie.ac.at>> wrote:
>      >
>      >> Hi Hervé,
>      >>
>      >> I just wanted to try the new coverage function:
>      >>
>      >> # I read in a bam file, human data mapped against the transcriptome
>      >> x = readGAlignments(human_trans_bam)
>      >>
>      >> # This is my GAligments object
>      >> > x
>      >> GAlignments with 46765032 alignments and 0 metadata columns:
>      >>                                  seqnames strand       cigar
>       qwidth
>      >>                                     <Rle>  <Rle> <character>
>     <integer>
>      >>          [1] hg19_ensGene_ENST00000237247      +         75M
>         75
>      >>          [2] hg19_ensGene_ENST00000371036      -         75M
>         75
>      >>          [3] hg19_ensGene_ENST00000371037      -         75M
>         75
>      >>          [4] hg19_ensGene_ENST00000471889      -         75M
>         75
>      >>          [5] hg19_ensGene_ENST00000377479      +         75M
>         75
>      >>          ...                          ...    ...         ...
>        ...
>      >>   [46765028] hg19_ensGene_ENST00000344867      -         75M
>         75
>      >>   [46765029] hg19_ensGene_ENST00000400848      -         75M
>         75
>      >>   [46765030] hg19_ensGene_ENST00000400848      -         75M
>         75
>      >>   [46765031] hg19_ensGene_ENST00000400848      -         75M
>         75
>      >>   [46765032] hg19_ensGene_ENST00000400829      -         75M
>         75
>      >>                  start       end     width      ngap
>      >>              <integer> <integer> <integer> <integer>
>      >>          [1]      1783      1857        75         0
>      >>          [2]       690       764        75         0
>      >>          [3]      1632      1706        75         0
>      >>          [4]      2121      2195        75         0
>      >>          [5]       391       465        75         0
>      >>          ...       ...       ...       ...       ...
>      >>   [46765028]       280       354        75         0
>      >>   [46765029]       509       583        75         0
>      >>   [46765030]       509       583        75         0
>      >>   [46765031]       509       583        75         0
>      >>   [46765032]       179       253        75         0
>      >>   ---
>      >>   seqlengths:
>      >>    hg19_ensGene_ENST00000237247 ... hg19_ensGene_ENST00000400829
>      >>                            2580 ...                          999
>      >>
>      >> When I want to compute the coverage, I get the following:
>      >> > cov_x = coverage(x)
>      >>
>      >> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
>      >>   object '.Ranges.coverage' not found
>      >>
>      >> Do you know what I am doing wrong?
>      >>
>      >> Thanks,
>      >> Stefanie
>      >>
>      >>
>      >>
>      >> On Fr, 25.10.2013, 11:44, Hervé Pagès wrote:
>      >> > Hi Stephanie,
>      >> >
>      >> > Just to let you know that in GenomicRanges 1.14.2, coverage()
>     should
>      >> > now be much faster on objects with many seqlevels. Should be about
>      >> > 1000x faster or more if you have 50 000 seqlevels.
>      >> >
>      >> > The only case where it's still slow is if many seqlevels are
>     marked as
>      >> > circular (and have a non-NA seqlength) in seqinfo(x), because for
>      >> > these seqlevels the coverage vector is still post processed by a
>      >> > loop in R. Could be a problem if for example people have BAM files
>      >> > with reads aligned to thousands of bacterial genomes.
>      >> >
>      >> > GenomicRanges 1.14.2 and IRanges 1.20.1 should become
>     available thru
>      >> > biocLite() in the next 24 hours or so. The man pages for the
>      >> "coverage"
>      >> > methods have been improved, especially in IRanges.
>      >> >
>      >> > Cheers,
>      >> > H.
>      >> >
>      >> >
>      >> > On 10/18/2013 12:19 AM, Stefanie Tauber wrote:
>      >> >> Thanks for the answer!
>      >> >> Just wanted to make sure that I don't miss a relevant feature….
>      >> >>
>      >> >> Best,
>      >> >> Stefanie
>      >> >>
>      >> >> Am 17.10.2013 um 22:07 schrieb Hervé Pagès <hpages at fhcrc.org
>     <mailto:hpages at fhcrc.org>
>      >> >> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>>:
>      >> >>
>      >> >>> Hi Stephanie,
>      >> >>>
>      >> >>> On 10/17/2013 07:37 AM, Stefanie Tauber wrote:
>      >> >>>> What is the best way to compute the coverage on a
>     ReadGAlignments
>      >> >>>> object
>      >> >>>> when I have many, many (about 50 000) seqlevels?
>      >> >>>>
>      >> >>>> Just computing coverage(object) is terribly slow due to the
>     huge
>      >> >>>> amount of seqlevels.
>      >> >>>>
>      >> >>>> At the moment, I am subsetting the ReadGAlignments object
>     for each
>      >> >>>> seqlevel,
>      >> >>>>  I omit the redundant seqlevels and calculate the coverage.
>      >> >>>> This is fast if you do it for a few seqlevels.
>      >> >>>> But if one really would like to compute the coverage for each
>      >> >>>> seqlevel individually,
>      >> >>>> this really takes a long time..
>      >> >>>
>      >> >>> Yes this is a known weakness in the current implementation of
>      >> >>> coverage(). It loops over the seqlevels and this loop is
>     currently
>      >> >>> performed in R, which is slow. Unfortunately this is not the
>     first
>      >> >>> time someone complains about this.
>      >> >>>
>      >> >>> Time for us to bite the bullet I guess. I've just started to
>     work
>      >> >>> on moving the loop to the C-level. That should speed up things
>      >> >>> significantly. I'll let you know when this is ready.
>      >> >>>
>      >> >>> Thanks for the feedback.
>      >> >>> H.
>      >> >>>
>      >> >>>>
>      >> >>>>
>      >> >>>> I would appreciate any comments!
>      >> >>>>
>      >> >>>> Best,
>      >> >>>> Stefanie
>      >> >>>>
>      >> >>>> _______________________________________________
>      >> >>>> Bioconductor mailing list
>      >> >>>> Bioconductor at r-project.org
>     <mailto:Bioconductor at r-project.org>
>     <mailto:Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>>
>      >> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>      >> >>>> Search the archives:
>      >> >>>>
>     http://news.gmane.org/gmane.science.biology.informatics.conductor
>      >> >>>>
>      >> >>>
>      >> >>> --
>      >> >>> Hervé Pagès
>      >> >>>
>      >> >>> Program in Computational Biology
>      >> >>> Division of Public Health Sciences
>      >> >>> Fred Hutchinson Cancer Research Center
>      >> >>> 1100 Fairview Ave. N, M1-B514
>      >> >>> P.O. Box 19024
>      >> >>> Seattle, WA 98109-1024
>      >> >>>
>      >> >>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>      >> >>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>      >> >>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>      >>
>      >> _______________________________________________
>      >> Bioconductor mailing list
>      >> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>      >> https://stat.ethz.ch/mailman/listinfo/bioconductor
>      >> Search the archives:
>      >> http://news.gmane.org/gmane.science.biology.informatics.conductor
>      >>
>      >
>
>
>     --
>     _____________________________________________________________
>
>     Stefanie Tauber, PhD
>
>     Center for Integrative Bioinformatics Vienna (CIBIV)
>     Max F. Perutz Laboratories (MFPL)
>     Campus Vienna Biocenter 5 (VBC5), Ebene 1, Room 1812.2
>     Dr. Bohr Gasse 9
>     A-1030 Wien, Austria
>     Phone: ++43 +1 / 42772-4030
>     Fax:   ++43 +1 / 42772-4098
>     email: stefanie.tauber at univie.ac.at
>     <mailto:stefanie.tauber at univie.ac.at>
>
>     _____________________________________________________________
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list