[BioC] coverage on very large ReadGappedAlignmentsObject

Wed May 8 23:34:47 CEST 2013

Hi,

On 05/08/2013 08:50 AM, Kasper Daniel Hansen wrote:
> I have observed big slowdowns with GRanges when the number of seqlevels is
> large; not sure if this is related.  My observation was mostly about
> findOverlaps.

Right. It's the same issue here: the "coverage" method for
GenomicRanges objects (which is what is called behind the scene when
calling coverage() on a GappedAlignments object) is looping over the
seqlevels (with an lapply). The overhead of that loop is big:

   ## Only 1 seqlevel:
   gr <- GRanges("chr1", IRanges(35:5, width=10))
   seqlengths(gr) <- 100

   > system.time(cvg1 <- coverage(gr))
      user  system elapsed
     0.016   0.000   0.013

   ## Add 50000 artificial seqlevels:
   fakechroms <- Seqinfo(paste0("fakechr", 1:50000),
                         seqlengths=rep(200, 50000))
   seqinfo(gr) <- merge(seqinfo(gr), fakechroms)

   > system.time(cvg2 <- coverage(gr))
      user  system elapsed
   134.744   0.376 135.417

Moving the loop at the C level would probably help. I will look into
that and report back here when it's ready.

Cheers,
H.

>
>
> On Wed, May 8, 2013 at 11:27 AM, Stefanie Tauber <
> stefanie.tauber at univie.ac.at> wrote:
>
>> To be more specific,
>> it takes about 10 minutes to calculate the coverage of 1 x 10^6 reads on
>> my pc (linux, 4 cpus, 8GB RAM)
>>
>>
>> This seems a bit long to me, what do you think?
>>
>>
>>
>> Am 08.05.2013 um 17:09 schrieb Steve Lianoglou <lianoglou.steve at gene.com>:
>>
>>> Another approach to what Martin suggests would be to simply read in the
>> reads and process them (do the coverage calculation per transcript stuff)
>> one chromosome at a time.
>>>
>>> On Wednesday, May 8, 2013, Martin Morgan wrote:
>>> On 05/08/2013 04:40 AM, Stefanie Tauber wrote:
>>> Dear list,
>>>
>>> I have some human data, mapped with bowtie to the transcriptome and then
>>> further processed with eXpress.
>>> The output of eXpress is a sorted bam file, containing the alignments of
>>> the reads to different transcripts.
>>>
>>> So, the bam file contains about 48 x 10^6 alignments, distributed over
>>> about 50000 transcripts.
>>> I read in the bam file with 'readGappedAlignments'.
>>>
>>>
>>> reads
>>> GappedAlignments with 46765032 alignments and 0 metadata columns:
>>>                                    seqnames strand       cigar    qwidth
>>>                                       <Rle>  <Rle> <character> <integer>
>>>            [1] hg19_ensGene_ENST00000237247      +         75M        75
>>>            [2] hg19_ensGene_ENST00000371036      -         75M        75
>>>            [3] hg19_ensGene_ENST00000371037      -         75M        75
>>>            [4] hg19_ensGene_ENST00000471889      -         75M        75
>>>            [5] hg19_ensGene_ENST00000377479      +         75M        75
>>>            ...                          ...    ...         ...       ...
>>>     [46765028] hg19_ensGene_ENST00000344867      -         75M        75
>>>     [46765029] hg19_ensGene_ENST00000400848      -         75M        75
>>>     [46765030] hg19_ensGene_ENST00000400848      -         75M        75
>>>     [46765031] hg19_ensGene_ENST00000400848      -         75M        75
>>>     [46765032] hg19_ensGene_ENST00000400829      -         75M        75
>>>                    start       end     width      ngap
>>>                <integer> <integer> <integer> <integer>
>>>            [1]      1783      1857        75         0
>>>            [2]       690       764        75         0
>>>            [3]      1632      1706        75         0
>>>            [4]      2121      2195        75         0
>>>            [5]       391       465        75         0
>>>            ...       ...       ...       ...       ...
>>>     [46765028]       280       354        75         0
>>>     [46765029]       509       583        75         0
>>>     [46765030]       509       583        75         0
>>>     [46765031]       509       583        75         0
>>>     [46765032]       179       253        75         0
>>>     ---
>>>     seqlengths:
>>>      hg19_ensGene_ENST00000237247 ... hg19_ensGene_ENST00000400829
>>>                              2580 ...                          999
>>>
>>>
>>>
>>>
>>> Now, for whatever reason, I would like to have the coverage vector for
>>> each transcript.
>>> So, basically I would just compute
>>> coverage(reads).
>>>
>>> This works very well for smaller genomes and experiments (e.g. yeast with
>>> 10 x 10^6 alignments).
>>>
>>> But, using 'coverage' on this large ReadGappedAlignments Object takes
>>> really very very long.
>>>
>>> Hi Stefanie --
>>>
>>> One reason might be that you are on the edge of physical memory
>> availability and you are swapping to disk. A solution is to read in chunks,
>> which turns out to be very easy
>>>
>>>    bam = BamFile(<path/to/bam>, yieldSize = 10000000)
>>>    open(bam)
>>>    cvg = NULL
>>>    while (length(reads <- readGappedAlignments(bam, <...>))) {
>>>        if (cvg == NULL)
>>>            cvg = coverage(reads)
>>>        else cvg = cvg + coverage(reads)
>>>    }
>>>    close(bam)
>>>
>>> This will be relatively efficient; yieldSize can be adjusted, including
>> to a size so that several bam files can be processed in parallel. Start
>> with a small yieldSize to quickly debug the iteration, before processing
>> the whole file. It can be further simplified using coverage,BamFile-method
>> directly, if you're not wanting to do anything else with the gapped
>> alignments.
>>>
>>> Martin
>>>
>>>
>>> Is there a way to do this better (in R)? Or is this just not feasible to
>>> do in R and I should rather preprocess the data outside R?
>>>
>>> I would appreciate any comments,
>>> best,
>>> Stefanie
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>> --
>>> Steve Lianoglou
>>> Computational Biologist
>>> Department of Bioinformatics and Computational Biology
>>> Genentech
>>>
>>
>> DI Stefanie Tauber
>>
>> Center for Integrative Bioinformatics Vienna (CIBIV)
>> (CIBIV is a joint institute of Vienna University and Medical University)
>> Max F. Perutz Laboratories (MFPL)
>> Campus Vienna Biocenter 5 (VBC5), Ebene 1, Room 1812.2
>> Dr. Bohr Gasse 9
>> A-1030 Wien, Austria
>> Phone: ++43 +1 / 42772-4030
>> Fax:     ++43 +1 / 42772-4098
>> email:   stefanie.tauber at univie.ac.at
>> www.cibiv.at
>>
>>
>>          [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319