[BioC] coverage on very large ReadGappedAlignmentsObject

Wed May 8 14:34:32 CEST 2013

On 05/08/2013 04:40 AM, Stefanie Tauber wrote:
> Dear list,
>
> I have some human data, mapped with bowtie to the transcriptome and then
> further processed with eXpress.
> The output of eXpress is a sorted bam file, containing the alignments of
> the reads to different transcripts.
>
> So, the bam file contains about 48 x 10^6 alignments, distributed over
> about 50000 transcripts.
> I read in the bam file with 'readGappedAlignments'.
>
>
>> reads
> GappedAlignments with 46765032 alignments and 0 metadata columns:
>                                   seqnames strand       cigar    qwidth
>                                      <Rle>  <Rle> <character> <integer>
>           [1] hg19_ensGene_ENST00000237247      +         75M        75
>           [2] hg19_ensGene_ENST00000371036      -         75M        75
>           [3] hg19_ensGene_ENST00000371037      -         75M        75
>           [4] hg19_ensGene_ENST00000471889      -         75M        75
>           [5] hg19_ensGene_ENST00000377479      +         75M        75
>           ...                          ...    ...         ...       ...
>    [46765028] hg19_ensGene_ENST00000344867      -         75M        75
>    [46765029] hg19_ensGene_ENST00000400848      -         75M        75
>    [46765030] hg19_ensGene_ENST00000400848      -         75M        75
>    [46765031] hg19_ensGene_ENST00000400848      -         75M        75
>    [46765032] hg19_ensGene_ENST00000400829      -         75M        75
>                   start       end     width      ngap
>               <integer> <integer> <integer> <integer>
>           [1]      1783      1857        75         0
>           [2]       690       764        75         0
>           [3]      1632      1706        75         0
>           [4]      2121      2195        75         0
>           [5]       391       465        75         0
>           ...       ...       ...       ...       ...
>    [46765028]       280       354        75         0
>    [46765029]       509       583        75         0
>    [46765030]       509       583        75         0
>    [46765031]       509       583        75         0
>    [46765032]       179       253        75         0
>    ---
>    seqlengths:
>     hg19_ensGene_ENST00000237247 ... hg19_ensGene_ENST00000400829
>                             2580 ...                          999
>
>
>
>
> Now, for whatever reason, I would like to have the coverage vector for
> each transcript.
> So, basically I would just compute
> coverage(reads).
>
> This works very well for smaller genomes and experiments (e.g. yeast with
> 10 x 10^6 alignments).
>
> But, using 'coverage' on this large ReadGappedAlignments Object takes
> really very very long.

Hi Stefanie --

One reason might be that you are on the edge of physical memory availability and 
you are swapping to disk. A solution is to read in chunks, which turns out to be 
very easy

   bam = BamFile(<path/to/bam>, yieldSize = 10000000)
   open(bam)
   cvg = NULL
   while (length(reads <- readGappedAlignments(bam, <...>))) {
       if (cvg == NULL)
           cvg = coverage(reads)
       else cvg = cvg + coverage(reads)
   }
   close(bam)

This will be relatively efficient; yieldSize can be adjusted, including to a 
size so that several bam files can be processed in parallel. Start with a small 
yieldSize to quickly debug the iteration, before processing the whole file. It 
can be further simplified using coverage,BamFile-method directly, if you're not 
wanting to do anything else with the gapped alignments.

Martin

> Is there a way to do this better (in R)? Or is this just not feasible to
> do in R and I should rather preprocess the data outside R?
>
> I would appreciate any comments,
> best,
> Stefanie
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793