[BioC] [devteam-bioc] Coverage of reads per base position from BAM files

Wed Jul 18 00:07:45 CEST 2012

On 07/17/2012 12:56 PM, Maintainer wrote:
>
> I am trying to obtain per reference base position coverage for a BAM file. The BAM file has been generated by Samtools and contains alignment info for 10 million reads to a 7kb reference plasmid genome. What is the best way to go about this? I have explored the "readBamGappedAlignments" and "readAligned" functions but I am not sure of the way forward from there.
>
> Also should I be using a BAM file or the sorted BAM file with its corresponding index file? I am currently using the later.

For simple depth of coverage follow Michaels' advice wrt 
readGappedAlignments (where 10 million reads might fit into memory w/out 
trouble, so no need to sort / index / query the BAM file though that 
doesn't hurt).

If by 'per reference base position coverage' you mean the number of A, 
C, G, T above each position, you'll want Rsamtools::applyPileups. See 
?applyPileups for a worked example; this isn't as easy as it should be.

Normally, you'll want to use a sorted, indexed BAM file (?sortBam, 
?indexBam) so that you can query just the regions you're interested in.
One oddity of the BAM 'index' is that it indexes intervals on the 
reference genome, rather than say dividing the reads into bins of equal 
numbers. The smallest interval for indexing is 2^14 = 16kb, so in your 
case all your 10 million reads fall in a single bin and any search is 
linear(!) -- the first read at the first position is found fast, the 
last read at the last position only after looking at 9,999,999 other 
reads. If you want to process the whole file then iterating through it 
(via PileupParam yieldSize) works; random access will be terrible.

Very recent samtools can generate more flexible indexes, but I think the 
resolution (with respect to the reference genome) is still relatively 
course and I have not tested files with these indexes in Rsamtools.

Martin

> Thanks much.
>
>
>   -- output of sessionInfo():
>
> R version 2.14.2 (2012-02-29)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] ShortRead_1.12.4    latticeExtra_0.6-19 RColorBrewer_1.0-5
> [4] lattice_0.20-6      Rsamtools_1.6.3     Biostrings_2.22.0
> [7] GenomicRanges_1.6.7 IRanges_1.12.6
>
> loaded via a namespace (and not attached):
>   [1] Biobase_2.14.0     bitops_1.0-4.1     BSgenome_1.22.0    grid_2.14.2
>   [5] hwriter_1.3        RCurl_1.91-1       rtracklayer_1.14.4 tools_2.14.2
>   [9] XML_3.9-4          zlibbioc_1.0.1
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> ________________________________________________________________________
> devteam-bioc mailing list
> To unsubscribe from this mailing list send a blank email to
> devteam-bioc-leave at lists.fhcrc.org
> You can also unsubscribe or change your personal options at
> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793