[BioC] how to obtain base counts from BAM file at specific genomic coordinates

Tue Oct 22 19:01:25 CEST 2013

On 10/22/2013 02:58 AM, Hervé Pagès wrote:
> Hi Kim,
>
> On 10/18/2013 11:04 PM, Kim Siegmund wrote:
>> Dear Herve,
>>
>> I work with Tim Triche Jr. at USC and he told me you were the best person to
>> contact about a particular programming question I have. I would like to
>> estimate the SNP variant frequencies in a BAM file at a list of specific
>> locations. Specifically, I have a file with a list of somatic SNP variant
>> locations, e.g.
>> chr1    31184771
>> chr1    40229367
>> chr1    48699354
>> …. (~500 rows),

If you're interested in nucleotide counts, then ?applyPileups will provide this. 
Martin

>> and a BAM file of whole exon sequencing reads at ~ 40X coverage. I'd like to
>> know the base calls in the BAM file at the specific locations from the file of
>> somatic variant positions.   Is there an easy way to extract this information
>> from the BAM file?  If yes, can I also get the quality score for each base?
>>
>> I am familiar with GenomicRanges & ReadGappedAlignment, but perhaps at only an
>> intermediate level.  I was told this is easy in samtools (is it the mpileup
>> command?) but cannot find any examples showing how this might be done in
>> Bioconductor. Although it seems the package VariantTools must have this
>> capability, I did not see from the vignette how I might do this targeted data
>> summarization.
>
> Maybe you could ask Michael about VariantTools.
>
> Here below is a pure Biostrings + GenomicRanges solution. It uses
> the pileLettersAt() function, which isn't in any of those packages
> yet, but will be included in a package to be added soon to Bioc-devel.
>
> Here is how to use pileLettersAt():
>
> Input:
>
> - A BAM file:
>
>      bamfile <- BamFile(system.file("extdata", "ex1.bam", package="Rsamtools"))
>
>      seqinfo(bamfile)  # to see the seqlevels and seqlengths
>
>      stackStringsFromBam(bamfile, param="seq1:1-21")  # a quick look at
>                                                       # the reads
>
> - A GRanges object containing single nucleotide positions:
>
>      snpos <- GRanges(Rle(c("seq1", "seq2"), c(7, 2)),
>                       IRanges(c(1:5, 21, 1575, 1513:1514), width=1))
>
> Some preliminary massage on 'snpos':
>
>    seqinfo(snpos) <- merge(seqinfo(snpos), seqinfo(bamfile))
>    seqlevels(snpos) <- seqlevelsInUse(snpos)
>
> Load the BAM file in a GAlignments object. We load only the reads
> aligned to the sequences in 'seqlevels(snpos)' and we filter out
> reads not passing quality controls (flag bit 0x200) and PCR or
> optical duplicates (flag bit 0x400). See ?ScanBamParam and the SAM
> Spec for more information.
>
>    which <- as(seqinfo(snpos), "GRanges")
>    flag <- scanBamFlag(isNotPassingQualityControls=FALSE,
>                        isDuplicate=FALSE)
>    what <- c("seq", "qual")
>    param <- ScanBamParam(flag=flag, what=c("seq", "qual"), which=which)
>    gal <- readGAlignmentsFromBam(bamfile, param=param)
>    seqlevels(gal) <- seqlevels(snpos)
>
> Extract the read sequences (a.k.a. query sequences) and quality
> strings. Both are reported with respect to the + strand.
>
>    qseq <- mcols(gal)$seq
>    qual <- mcols(gal)$qual
>
>    nucl_piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal), snpos)
>    qual_piles <- pileLettersAt(qual, seqnames(gal), start(gal), cigar(gal), snpos)
>    mcols(snpos)$nucl_piles <- nucl_piles
>    mcols(snpos)$qual_piles <- qual_piles
>
> Then:
>
>    > snpos
>    GRanges with 9 ranges and 2 metadata columns:
>          seqnames       ranges strand |                  nucl_piles
>             <Rle>    <IRanges>  <Rle> |              <DNAStringSet>
>      [1]     seq1 [   1,    1]      * |                           C
>      [2]     seq1 [   2,    2]      * |                           A
>      [3]     seq1 [   3,    3]      * |                          CC
>      [4]     seq1 [   4,    4]      * |                          TT
>      [5]     seq1 [   5,    5]      * |                         AAA
>      [6]     seq1 [  21,   21]      * |                   TTTTTTTTT
>      [7]     seq1 [1575, 1575]      * |
>      [8]     seq2 [1513, 1513]      * |        GGGGGGGGGGGGGGGGTTGG
>      [9]     seq2 [1514, 1514]      * | TTTTTTTTTTTTTTTTTTTTTTTTTAT
>                           qual_piles
>                         <BStringSet>
>      [1]                           <
>      [2]                           <
>      [3]                          <<
>      [4]                          <<
>      [5]                         <<<
>      [6]                   <<<-<;<<=
>      [7]
>      [8]        <<<<;6:<=4<;16<<'+:.
>      [9] <<<;;=<<=<<<<<<:;'84;;<<;+1
>      ---
>      seqlengths:
>       seq1 seq2
>       1575 1584
>
> Finally, to summarize A/C/G/T frequencies at each position:
>
>    > alphabetFrequency(nucl_piles, baseOnly=TRUE)
>          A C  G  T other
>     [1,] 0 1  0  0     0
>     [2,] 1 0  0  0     0
>     [3,] 0 2  0  0     0
>     [4,] 0 0  0  2     0
>     [5,] 3 0  0  0     0
>     [6,] 0 0  0  9     0
>     [7,] 0 0  0  0     0
>     [8,] 0 0 18  2     0
>     [9,] 1 0  0 26     0
>
> See below for pileLettersAt's code.
>
> Cheers,
> H.
>
>
> -----------------------------------------------------------------------
>
> ### Conceptually equivalent to:
> ###     sapply(x, function(xx) paste(xx, collapse=""))
>
> collapse_XStringSetList <- function(x)
> {
>      unlisted_x <- unlist(x, use.names=FALSE)
>      unlisted_ans <- unlist(unlisted_x, use.names=FALSE)
>      ans_width <- sum(relist(width(unlisted_x), x))
>      extract_at <- as(PartitioningByEnd(cumsum(ans_width)), "IRanges")
>      extractAt(unlisted_ans, extract_at)
> }
>
> ### pileLettersOnSingleRefAt() is the workhorse behind pileLettersAt().
> ### 'x', 'pos', 'cigar': 3 parallel vectors describing N strings aligned
> ### to the same reference sequence. 'x' must be an XStringSet (typically
> ### DNAStringSet) object containing the unaligned strings (a.k.a. the
> ### query sequences) reported with respect to the + strand. 'pos' must
> ### be an integer vector where 'pos[i]' is the 1-based position on the
> ### reference sequence of the first aligned letter in 'x[[i]]'. 'cigar'
> ### must be a character vector containing the extended CIGAR strings.
> ### 'at': must be an integer vector containing single nucleotide
> ### positions on the reference sequence.
> ### Returns an XStringSet (typically DNAStringSet) object parallel to
> ### 'at' (i.e. with 1 string per single nucleotide position).
>
> pileLettersOnSingleRefAt <- function(x, pos, cigar, at)
> {
>      stopifnot(is(x, "XStringSet"))
>      N <- length(x)  # nb of alignments
>      stopifnot(is.integer(pos) && length(pos) == N)
>      stopifnot(is.character(cigar) && length(cigar) == N)
>      stopifnot(is.integer(at))
>
>      ops <- c("M", "=", "X")
>      ranges_on_ref <- cigarRangesAlongReferenceSpace(cigar, pos=pos, ops=ops)
>      ranges_on_query <- cigarRangesAlongQuerySpace(cigar, ops=ops)
>
>      ## 'ranges_on_ref' and 'ranges_on_query' are IRangesList objects parallel
>      ## to 'x', 'pos', and 'cigar'. In addition, the 2 IRangesList objects
>      ## have the same "shape" (i.e. same elementLengths()), so, after
>      ## unlisting, the 2 unlisted objects are parallel IRanges objects.
>      unlisted_ranges_on_ref <- unlist(ranges_on_ref, use.names=FALSE)
>      unlisted_ranges_on_query <- unlist(ranges_on_query, use.names=FALSE)
>
>      ## 2 integer vectors parallel to IRanges objects 'unlisted_ranges_on_ref'
>      ## and 'unlisted_ranges_on_query' above.
>      range_group <- togroup(ranges_on_ref)
>      query2ref_shift <- start(unlisted_ranges_on_ref) -
>                         start(unlisted_ranges_on_query)
>
>      hits <- findOverlaps(at, unlisted_ranges_on_ref)
>      hits_at_in_x <- at[queryHits(hits)] - query2ref_shift[subjectHits(hits)]
>      hits_group <- range_group[subjectHits(hits)]
>      unlisted_piles <- subseq(x[hits_group], start=hits_at_in_x, width=1L)
>      piles_skeleton <- PartitioningByEnd(queryHits(hits), NG=length(at),
>                                         names=names(at))
>      piles <- relist(unlisted_piles, piles_skeleton)
>      collapse_XStringSetList(piles)
> }
>
> ### 'x', 'seqnames', 'pos', 'cigar': 4 parallel vectors describing N
> ### aligned strings. 'x', 'pos', and 'cigar' as above. 'seqnames' must
> ### be a factor-Rle where 'seqnames[i]' is the name of the reference
> ### sequence of the i-th alignment.
> ### 'at': must be a GRanges object containing single nucleotide
> ### positions. 'seqlevels(at)' must be identical to 'levels(seqnames)'.
> ### Returns an XStringSet (typically DNAStringSet) object parallel to
> ### 'at' (i.e. with 1 string per single nucleotide position).
>
> pileLettersAt <- function(x, seqnames, pos, cigar, at)
> {
>      stopifnot(is(x, "XStringSet"))
>      N <- length(x)  # nb of alignments
>      stopifnot(is(seqnames, "Rle"))
>      stopifnot(is.factor(runValue(seqnames)))
>      stopifnot(length(seqnames) == N)
>      stopifnot(is.integer(pos) && length(pos) == N)
>      stopifnot(is.character(cigar) && length(cigar) == N)
>      stopifnot(is(at, "GRanges"))
>      stopifnot(all(width(at) == 1L))
>      stopifnot(identical(seqlevels(at), levels(seqnames)))
>
>      ## We process 1 chromosome at a time. So we start by splitting
>      ## 'x', 'pos', 'cigar', and 'start(at)' by chromosome. The 4
>      ## resulting list-like objects have 1 list element per chromosome
>      ## in 'seqlevels(at)' (or in 'levels(seqnames)', which is identical
>      ## to 'seqlevels(at)').
>      x_by_chrom <- split(x, seqnames)               # XStringSetList
>      pos_by_chrom <- split(pos, seqnames)           # IntegerList
>      cigar_by_chrom <- split(cigar, seqnames)       # CharacterList
>      at_by_chrom <- split(start(at), seqnames(at))  # IntegerList
>
>      ## Unsplit index.
>      split_idx <- unlist(split(seq_along(at), seqnames(at)),
>                          use.names=FALSE)
>      unsplit_idx <- integer(length(at))
>      unsplit_idx[split_idx] <- seq_along(at)
>
>      do.call("c", lapply(seq_along(seqlevels(at)),
>          function(i)
>              pileLettersOnSingleRefAt(
>                  x_by_chrom[[i]],
>                  pos_by_chrom[[i]],
>                  cigar_by_chrom[[i]],
>                  at_by_chrom[[i]])))[unsplit_idx]
> }
>
>
>>
>> Thank you in advance for any advice.
>>
>> Kim Siegmund
>>
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793