[BioC] how to obtain base counts from BAM file at specific genomic coordinates

Tue Oct 22 11:58:27 CEST 2013

Hi Kim,

On 10/18/2013 11:04 PM, Kim Siegmund wrote:
> Dear Herve,
>
> I work with Tim Triche Jr. at USC and he told me you were the best person to contact about a particular programming question I have. I would like to estimate the SNP variant frequencies in a BAM file at a list of specific locations. Specifically, I have a file with a list of somatic SNP variant locations, e.g.
> chr1	31184771	
> chr1	40229367
> chr1	48699354
> …. (~500 rows),
> and a BAM file of whole exon sequencing reads at ~ 40X coverage. I'd like to know the base calls in the BAM file at the specific locations from the file of somatic variant positions.   Is there an easy way to extract this information from the BAM file?  If yes, can I also get the quality score for each base?
>
> I am familiar with GenomicRanges & ReadGappedAlignment, but perhaps at only an intermediate level.  I was told this is easy in samtools (is it the mpileup command?) but cannot find any examples showing how this might be done in Bioconductor. Although it seems the package VariantTools must have this capability, I did not see from the vignette how I might do this targeted data summarization.

Maybe you could ask Michael about VariantTools.

Here below is a pure Biostrings + GenomicRanges solution. It uses
the pileLettersAt() function, which isn't in any of those packages
yet, but will be included in a package to be added soon to Bioc-devel.

Here is how to use pileLettersAt():

Input:

- A BAM file:

     bamfile <- BamFile(system.file("extdata", "ex1.bam", 
package="Rsamtools"))

     seqinfo(bamfile)  # to see the seqlevels and seqlengths

     stackStringsFromBam(bamfile, param="seq1:1-21")  # a quick look at
                                                      # the reads

- A GRanges object containing single nucleotide positions:

     snpos <- GRanges(Rle(c("seq1", "seq2"), c(7, 2)),
                      IRanges(c(1:5, 21, 1575, 1513:1514), width=1))

Some preliminary massage on 'snpos':

   seqinfo(snpos) <- merge(seqinfo(snpos), seqinfo(bamfile))
   seqlevels(snpos) <- seqlevelsInUse(snpos)

Load the BAM file in a GAlignments object. We load only the reads
aligned to the sequences in 'seqlevels(snpos)' and we filter out
reads not passing quality controls (flag bit 0x200) and PCR or
optical duplicates (flag bit 0x400). See ?ScanBamParam and the SAM
Spec for more information.

   which <- as(seqinfo(snpos), "GRanges")
   flag <- scanBamFlag(isNotPassingQualityControls=FALSE,
                       isDuplicate=FALSE)
   what <- c("seq", "qual")
   param <- ScanBamParam(flag=flag, what=c("seq", "qual"), which=which)
   gal <- readGAlignmentsFromBam(bamfile, param=param)
   seqlevels(gal) <- seqlevels(snpos)

Extract the read sequences (a.k.a. query sequences) and quality
strings. Both are reported with respect to the + strand.

   qseq <- mcols(gal)$seq
   qual <- mcols(gal)$qual

   nucl_piles <- pileLettersAt(qseq, seqnames(gal), start(gal), 
cigar(gal), snpos)
   qual_piles <- pileLettersAt(qual, seqnames(gal), start(gal), 
cigar(gal), snpos)
   mcols(snpos)$nucl_piles <- nucl_piles
   mcols(snpos)$qual_piles <- qual_piles

Then:

   > snpos
   GRanges with 9 ranges and 2 metadata columns:
         seqnames       ranges strand |                  nucl_piles
            <Rle>    <IRanges>  <Rle> |              <DNAStringSet>
     [1]     seq1 [   1,    1]      * |                           C
     [2]     seq1 [   2,    2]      * |                           A
     [3]     seq1 [   3,    3]      * |                          CC
     [4]     seq1 [   4,    4]      * |                          TT
     [5]     seq1 [   5,    5]      * |                         AAA
     [6]     seq1 [  21,   21]      * |                   TTTTTTTTT
     [7]     seq1 [1575, 1575]      * |
     [8]     seq2 [1513, 1513]      * |        GGGGGGGGGGGGGGGGTTGG
     [9]     seq2 [1514, 1514]      * | TTTTTTTTTTTTTTTTTTTTTTTTTAT
                          qual_piles
                        <BStringSet>
     [1]                           <
     [2]                           <
     [3]                          <<
     [4]                          <<
     [5]                         <<<
     [6]                   <<<-<;<<=
     [7]
     [8]        <<<<;6:<=4<;16<<'+:.
     [9] <<<;;=<<=<<<<<<:;'84;;<<;+1
     ---
     seqlengths:
      seq1 seq2
      1575 1584

Finally, to summarize A/C/G/T frequencies at each position:

   > alphabetFrequency(nucl_piles, baseOnly=TRUE)
         A C  G  T other
    [1,] 0 1  0  0     0
    [2,] 1 0  0  0     0
    [3,] 0 2  0  0     0
    [4,] 0 0  0  2     0
    [5,] 3 0  0  0     0
    [6,] 0 0  0  9     0
    [7,] 0 0  0  0     0
    [8,] 0 0 18  2     0
    [9,] 1 0  0 26     0

See below for pileLettersAt's code.

Cheers,
H.

-----------------------------------------------------------------------

### Conceptually equivalent to:
###     sapply(x, function(xx) paste(xx, collapse=""))

collapse_XStringSetList <- function(x)
{
     unlisted_x <- unlist(x, use.names=FALSE)
     unlisted_ans <- unlist(unlisted_x, use.names=FALSE)
     ans_width <- sum(relist(width(unlisted_x), x))
     extract_at <- as(PartitioningByEnd(cumsum(ans_width)), "IRanges")
     extractAt(unlisted_ans, extract_at)
}

### pileLettersOnSingleRefAt() is the workhorse behind pileLettersAt().
### 'x', 'pos', 'cigar': 3 parallel vectors describing N strings aligned
### to the same reference sequence. 'x' must be an XStringSet (typically
### DNAStringSet) object containing the unaligned strings (a.k.a. the
### query sequences) reported with respect to the + strand. 'pos' must
### be an integer vector where 'pos[i]' is the 1-based position on the
### reference sequence of the first aligned letter in 'x[[i]]'. 'cigar'
### must be a character vector containing the extended CIGAR strings.
### 'at': must be an integer vector containing single nucleotide
### positions on the reference sequence.
### Returns an XStringSet (typically DNAStringSet) object parallel to
### 'at' (i.e. with 1 string per single nucleotide position).

pileLettersOnSingleRefAt <- function(x, pos, cigar, at)
{
     stopifnot(is(x, "XStringSet"))
     N <- length(x)  # nb of alignments
     stopifnot(is.integer(pos) && length(pos) == N)
     stopifnot(is.character(cigar) && length(cigar) == N)
     stopifnot(is.integer(at))

     ops <- c("M", "=", "X")
     ranges_on_ref <- cigarRangesAlongReferenceSpace(cigar, pos=pos, 
ops=ops)
     ranges_on_query <- cigarRangesAlongQuerySpace(cigar, ops=ops)

     ## 'ranges_on_ref' and 'ranges_on_query' are IRangesList objects 
parallel
     ## to 'x', 'pos', and 'cigar'. In addition, the 2 IRangesList objects
     ## have the same "shape" (i.e. same elementLengths()), so, after
     ## unlisting, the 2 unlisted objects are parallel IRanges objects.
     unlisted_ranges_on_ref <- unlist(ranges_on_ref, use.names=FALSE)
     unlisted_ranges_on_query <- unlist(ranges_on_query, use.names=FALSE)

     ## 2 integer vectors parallel to IRanges objects 
'unlisted_ranges_on_ref'
     ## and 'unlisted_ranges_on_query' above.
     range_group <- togroup(ranges_on_ref)
     query2ref_shift <- start(unlisted_ranges_on_ref) -
                        start(unlisted_ranges_on_query)

     hits <- findOverlaps(at, unlisted_ranges_on_ref)
     hits_at_in_x <- at[queryHits(hits)] - 
query2ref_shift[subjectHits(hits)]
     hits_group <- range_group[subjectHits(hits)]
     unlisted_piles <- subseq(x[hits_group], start=hits_at_in_x, width=1L)
     piles_skeleton <- PartitioningByEnd(queryHits(hits), NG=length(at),
                                        names=names(at))
     piles <- relist(unlisted_piles, piles_skeleton)
     collapse_XStringSetList(piles)
}

### 'x', 'seqnames', 'pos', 'cigar': 4 parallel vectors describing N
### aligned strings. 'x', 'pos', and 'cigar' as above. 'seqnames' must
### be a factor-Rle where 'seqnames[i]' is the name of the reference
### sequence of the i-th alignment.
### 'at': must be a GRanges object containing single nucleotide
### positions. 'seqlevels(at)' must be identical to 'levels(seqnames)'.
### Returns an XStringSet (typically DNAStringSet) object parallel to
### 'at' (i.e. with 1 string per single nucleotide position).

pileLettersAt <- function(x, seqnames, pos, cigar, at)
{
     stopifnot(is(x, "XStringSet"))
     N <- length(x)  # nb of alignments
     stopifnot(is(seqnames, "Rle"))
     stopifnot(is.factor(runValue(seqnames)))
     stopifnot(length(seqnames) == N)
     stopifnot(is.integer(pos) && length(pos) == N)
     stopifnot(is.character(cigar) && length(cigar) == N)
     stopifnot(is(at, "GRanges"))
     stopifnot(all(width(at) == 1L))
     stopifnot(identical(seqlevels(at), levels(seqnames)))

     ## We process 1 chromosome at a time. So we start by splitting
     ## 'x', 'pos', 'cigar', and 'start(at)' by chromosome. The 4
     ## resulting list-like objects have 1 list element per chromosome
     ## in 'seqlevels(at)' (or in 'levels(seqnames)', which is identical
     ## to 'seqlevels(at)').
     x_by_chrom <- split(x, seqnames)               # XStringSetList
     pos_by_chrom <- split(pos, seqnames)           # IntegerList
     cigar_by_chrom <- split(cigar, seqnames)       # CharacterList
     at_by_chrom <- split(start(at), seqnames(at))  # IntegerList

     ## Unsplit index.
     split_idx <- unlist(split(seq_along(at), seqnames(at)),
                         use.names=FALSE)
     unsplit_idx <- integer(length(at))
     unsplit_idx[split_idx] <- seq_along(at)

     do.call("c", lapply(seq_along(seqlevels(at)),
         function(i)
             pileLettersOnSingleRefAt(
                 x_by_chrom[[i]],
                 pos_by_chrom[[i]],
                 cigar_by_chrom[[i]],
                 at_by_chrom[[i]])))[unsplit_idx]
}

>
> Thank you in advance for any advice.
>
> Kim Siegmund
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319