[BioC] how to obtain base counts from BAM file at specific genomic coordinates
Martin Morgan
mtmorgan at fhcrc.org
Tue Oct 22 19:01:25 CEST 2013
On 10/22/2013 02:58 AM, Hervé Pagès wrote:
> Hi Kim,
>
> On 10/18/2013 11:04 PM, Kim Siegmund wrote:
>> Dear Herve,
>>
>> I work with Tim Triche Jr. at USC and he told me you were the best person to
>> contact about a particular programming question I have. I would like to
>> estimate the SNP variant frequencies in a BAM file at a list of specific
>> locations. Specifically, I have a file with a list of somatic SNP variant
>> locations, e.g.
>> chr1 31184771
>> chr1 40229367
>> chr1 48699354
>> …. (~500 rows),
If you're interested in nucleotide counts, then ?applyPileups will provide this.
Martin
>> and a BAM file of whole exon sequencing reads at ~ 40X coverage. I'd like to
>> know the base calls in the BAM file at the specific locations from the file of
>> somatic variant positions. Is there an easy way to extract this information
>> from the BAM file? If yes, can I also get the quality score for each base?
>>
>> I am familiar with GenomicRanges & ReadGappedAlignment, but perhaps at only an
>> intermediate level. I was told this is easy in samtools (is it the mpileup
>> command?) but cannot find any examples showing how this might be done in
>> Bioconductor. Although it seems the package VariantTools must have this
>> capability, I did not see from the vignette how I might do this targeted data
>> summarization.
>
> Maybe you could ask Michael about VariantTools.
>
> Here below is a pure Biostrings + GenomicRanges solution. It uses
> the pileLettersAt() function, which isn't in any of those packages
> yet, but will be included in a package to be added soon to Bioc-devel.
>
> Here is how to use pileLettersAt():
>
> Input:
>
> - A BAM file:
>
> bamfile <- BamFile(system.file("extdata", "ex1.bam", package="Rsamtools"))
>
> seqinfo(bamfile) # to see the seqlevels and seqlengths
>
> stackStringsFromBam(bamfile, param="seq1:1-21") # a quick look at
> # the reads
>
> - A GRanges object containing single nucleotide positions:
>
> snpos <- GRanges(Rle(c("seq1", "seq2"), c(7, 2)),
> IRanges(c(1:5, 21, 1575, 1513:1514), width=1))
>
> Some preliminary massage on 'snpos':
>
> seqinfo(snpos) <- merge(seqinfo(snpos), seqinfo(bamfile))
> seqlevels(snpos) <- seqlevelsInUse(snpos)
>
> Load the BAM file in a GAlignments object. We load only the reads
> aligned to the sequences in 'seqlevels(snpos)' and we filter out
> reads not passing quality controls (flag bit 0x200) and PCR or
> optical duplicates (flag bit 0x400). See ?ScanBamParam and the SAM
> Spec for more information.
>
> which <- as(seqinfo(snpos), "GRanges")
> flag <- scanBamFlag(isNotPassingQualityControls=FALSE,
> isDuplicate=FALSE)
> what <- c("seq", "qual")
> param <- ScanBamParam(flag=flag, what=c("seq", "qual"), which=which)
> gal <- readGAlignmentsFromBam(bamfile, param=param)
> seqlevels(gal) <- seqlevels(snpos)
>
> Extract the read sequences (a.k.a. query sequences) and quality
> strings. Both are reported with respect to the + strand.
>
> qseq <- mcols(gal)$seq
> qual <- mcols(gal)$qual
>
> nucl_piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal), snpos)
> qual_piles <- pileLettersAt(qual, seqnames(gal), start(gal), cigar(gal), snpos)
> mcols(snpos)$nucl_piles <- nucl_piles
> mcols(snpos)$qual_piles <- qual_piles
>
> Then:
>
> > snpos
> GRanges with 9 ranges and 2 metadata columns:
> seqnames ranges strand | nucl_piles
> <Rle> <IRanges> <Rle> | <DNAStringSet>
> [1] seq1 [ 1, 1] * | C
> [2] seq1 [ 2, 2] * | A
> [3] seq1 [ 3, 3] * | CC
> [4] seq1 [ 4, 4] * | TT
> [5] seq1 [ 5, 5] * | AAA
> [6] seq1 [ 21, 21] * | TTTTTTTTT
> [7] seq1 [1575, 1575] * |
> [8] seq2 [1513, 1513] * | GGGGGGGGGGGGGGGGTTGG
> [9] seq2 [1514, 1514] * | TTTTTTTTTTTTTTTTTTTTTTTTTAT
> qual_piles
> <BStringSet>
> [1] <
> [2] <
> [3] <<
> [4] <<
> [5] <<<
> [6] <<<-<;<<=
> [7]
> [8] <<<<;6:<=4<;16<<'+:.
> [9] <<<;;=<<=<<<<<<:;'84;;<<;+1
> ---
> seqlengths:
> seq1 seq2
> 1575 1584
>
> Finally, to summarize A/C/G/T frequencies at each position:
>
> > alphabetFrequency(nucl_piles, baseOnly=TRUE)
> A C G T other
> [1,] 0 1 0 0 0
> [2,] 1 0 0 0 0
> [3,] 0 2 0 0 0
> [4,] 0 0 0 2 0
> [5,] 3 0 0 0 0
> [6,] 0 0 0 9 0
> [7,] 0 0 0 0 0
> [8,] 0 0 18 2 0
> [9,] 1 0 0 26 0
>
> See below for pileLettersAt's code.
>
> Cheers,
> H.
>
>
> -----------------------------------------------------------------------
>
> ### Conceptually equivalent to:
> ### sapply(x, function(xx) paste(xx, collapse=""))
>
> collapse_XStringSetList <- function(x)
> {
> unlisted_x <- unlist(x, use.names=FALSE)
> unlisted_ans <- unlist(unlisted_x, use.names=FALSE)
> ans_width <- sum(relist(width(unlisted_x), x))
> extract_at <- as(PartitioningByEnd(cumsum(ans_width)), "IRanges")
> extractAt(unlisted_ans, extract_at)
> }
>
> ### pileLettersOnSingleRefAt() is the workhorse behind pileLettersAt().
> ### 'x', 'pos', 'cigar': 3 parallel vectors describing N strings aligned
> ### to the same reference sequence. 'x' must be an XStringSet (typically
> ### DNAStringSet) object containing the unaligned strings (a.k.a. the
> ### query sequences) reported with respect to the + strand. 'pos' must
> ### be an integer vector where 'pos[i]' is the 1-based position on the
> ### reference sequence of the first aligned letter in 'x[[i]]'. 'cigar'
> ### must be a character vector containing the extended CIGAR strings.
> ### 'at': must be an integer vector containing single nucleotide
> ### positions on the reference sequence.
> ### Returns an XStringSet (typically DNAStringSet) object parallel to
> ### 'at' (i.e. with 1 string per single nucleotide position).
>
> pileLettersOnSingleRefAt <- function(x, pos, cigar, at)
> {
> stopifnot(is(x, "XStringSet"))
> N <- length(x) # nb of alignments
> stopifnot(is.integer(pos) && length(pos) == N)
> stopifnot(is.character(cigar) && length(cigar) == N)
> stopifnot(is.integer(at))
>
> ops <- c("M", "=", "X")
> ranges_on_ref <- cigarRangesAlongReferenceSpace(cigar, pos=pos, ops=ops)
> ranges_on_query <- cigarRangesAlongQuerySpace(cigar, ops=ops)
>
> ## 'ranges_on_ref' and 'ranges_on_query' are IRangesList objects parallel
> ## to 'x', 'pos', and 'cigar'. In addition, the 2 IRangesList objects
> ## have the same "shape" (i.e. same elementLengths()), so, after
> ## unlisting, the 2 unlisted objects are parallel IRanges objects.
> unlisted_ranges_on_ref <- unlist(ranges_on_ref, use.names=FALSE)
> unlisted_ranges_on_query <- unlist(ranges_on_query, use.names=FALSE)
>
> ## 2 integer vectors parallel to IRanges objects 'unlisted_ranges_on_ref'
> ## and 'unlisted_ranges_on_query' above.
> range_group <- togroup(ranges_on_ref)
> query2ref_shift <- start(unlisted_ranges_on_ref) -
> start(unlisted_ranges_on_query)
>
> hits <- findOverlaps(at, unlisted_ranges_on_ref)
> hits_at_in_x <- at[queryHits(hits)] - query2ref_shift[subjectHits(hits)]
> hits_group <- range_group[subjectHits(hits)]
> unlisted_piles <- subseq(x[hits_group], start=hits_at_in_x, width=1L)
> piles_skeleton <- PartitioningByEnd(queryHits(hits), NG=length(at),
> names=names(at))
> piles <- relist(unlisted_piles, piles_skeleton)
> collapse_XStringSetList(piles)
> }
>
> ### 'x', 'seqnames', 'pos', 'cigar': 4 parallel vectors describing N
> ### aligned strings. 'x', 'pos', and 'cigar' as above. 'seqnames' must
> ### be a factor-Rle where 'seqnames[i]' is the name of the reference
> ### sequence of the i-th alignment.
> ### 'at': must be a GRanges object containing single nucleotide
> ### positions. 'seqlevels(at)' must be identical to 'levels(seqnames)'.
> ### Returns an XStringSet (typically DNAStringSet) object parallel to
> ### 'at' (i.e. with 1 string per single nucleotide position).
>
> pileLettersAt <- function(x, seqnames, pos, cigar, at)
> {
> stopifnot(is(x, "XStringSet"))
> N <- length(x) # nb of alignments
> stopifnot(is(seqnames, "Rle"))
> stopifnot(is.factor(runValue(seqnames)))
> stopifnot(length(seqnames) == N)
> stopifnot(is.integer(pos) && length(pos) == N)
> stopifnot(is.character(cigar) && length(cigar) == N)
> stopifnot(is(at, "GRanges"))
> stopifnot(all(width(at) == 1L))
> stopifnot(identical(seqlevels(at), levels(seqnames)))
>
> ## We process 1 chromosome at a time. So we start by splitting
> ## 'x', 'pos', 'cigar', and 'start(at)' by chromosome. The 4
> ## resulting list-like objects have 1 list element per chromosome
> ## in 'seqlevels(at)' (or in 'levels(seqnames)', which is identical
> ## to 'seqlevels(at)').
> x_by_chrom <- split(x, seqnames) # XStringSetList
> pos_by_chrom <- split(pos, seqnames) # IntegerList
> cigar_by_chrom <- split(cigar, seqnames) # CharacterList
> at_by_chrom <- split(start(at), seqnames(at)) # IntegerList
>
> ## Unsplit index.
> split_idx <- unlist(split(seq_along(at), seqnames(at)),
> use.names=FALSE)
> unsplit_idx <- integer(length(at))
> unsplit_idx[split_idx] <- seq_along(at)
>
> do.call("c", lapply(seq_along(seqlevels(at)),
> function(i)
> pileLettersOnSingleRefAt(
> x_by_chrom[[i]],
> pos_by_chrom[[i]],
> cigar_by_chrom[[i]],
> at_by_chrom[[i]])))[unsplit_idx]
> }
>
>
>>
>> Thank you in advance for any advice.
>>
>> Kim Siegmund
>>
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list