[BioC] Filtering BAM files by start position for VariantTools

Taylor, Sean D sdtaylor at fhcrc.org
Fri Jul 12 08:17:50 CEST 2013


Hi all, 

Thanks for lots of great input and ideas. You have given me a lot to try out. In answer to Herve, here is my ultimate goal:
I am trying to look at the dynamics of some specific point mutations. In my control set that I am working with, the point mutations I am interested in exist with known frequencies from 0.1 - 50%. One would think that with NGS deep coverage you could easily pick out low frequency events, but the PCR and sequencing error rates are high enough to effectively mask things below 10%. We are trying out an approach to try to reduce this noise. If your library is prepared through nextera (which generates random fragments followed by a few rounds of PCR) and provided that you have a large enough genome, then it is likely that reads that align to the same start position are descendant from the same template fragment and should have identical sequence. By analyzing these 'families' individually we hope to eliminate a lot of noise by discarding variants that don't exceed a certain threshold within their family. Hence I want to sort the reads by start position into these 'families', tally the variants in the families, filter out the noise (and possibly collapse the family into a single consensus sequence), and then do a global variant tally on what's left.

VariantTools sort of addresses this problem by reporting the number of 'cycles' supporting each variant. The number of cycles should be analogous to the number of families, but I couldn't see that I could use this to do the sort of thing I want to do  beyond filtering out things that don't occur in many families. Which isn't quite what I had in mind. I will have to explore Michael's suggestion about binning further. Is your suggestion to set something like
	 which<-GRanges(seqnames=c('foo'), IRanges(1,1))
	 tally.param <- VariantTallyParam(gmapR::TP53Genome(),
			  readlen = 100L,
			 high_base_quality = 23L,
			 which = which)
	 raw.variants <- tallyVariants(bam, tally.param)
It seems thought that this gave me all reads that overlap that position, not just the ones that started at this position. I'm afraid I'm still learning the ins and outs of the package, so I'm not quite sure how to accomplish what you suggested.

Herve, it looks like your code does allow me to impose the cigar alignment on the sequences, which is something that I have really wanted to be able to do for a while now so I'm excited to try this out.  I'm intrigued by your solution and it seems similar to my first approach which was to read the bam file in as a GappedAlignments object and then split it by start position. I wonder though if I will end up with the same problem that I originally ran into, which is that (correct me if I'm wrong) the tallyVariants function does not accept GappedAlignments objects as a valid input, but rather wants a BAM file.  I suppose I could try using something like a pileup function and sort through it that way, but it would be handy to be able to use tallyVariants as it already does what I want if I can just get the input right. On the other hand, if I end up needing to collapse the families into a single consensus, then this might be the preferable route as I can directly manipulate the whole sequence here.

Valerie, thanks for the follow-up codes. Mclapply() was a good thought. I will play around with your Map() suggestion and see if I can get it to work. Hopefully some combination of all of this will get me to where I want to go.

Thanks!
Sean




-----Original Message-----
From: Hervé Pagès [mailto:hpages at fhcrc.org] 
Sent: Thursday, July 11, 2013 5:15 PM
To: Taylor, Sean D
Cc: Obenchain, Valerie J; Michael Lawrence; bioconductor at r-project.org
Subject: Re: [BioC] Filtering BAM files by start position for VariantTools

Hi Taylor,

Not clear to me what you are trying to do exactly. FWIW here is some code that will group your read sequences (and qualities) by unique start position on the genome. It uses sequenceLayer() from the Rsamtools package (was added to devel recently). The code below has some similarities with the one I sent in that thread in June:

   https://stat.ethz.ch/pipermail/bioconductor/2013-June/053324.html

(1) Load the BAM file into a GAlignments object.

     library(Rsamtools)
     bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
     param <- ScanBamParam(what=c("seq", "qual"))
     gal <- readGAlignmentsFromBam(bamfile, use.names=TRUE, param=param)

(2) Use sequenceLayer() to "lay" the query sequences and quality strings
     on the reference.

     qseq <- setNames(mcols(gal)$seq, names(gal))
     qual <- setNames(mcols(gal)$qual, names(gal))
     qseq_on_ref <- sequenceLayer(qseq, cigar(gal),
                                  from="query", to="reference")
     qual_on_ref <- sequenceLayer(qual, cigar(gal),
                                  from="query", to="reference")

(3) Split by chromosome.

     qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal))
     qual_on_ref_by_chrom <- splitAsList(qual_on_ref, seqnames(gal))
     pos_by_chrom <- splitAsList(start(gal), seqnames(gal))

(4) For each chromosome generate one GRanges object that contains
     unique alignment start positions and attach 3 metadata columns
     to it: the number of reads, the query sequences, and the quality
     strings.

     gr_by_chrom <- lapply(seqlevels(gal),
       function(seqname)
       {
         qseq_on_ref2 <- qseq_on_ref_by_chrom[[seqname]]
         qual_on_ref2 <- qual_on_ref_by_chrom[[seqname]]
         pos2 <- pos_by_chrom[[seqname]]
         qseq_on_ref_per_pos <- split(qseq_on_ref2, pos2)
         qual_on_ref_per_pos <- split(qual_on_ref2, pos2)
         nread <- elementLengths(qseq_on_ref_per_pos)
         gr_mcols <- DataFrame(nread=unname(nread),
                               qseq_on_ref=unname(qseq_on_ref_per_pos),
                               qual_on_ref=unname(qual_on_ref_per_pos))
         gr <- GRanges(Rle(seqname, nrow(gr_mcols)),
                       IRanges(as.integer(names(nread)), width=1))
         mcols(gr) <- gr_mcols
         seqlevels(gr) <- seqlevels(gal)
         gr
       })

(5) Combine all the GRanges objects obtained in (4) in 1 big GRanges
     object:

     gr <- do.call(c, gr_by_chrom)
     seqinfo(gr) <- seqinfo(gal)

'gr' is a GRanges object that contains unique alignment start positions:

   > gr[1:6]
   GRanges with 6 ranges and 3 metadata columns:
         seqnames    ranges strand |     nread
            <Rle> <IRanges>  <Rle> | <integer>
     [1]     seq1  [ 1,  1]      * |         1
     [2]     seq1  [ 3,  3]      * |         1
     [3]     seq1  [ 5,  5]      * |         1
     [4]     seq1  [ 6,  6]      * |         1
     [5]     seq1  [ 9,  9]      * |         1
     [6]     seq1  [13, 13]      * |         2
 
qseq_on_ref
 
<DNAStringSetList>
     [1]
CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
     [2]
CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
     [3]
AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
     [4]
GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
     [5]
GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
     [6]
ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA,ATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAG
 
qual_on_ref
 
<BStringSetList>
     [1]
<<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7
     [2]
<<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6):
     [3]
<<<<<<<<<<<7;71<<;<;;<7;<<3;);3*8/5
     [4]
(-&----,----)-)-),'--)---',+-,),''*,
     [5]
<<<<<<<<<<<<<<<;<;7<<<<<<<<7<<;:<5%
     [6]
<<<<<<<<;<<<8<<<<<;8:;6/686&;(16666,<<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;<;<0;
     ---
     seqlengths:
      seq1 seq2
      1575 1584

2 reads align to start position 13. Let's have a close look at their
sequences:

   > mcols(gr)$qseq_on_ref[[6]]
     A DNAStringSet instance of length 2
       width seq                                               names
   [1]    35 ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA 
EAS56_61:6:18:467...
   [2]    36 ATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAG 
EAS114_28:5:296:3...

and their qualities:

   > mcols(gr)$qual_on_ref[[6]]
     A BStringSet instance of length 2
       width seq                                               names
   [1]    35 <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666 
EAS56_61:6:18:467...
   [2]    36 <<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;<;<0; 
EAS114_28:5:296:3...

Note that the sequence and quality strings are those projected to the reference so the first letter in those strings are on top of start position 13, the 2nd letter on top of position 14, etc...

Not sure what you want to do from here though...

H.

On 07/11/2013 10:27 AM, Taylor, Sean D wrote:
> Thanks Valerie, I'll give that a try. And I see that you just clarified my question about the input file for tallyVariants.
>
> I had thought about doing this before but hadn't been able to figure out the proper filter syntax. Thanks!  The only downside to this approach is that I ultimately want to do this for all unique start sites that align to my reference. That could entail generating thousands of tempfiles that all have to be read back in. It could be done with lapply, but it sounds rather memory and time intensive. Another approach that I tried was reading the bam file as a GappedAlignments object and then splitting it by start position into a GAlignmentsList. That was really easy, but so far as I can tell tallyVariants will not accept GappedAlignments objects as an input. I wonder if there is another way to vectorize this approach?
>
> Thanks,
> Sean
>
>
> -----Original Message-----
> From: Valerie Obenchain [mailto:vobencha at fhcrc.org]
> Sent: Thursday, July 11, 2013 10:02 AM
> To: Taylor, Sean D
> Cc: bioconductor at r-project.org; Michael Lawrence
> Subject: Re: [BioC] Filtering BAM files by start position for 
> VariantTools
>
> Hi Sean,
>
> As you've discovered, the 'which' in the 'param' (for reading bam files) specifies positions to overlap, not start or end. One approach to isolating reads with a specific starting position would be to filter the bam by 'pos'.
>
>       library(VariantTools)
>       fl <- LungCancerLines::LungCancerBamFiles()$H1993
>
>       mystart <- 1110426
>       filt <- list(setStart=function(x) x$pos %in% mystart)
>       dest <- tempfile()
>       filterBam(fl, dest, filter=FilterRules(filt))
>       scn <- scanBam(dest)
>
> Confirm all reads start with 'mystart':
>
>   > table(scn[[1]]$pos)
>
> 1110426
>      2388
>
> If you want a tally of all nucleotides for all sequences starting with 'mystart' then no need to supply a 'which':
>       param <- VariantTallyParam(gmapR::TP53Genome(),
>                                  readlen=100L,
>                                  high_base_quality=23L)
>       tally <- tallyVariants(fl, param)
>
>
> Valerie
>
>
> On 07/09/2013 02:06 PM, Taylor, Sean D wrote:
>> I am trying to read a specific set of records from a bam file for use in the VariantTools package. I'm trying to construct a which argument (a GRanges object) that will pull in a set of records from all reads that only start at a specified position. (i.e. all reads that start at position 100). So far I have only been able to specify reads that overlap position 100, but have not been able to find a way to define the start site.
>>
>> #Example code:
>>> bams <- LungCancerLines::LungCancerBamFiles()
>>> bam <- bams$H1993
>>> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+') 
>>> tally.param <- VariantTallyParam(gmapR::TP53Genome(),
>> + readlen = 100L,
>> + high_base_quality = 23L,
>> + which = which)
>>> raw.variants <- tallyVariants(bam, tally.param)
>>
>> This code shows all the variants at position 1110426, but not all the variants from the reads that start at position 1110426.
>>
>> Ultimately, I am trying to do this for all start positions in my data set, so I would want something that looks like this pseudocode:
>>> raw.variants<-lapply (start(bam), function (x){
>>     which<-GRanges(seqnames=c('chrM'), '+', start=x)
>>     tally.param<-VariantTallyParam(gmap, readlen=100L, which=which)
>>     tallyVariants(bamfile, tally.param)
>> })
>>
>> Thanks,
>> Sean
>>
>>> sessionInfo()
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
>>    [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>    [7] LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C
>> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] grid      parallel  stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] RSQLite_0.11.4          DBI_0.2-7               BiocInstaller_1.10.2
>>    [4] LungCancerLines_0.0.8   GenomicFeatures_1.12.1  AnnotationDbi_1.22.5
>>    [7] Biobase_2.20.0          gmapR_1.2.0             latticeExtra_0.6-24
>> [10] lattice_0.20-15         RColorBrewer_1.0-5      genoPlotR_0.8
>> [13] ade4_1.5-2              VariantTools_1.2.2      VariantAnnotation_1.6.6
>> [16] Rsamtools_1.12.3        Biostrings_2.28.0       GenomicRanges_1.12.4
>> [19] IRanges_1.18.1          BiocGenerics_0.6.0
>>
>> loaded via a namespace (and not attached):
>> [1] biomaRt_2.16.0                          bitops_1.0-5
>>    [3] BSgenome_1.28.0                         BSgenome.Hsapiens.UCSC.hg19_1.3.19
>>    [5] graph_1.38.2                            Matrix_1.0-12
>>    [7] org.Hs.eg.db_2.9.0                      RBGL_1.36.2
>>    [9] RCurl_1.95-4.1                          rtracklayer_1.20.2
>> [11] stats4_3.0.1                            tools_3.0.1
>> [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1 [15] 
>> zlibbioc_1.6.0
>>
>>
>> 	[[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list