[BioC] Filtering BAM files by start position for VariantTools

Dan Tenenbaum dtenenba at fhcrc.org
Fri Aug 2 19:18:05 CEST 2013


On Fri, Aug 2, 2013 at 7:27 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
> I will fix this today out of necessity to get it building on one of our
> clusters. It's not going to be fun.
>

Thanks! If the solution could work with standard versions of automake
available on Ubuntu 12.04LTS that would be great, and mean that our
build systems would not need any special tweaking.

Dan


>
> On Fri, Aug 2, 2013 at 7:08 AM, Taylor, Sean D <sdtaylor at fhcrc.org> wrote:
>>
>> Martin, it always makes me feel better when I see you scratching your head
>> too! ;)
>>
>> Sean
>>
>> Sent from my iPod
>>
>> On Aug 1, 2013, at 6:02 PM, "Martin Morgan" <mtmorgan at fhcrc.org> wrote:
>>
>> > On 07/16/2013 05:40 PM, Michael Lawrence wrote:
>> >> The necessary update to VariantTools will propagate soon to devel. Or
>> >> you
>> >> can grab it from svn.
>> >
>> > This isn't building in devel; does it require something special with
>> > gmapR? Dan and I looked at this (??) for the conference, but were not quite
>> > able to master the automake foo required to get off the ground.
>> >
>> > Martin
>> >
>> >>
>> >> Michael
>> >>
>> >>
>> >> On Tue, Jul 16, 2013 at 3:27 PM, Taylor, Sean D <sdtaylor at fhcrc.org>
>> >> wrote:
>> >>
>> >>>  In order to work through some of the code, I installed the devel
>> >>> version
>> >>> of R and updated all the packages. Now when I run tallyVariants() I
>> >>> get the
>> >>> following error message:****
>> >>>
>> >>> ** **
>> >>>
>> >>> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : ****
>> >>>
>> >>>   object 'castList' not found****
>> >>>
>> >>> ** **
>> >>>
>> >>>> traceback()****
>> >>>
>> >>> 12: get(name, envir = asNamespace(pkg), inherits = FALSE)****
>> >>>
>> >>> 11: IRanges:::castList****
>> >>>
>> >>> 10: safe_mclapply(ind, function(i, ...) {****
>> >>>
>> >>>         FUN(gr[i], ...)****
>> >>>
>> >>>     }, ...)****
>> >>>
>> >>> 9: applyByChromosome(si, bam_tally_region, mc.cores = mc.cores)****
>> >>>
>> >>> 8: .local(x, ...)****
>> >>>
>> >>> 7: tallyVariants(x, tally.param)****
>> >>>
>> >>> 6: tallyVariants(x, tally.param)****
>> >>>
>> >>> 5: .local(x, ...)****
>> >>>
>> >>> 4: callVariants(BamFile(x), ...)****
>> >>>
>> >>> 3: callVariants(BamFile(x), ...)****
>> >>>
>> >>> 2: callVariants(destination, tally.param)****
>> >>>
>> >>> 1: callVariants(destination, tally.param)****
>> >>>
>> >>> ** **
>> >>>
>> >>> Perhaps this is something missing/changed in the devel version of
>> >>> IRanges?
>> >>> ****
>> >>>
>> >>> ** **
>> >>>
>> >>> Updated sessionInfo() below:****
>> >>>
>> >>> ** **
>> >>>
>> >>> R version 3.0.1 (2013-05-16)****
>> >>>
>> >>> Platform: x86_64-unknown-linux-gnu (64-bit)****
>> >>>
>> >>> ** **
>> >>>
>> >>> locale:****
>> >>>
>> >>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              ****
>> >>>
>> >>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    ****
>> >>>
>> >>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   ****
>> >>>
>> >>>  [7] LC_PAPER=C                 LC_NAME=C                 ****
>> >>>
>> >>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            ****
>> >>>
>> >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       ****
>> >>>
>> >>> ** **
>> >>>
>> >>> attached base packages:****
>> >>>
>> >>> [1] grid      parallel  stats     graphics  grDevices utils
>> >>> datasets *
>> >>> ***
>> >>>
>> >>> [8] methods   base     ****
>> >>>
>> >>> ** **
>> >>>
>> >>> other attached packages:****
>> >>>
>> >>> [1] BiocInstaller_1.11.3     latticeExtra_0.6-24
>> >>> lattice_0.20-15         ****
>> >>>
>> >>>  [4] RColorBrewer_1.0-5       genoPlotR_0.8
>> >>> ade4_1.5-2              ****
>> >>>
>> >>>  [7] gmapR_1.3.2              VariantTools_1.3.2
>> >>> VariantAnnotation_1.7.34****
>> >>>
>> >>> [10] Rsamtools_1.13.24        Biostrings_2.29.13
>> >>> GenomicRanges_1.13.33   ****
>> >>>
>> >>> [13] XVector_0.1.0            IRanges_1.19.18
>> >>> BiocGenerics_0.7.3      ****
>> >>>
>> >>> ** **
>> >>>
>> >>> loaded via a namespace (and not attached):****
>> >>>
>> >>> [1] AnnotationDbi_1.23.16   Biobase_2.20.0          biomaRt_2.17.2
>> >>> ****
>> >>>
>> >>>  [4] bitops_1.0-5            BSgenome_1.28.0         DBI_0.2-7
>> >>>      ****
>> >>>
>> >>>  [7] GenomicFeatures_1.13.19 graph_1.39.3
>> >>> Matrix_1.0-12          ****
>> >>>
>> >>> [10] RBGL_1.37.2             RCurl_1.95-4.1
>> >>> RSQLite_0.11.4         ****
>> >>>
>> >>> [13] rtracklayer_1.20.2      stats4_3.0.1
>> >>> tools_3.0.1            ****
>> >>>
>> >>> [16] XML_3.96-1.1            zlibbioc_1.6.0         ****
>> >>>
>> >>>> ** **
>> >>>
>> >>> ** **
>> >>>
>> >>> Thanks again,****
>> >>>
>> >>> Sean****
>> >>>
>> >>> ** **
>> >>>
>> >>> *From:* Michael Lawrence [mailto:lawrence.michael at gene.com]
>> >>> *Sent:* Saturday, July 13, 2013 1:59 PM
>> >>> *To:* Taylor, Sean D
>> >>> *Cc:* Michael Lawrence; Pages, Herve; Obenchain, Valerie J;
>> >>> bioconductor at r-project.org
>> >>>
>> >>> *Subject:* Re: [BioC] Filtering BAM files by start position for
>> >>> VariantTools****
>> >>>
>> >>>  ** **
>> >>>
>> >>> ** **
>> >>>
>> >>> ** **
>> >>>
>> >>> On Sat, Jul 13, 2013 at 1:50 PM, Taylor, Sean D <sdtaylor at fhcrc.org>
>> >>> wrote:****
>> >>>
>> >>> Thanks Michael, ****
>> >>>
>> >>> This is an interesting idea. Usually, we resolve PCR/optical
>> >>> duplicates
>> >>> with the Picard MarkDuplicates command, which just chooses the read
>> >>> with
>> >>> the highest average quality. This approximates what you want, I think.
>> >>> ***
>> >>> *
>> >>>
>> >>> Is the Picard MarkDuplicates command run by default? Or is this a
>> >>> separate
>> >>> command from a different package?****
>> >>>
>> >>> ** **
>> >>>
>> >>> It's a separate package, basically the Java implementation of
>> >>> samtools;
>> >>> you might that first and see how it improves your error rates, before
>> >>> taking a more complicated approach.****
>> >>>
>> >>>     Also, I find it hard to believe that sequence errors would be
>> >>> causing
>> >>> you trouble at 10%, as long as you're filtering for quality and have
>> >>> decent
>> >>> coverage. PCR might in limited cases, and Picard effectively takes
>> >>> care of
>> >>> that.****
>> >>>
>> >>> 10% is fine, but we have problems at 1% and lower. I think even the
>> >>> defaults in tallyVariants() use 1% as a cutoff if I’m not
>> >>> mistaken.****
>> >>>
>> >>>  ****
>> >>>
>> >>>  ** **
>> >>>
>> >>> It uses a ~4% cutoff. And yes, you'll start running into issues around
>> >>> 1%.
>> >>> Calling at such a low frequency is kind of ambitious.****
>> >>>
>> >>>  ****
>> >>>
>> >>>     What you want is to specify the cycleBreaks argument to
>> >>> VariantTallyParam. It allows you to define bins by cycle. So if you
>> >>> made
>> >>> bins for all 100bp, you could do things like generate a consensus by
>> >>> read
>> >>> family. A family corresponding to position X would be cycle bin #1 at
>> >>> X,
>> >>> cycle bin #2 at X+1, etc. Then just pick the alt (or ref) with the
>> >>> highest
>> >>> count. Might be tricky to implement efficiently for the whole genome.
>> >>> Of
>> >>> course, this assumes you've got single end reads, otherwise this won't
>> >>> work, because the mate is not considered.****
>> >>>
>> >>> Cool, I will try that out, that may be just what I was looking for. It
>> >>> might be tricky for the whole genome, but we are restricting ourselves
>> >>> to
>> >>> just the mitochondrial genome. Still gives us several thousand start
>> >>> sites
>> >>> to work from but should be easier than the whole genome. As for single
>> >>> end
>> >>> reads, can I just filter on the strand (i.e. ‘+’ or ‘-‘)? Or will I
>> >>> have to
>> >>> perform separate alignments for each of the paired reads?****
>> >>>
>> >>>  ****
>> >>>
>> >>>  ** **
>> >>>
>> >>> Now that I think about it, you should be fine just considering the
>> >>> alignments individually (as if they were unpaired).****
>> >>>
>> >>>  ****
>> >>>
>> >>> ** **
>> >>
>> >>    [[alternative HTML version deleted]]
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> Bioconductor mailing list
>> >> Bioconductor at r-project.org
>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >> Search the archives:
>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >
>> >
>> > --
>> > Computational Biology / Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Ave. N.
>> > PO Box 19024 Seattle, WA 98109
>> >
>> > Location: Arnold Building M1 B861
>> > Phone: (206) 667-2793
>
>



More information about the Bioconductor mailing list