[BioC] Complete variant toolbox: gmapR/VariantTools/VariantAnnotation

Thu Dec 12 21:35:27 CET 2013

Thanks for the detail on the summary functions. I agree these would be 
useful to have. I'll put this on the TODO for this dev cycle.

Thanks.
Valerie

On 12/10/2013 09:09 AM, Thomas Girke wrote:
> Hi Valerie,
>
> Adding a 'REFLOC' column to the output of locateVariants() would address
> this need. Thanks for looking into this.
>
> As for the need for a summary_var_report IN ADDITION TO a
> complete_var_report, the primitive approach, used to create the results
> shown on the slides, is here:
> http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_12_16_2013/Rvarseq/Rvarseq_Fct.R
> Right now this is just a pointer to show students how this could be done
> rather than something I would consider even remotely a finished solution
> for a package. To achieve the latter, one definitely should look into
> how to get rid of some of the tapply steps. As an expert on the VCF and
> related classes you might have much more elegant and efficient solutions
> to this? Also, to address some of Julian's concerns related to ambiguous
> annotations, in case of overlapping genes one would append/prepend (but
> only for those) the GENEID to the annotation feature names, e.g.
> coding_GENE1__coding_GENE2. The result will end up being a gene-centric
> rather than a transcript-centric report, meaning we are losing the
> assignment to specific transcript variants. In 90% of the use cases of
> our discovery oriented VAR-Seq projects, gene resolution is sufficient
> here (e.g. supplement tables for publications or grant applications). If
> transcript resolution is needed then users are usually happy to look the
> results up in the complete variant report. Alternatively, one could
> easily do the same on the transcript level, but here a summary report
> may quickly become too complex to be useful for practitioners. Perhaps a
> well designed Var Summary Report function would include a summary_mode
> argument where the user could decide whether to output a gene- or
> transcript-centric summary_var_report.
>
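A minimal base R sketch of the summary_mode idea described above. The helper
name summarize_variants() and the toy table (columns VARID, GENEID, TXID,
LOCATION) are assumptions made for illustration, not functions or data from
VariantAnnotation or from the linked script. The feature names are tagged with
the gene or transcript ID only for variants that hit more than one of them.

```r
# Toy hits: one row per variant/transcript annotation (all values invented).
hits <- data.frame(
  VARID    = c("v1", "v1", "v2", "v2", "v3"),
  GENEID   = c("GENE1", "GENE2", "GENE3", "GENE3", "GENE4"),
  TXID     = c("tx1", "tx2", "tx3", "tx4", "tx5"),
  LOCATION = c("coding", "coding", "intron", "coding", "fiveUTR"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one line per variant, gene- or transcript-centric.
summarize_variants <- function(hits, summary_mode = c("gene", "transcript")) {
  summary_mode <- match.arg(summary_mode)
  key <- if (summary_mode == "gene") "GENEID" else "TXID"
  rows <- lapply(split(hits, hits$VARID), function(h) {
    ids <- unique(h[[key]])
    # Append the ID to the feature name only when several IDs are hit.
    labels <- if (length(ids) > 1) {
      paste(h$LOCATION, h[[key]], sep = "_")
    } else {
      h$LOCATION
    }
    data.frame(
      VARID    = h$VARID[1],
      ID       = paste(ids, collapse = "__"),
      LOCATION = paste(unique(labels), collapse = "__"),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

gene_report <- summarize_variants(hits, summary_mode = "gene")
tx_report   <- summarize_variants(hits, summary_mode = "transcript")
```

In gene mode the variant overlapping two genes is labelled
coding_GENE1__coding_GENE2, while variants within a single gene keep the plain
feature names.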
> In general, this is obviously one of these tasks where it will be hard
> to reach consensus among biologists on what exactly the ideal VAR summary
> report should look like. However, tackling this problem at least somehow
> is extremely important, as for biologists this may be one of the most
> crucial features of any variant annotation tool. Most of them will not
> know how to get things from VRanges/GRanges/VCF objects into a file
> containing fewer than 100K lines that they can easily digest in a
> spreadsheet program and that is also supported in the supplement section of
> most scientific journals (usually limited to Excel).
>
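As a rough illustration of the export step mentioned above, the following
sketch writes a collapsed report to a tab-delimited file that spreadsheet
programs can open. The toy data frame, the file name and the 100K row check
are assumptions for this example only.

```r
# Toy collapsed report: one row per variant (all values invented).
report <- data.frame(
  VARID    = c("chr1:1033_A/T", "chr2:77_T/G"),
  LOCATION = c("coding promoter", "fiveUTR"),
  stringsAsFactors = FALSE
)

# Hypothetical output path in the session's temporary directory.
out_file <- file.path(tempdir(), "summary_var_report.txt")

# Guard against reports that would be unwieldy in a spreadsheet.
max_rows <- 1e5
stopifnot(nrow(report) < max_rows)

# Tab-delimited, unquoted, without row names.
write.table(report, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE)
```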
> Best,
>
> Thomas
>
>
> On Mon, Dec 09, 2013 at 08:07:34PM +0000, Valerie Obenchain wrote:
>> Hi Thomas,
>>
>> On 12/08/2013 09:08 AM, Thomas Girke wrote:
>>> Dear Michael and Valerie,
>>>
>>> VariantTools and VariantAnnotation are awesome packages. To the best of my
>>> knowledge, VariantTools is currently the only Bioc/R package that performs
>>> variant calling and it does this in a very nice way. With the available
>>> resources it is now straightforward to set up complete workflows for variant
>>> calling projects: (1) variant aware read alignments with GSNAP from gmapR ->
>>> (2) variant calling/filtering with VariantTools -> (3) adding genomic context
>>> with VariantAnnotation. This is really amazing!!!
>>>
>>> Here are a few questions related to both packages:
>>>
>>> (1) For teaching purposes and other obvious reasons it would be useful if a
>>> Windows version of VariantTools were available (and perhaps for gmapR too).
>>> Installing the package (includes gmapR) from source works fine on both Linux
>>> and OS X, but not on Windows.
>>>
>>> (2) The VRanges class is another great resource for filtering variant calls.
>>> What I was not able to locate though is a description/definition of the content
>>> of its different columns/components. Is something like this available
>>> somewhere?
>>>
>>> (3) When annotating variants with utilities from VariantAnnotation, it would
>>> be useful to provide a convenience Summary Report function at the end of the
>>> workflow that exports the annotations to a file. A very common need here is to
>>> collapse the annotations for each variant on a single line so that one doesn't
>>> end up with annotation results of millions of lines, as is typical for many
>>> variant discovery projects. This also simplifies joins among different
>>> annotation instances because it maintains uniqueness among variant identifiers.
>>> This approach is often useful when comparing (joining) the variants among
>>> different genotypes (e.g. which variants are identical or unique among
>>> different mutants). An example solution is shown on slides 34-35 of this
>>> presentation:
>>> http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_12_16_2013/Rvarseq/Rvarseq.pdf
>>>
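The collapsing step described in (3) can be sketched in base R as follows.
This is only an illustration on an invented table and is not the
variantReport() code from the slides. The column names and the second variant
set (mutant_b) are assumptions. Because each variant ends up on exactly one
row, the identifiers stay unique and can be compared across genotypes with
plain set operations.

```r
# Toy annotation table: one row per variant/annotation hit (all values invented).
hits <- data.frame(
  VARID    = c("chr1:1033_A/T", "chr1:1033_A/T", "chr1:2050_G/C", "chr2:77_T/G"),
  GENEID   = c("GENE1", "GENE1", "GENE2", "GENE3"),
  LOCATION = c("coding", "promoter", "intron", "fiveUTR"),
  stringsAsFactors = FALSE
)

# Collapse a character vector to its unique values, in order of appearance.
collapse_unique <- function(x) paste(unique(x), collapse = " ")

# One row per variant; multiple annotations share a single cell.
summary_report <- aggregate(
  hits[c("GENEID", "LOCATION")],
  by  = list(VARID = hits$VARID),
  FUN = collapse_unique
)

# Compare the variant identifiers of two genotypes (second set invented).
mutant_a <- summary_report$VARID
mutant_b <- c("chr1:2050_G/C", "chr3:500_C/A")
shared   <- intersect(mutant_a, mutant_b)  # variants found in both
unique_a <- setdiff(mutant_a, mutant_b)    # variants found only in mutant_a
```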
>>
>> The variantReport() and codingReport() functions look great. Would you
>> be willing to contribute them to VariantAnnotation?
>>
>>> (4) predictCoding() reports the relative location where exactly a variant maps
>>> to an annotation range. It would be nice if locateVariants() could report the
>>> exact relative mapping locations too, e.g. variant chr1:1033_A/T maps to
>>> position x of 5'UTR. Perhaps this is already possible but I couldn't figure
>>> out how to do it without reaching too far into my own hacking toolbox.
>>>
>>
>> I could add a 'REFLOC' column to the output of locateVariants() that
>> would essentially be the "equivalent" to 'CDSLOC' from predictCoding().
>>
>> Valerie
>>
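To make the idea of a relative mapping location concrete, here is a small
base R sketch. The helper relative_location() is hypothetical and not part of
VariantAnnotation, and it is not meant to show how a 'REFLOC' column would
actually be computed. It assumes the feature is a single contiguous range and
counts positions from the feature's 5' end.

```r
# Hypothetical helper: 1-based position of a variant within the feature it
# overlaps. Minus-strand features are measured from their end coordinate,
# so the result is always relative to the 5' end of the feature.
relative_location <- function(pos, feature_start, feature_end, strand = "+") {
  stopifnot(all(pos >= feature_start & pos <= feature_end))
  ifelse(strand == "-",
         feature_end - pos + 1L,
         pos - feature_start + 1L)
}

# Invented example: a variant at position 1033 in a feature spanning 1000-1100.
plus_loc  <- relative_location(1033L, 1000L, 1100L, "+")
minus_loc <- relative_location(1033L, 1000L, 1100L, "-")
```

Features that span several exons would need the intronic gaps removed before
counting, which this sketch does not attempt.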
>>
>>> Thanks for providing these excellent resources and, most importantly, for your
>>> patience in listening to these unsolicited questions.
>>>
>>> Best,
>>>
>>>
>>> Thomas
>>>
>>>
>>>
>>> > sessionInfo()
>>> R version 3.0.2 (2013-09-25)
>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] VariantTools_1.4.5      VariantAnnotation_1.8.7 Rsamtools_1.14.2
>>> [4] Biostrings_2.30.1       GenomicRanges_1.14.3    XVector_0.2.0
>>> [7] IRanges_1.20.6          BiocGenerics_0.8.0
>>>
>>> loaded via a namespace (and not attached):
>>>  [1] AnnotationDbi_1.24.0   BatchJobs_1.1-1135     BBmisc_1.4
>>>  [4] Biobase_2.22.0         BiocParallel_0.4.1     biomaRt_2.18.0
>>>  [7] bitops_1.0-6           brew_1.0-6             BSgenome_1.30.0
>>> [10] codetools_0.2-8        DBI_0.2-7              digest_0.6.3
>>> [13] fail_1.2               foreach_1.4.1          GenomicFeatures_1.14.2
>>> [16] gmapR_1.4.2            grid_3.0.2             iterators_1.0.6
>>> [19] lattice_0.20-24        Matrix_1.1-0           plyr_1.8
>>> [22] RCurl_1.95-4.1         RSQLite_0.11.4         rtracklayer_1.22.0
>>> [25] sendmailR_1.1-2        stats4_3.0.2           tools_3.0.2
>>> [28] XML_3.95-0.2           zlibbioc_1.8.0
>>>
>>>
>>
>>
>> --
>> Valerie Obenchain
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B155
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: vobencha at fhcrc.org
>> Phone:  (206) 667-3158
>> Fax:    (206) 667-1319