[BioC] problems with strand in predictCoding

Fri Apr 20 19:03:43 CEST 2012

On 04/20/2012 07:28 AM, Tim Triche, Jr. wrote:
> I meant to put this in my initial reply, but I firmly agree with Michael
> Lawrence.
>
> The SummarizedExperiment class is a great thing, and subsetting it by (say)
>   my.SE[ a.GRanges, some.patients ] is unbelievably convenient, especially
> if it needs to be done repeatedly, or by items in a GRangesList.  The
> strand issue is a nasty gotcha sometimes.

Hi Tim -- kind of a different thread here, but I wanted to say that yes, 
we've heard you. A problem is the many-to-many mapping that's possible, 
for instance for some idx, we really expect nrow(x[idx,]) == 
length(idx); there are additional complexities introduced by, e.g., a 
GRangesList and in the end the convenience for some use cases didn't 
seem to outweigh the ambiguity and complexity for the general case.

Martin

>
> just MHO though.
>
>
>
> On Fri, Apr 20, 2012 at 7:25 AM, Tim Triche, Jr.<tim.triche at gmail.com>wrote:
>
>>> a Bioc Inferno book
>>
>> This is one of the better ideas to be bandied about lately...
>>
>>
>>
>>
>> On Fri, Apr 20, 2012 at 5:43 AM, Michael Lawrence<
>> lawrence.michael at gene.com>  wrote:
>>
>>> On Thu, Apr 19, 2012 at 6:46 PM, Jeremiah Degenhardt<
>>> degenhardt.jeremiah at gene.com>  wrote:
>>>
>>>> Hello BioC,
>>>>
>>>> I am using the predictCoding function in the VariantAnnotation package
>>>> and have run into a few issues.
>>>>
>>>> The first issue that I found is a small one. While I can supply any
>>>> txdb object that I want to the predictCoding, the function seems to
>>>> still be looking for TxDb.Hsapiens.UCSC.hg19.knownGene. Running
>>>> predict coding without this package installed results in the following
>>>> error:
>>>>
>>>> Error in isCircular(TxDb.Hsapiens.UCSC.hg19.knownGene) :
>>>>    error in evaluating the argument 'x' in selecting a method for
>>>> function 'isCircular': Error: object
>>>> 'TxDb.Hsapiens.UCSC.hg19.knownGene' not found
>>>>
>>>> The second set of issues are a little more subtle. It seems that
>>>> predictCoding is not properly handling the strand of the variants, at
>>>> least in my opinion.
>>>>
>>>> If I provide an unstranded GRanges with variants that overlap a gene
>>>> on the negative strand, the predictCoding function will not reverse
>>>> complement the varAllele resulting in an incorrect functional
>>>> consequence prediction. Here is some code to generate an example
>>>>
>>>> library(VariantAnnotation)
>>>> library(BSgenome.Hsapiens.UCSC.hg19)
>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>> txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene
>>>> GR<- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1),
>>>> variant = DNAStringSet("T"))
>>>> predictCoding(GR, txdb, Hsapiens, values(GR)$variant)
>>>>
>>>> GRanges with 4 ranges and 13 elementMetadata cols:
>>>>       seqnames               ranges strand |        variant
>>>> varAllele
>>>>          <Rle>             <IRanges>   <Rle>  |<DNAStringSet>
>>>> <DNAStringSet>
>>>>   [1]    chr17 [63554271, 63554271]      * |              T
>>>>   T
>>>>   [2]    chr17 [63554271, 63554271]      * |              T
>>>>   T
>>>>   [3]    chr17 [63554271, 63554271]      * |              T
>>>>   T
>>>>   [4]    chr17 [63554271, 63554271]      * |              T
>>>>   T
>>>>           cdsLoc              proteinLoc   queryID        txID
>>>> cdsID
>>>>        <IRanges>  <CompressedIntegerList>  <integer>  <character>
>>>> <integer>
>>>>   [1] [468, 468]                     156         1       65536
>>>> 193561
>>>>   [2] [468, 468]                     156         1       65537
>>>> 193561
>>>>   [3] [468, 468]                     156         1       65538
>>>> 193561
>>>>   [4] [468, 468]                     156         1       65539
>>>> 193564
>>>>            geneID consequence       refCodon       varCodon
>>>> refAA
>>>>       <character>     <factor>  <DNAStringSet>  <DNAStringSet>
>>>> <AAStringSet>
>>>>   [1]        8313  synonymous            TAC            TAT
>>>>   Y
>>>>   [2]        8313  synonymous            TAC            TAT
>>>>   Y
>>>>   [3]        8313  synonymous            TAC            TAT
>>>>   Y
>>>>   [4]        8313  synonymous            TAC            TAT
>>>>   Y
>>>>               varAA
>>>>       <AAStringSet>
>>>>   [1]             Y
>>>>   [2]             Y
>>>>   [3]             Y
>>>>   [4]             Y
>>>>   ---
>>>>   seqlengths:
>>>>    chr17
>>>>       NA
>>>>
>>>> Running this the function predicts this to be TAC>  TAT, a synonymous
>>>> change. However, as the gene is on the negative strand the actual
>>>> result of this variant should be TAC>  TAA or a stop mutation.
>>>>
>>>> If I instead give predictCoding a stranded GRanges to let the function
>>>> know that the base reported is with respect to the positive strand I
>>>> get this result:
>>>>
>>>>> GR<- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1),
>>>> strand="+", variant = DNAStringSet("T"))
>>>>> predictCoding(GR, txdb, Hsapiens, values(GR)$variant)
>>>> GRanges with 0 ranges and 0 elementMetadata cols:
>>>>
>>>>    seqnames    ranges strand
>>>>       <Rle>  <IRanges>   <Rle>
>>>>   ---
>>>>   seqlengths:
>>>>
>>>>
>>>> Now instead of reverse complementing the varAllele, and giving me the
>>>> correct annotation, the function simply misses that the mutation
>>>> overlaps with any gene and returns an empty GRanges.
>>>>
>>>> This issue is not really a bug in predictCoding, but results from what
>>>> is (again in my opinion) an unfortunate treatment of the negative and
>>>> positive strand as separate entities in GRanges. From a biological
>>>> prospective this almost never makes sense and is not how I would
>>>> expect the functions in GRanges to behave. Just because a gene is
>>>> annotated on the negative strand does not mean that I don't want to
>>>> know it overlaps with a gene on the positive strand. Maybe there are
>>>> cases where this is the desired behavior, so it might make sense to
>>>> add this a switch you could turn on or off but in most cases this will
>>>> simply lead to erroneous results such as the one above.
>>>
>>>
>>> There is an "ignore.strand" argument to findOverlaps, so we have a switch.
>>> I have always thought that strand should be ignored by default in
>>> operations like overlap detection, and only considered as a "direction"
>>> rather than as separate in space. It's very useful for resize() and
>>> flank()
>>> to consider strand, but not so useful for findOverlaps. The
>>> ignore.strand=FALSE in those cases default would qualify for the eight
>>> circle if there were a Bioc Inferno book. It's only the default that I
>>> argue with though, having the capability to consider strand is useful.
>>>
>>>
>>>> I think the
>>>> most useful information that comes from knowing the strand is knowing
>>>> when you need to reverse complement a sequence for the underlying
>>>> position of interest.
>>>>
>>>> A third possible option here would be to call the function twice, once
>>>> with for the positive strand and once for the negative strand with the
>>>> variants reverse complemented and then merge the results. This
>>>> however, seems quite inefficient and not like something that I should
>>>> have to do.
>>>>
>>>> > From my perspective the proper behavior of predictCoding with respect
>>>> to strand would be to treat unstranded GRanges as positive strand as
>>>> this is the reference strand for things like the genome builds and vcf
>>>> files. Then the variants should overlap positive and negative strand
>>>> genes and should be reverse complemented for consequence prediction on
>>>> the negative strand genes. It should also allow overlap of negative
>>>> and positive stranded variants with both negative and positive
>>>> stranded genes, but properly reverse complement the variant in each
>>>> case to get the proper consequence.
>>>>
>>>> thanks in advance,
>>>>
>>>> Jeremiah
>>>>
>>>> here is my session info incase it is needed:
>>>>
>>>>> sessionInfo()
>>>> R Under development (unstable) (2012-04-11 r58985)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>   [7] LC_PAPER=C                 LC_NAME=C
>>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>>   [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1
>>>>   [2] GenomicFeatures_1.8.1
>>>>   [3] AnnotationDbi_1.18.0
>>>>   [4] Biobase_2.16.0
>>>>   [5] BSgenome.Hsapiens.UCSC.hg19_1.3.17
>>>>   [6] BSgenome_1.24.0
>>>>   [7] VariantAnnotation_1.2.5
>>>>   [8] Rsamtools_1.8.1
>>>>   [9] Biostrings_2.24.1
>>>> [10] GenomicRanges_1.8.3
>>>> [11] IRanges_1.14.2
>>>> [12] BiocGenerics_0.2.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>>   [1] biomaRt_2.12.0     bitops_1.0-4.1     compiler_2.16.0
>>>> DBI_0.2-5
>>>>   [5] grid_2.16.0        lattice_0.20-6     Matrix_1.0-6
>>>> RCurl_1.91-1
>>>>   [9] RSQLite_0.11.1     rtracklayer_1.16.1 snpStats_1.6.0
>>>> splines_2.16.0
>>>> [13] stats4_2.16.0      survival_2.36-12   tools_2.16.0
>>>> XML_3.9-4
>>>> [17] zlibbioc_1.2.0
>>>>
>>>> --
>>>> Jeremiah Degenhardt, Ph.D.
>>>> Computational Biologist
>>>> Bioinformatics and Computational Biology
>>>> Genentech, Inc.
>>>> degenhardt.jeremiah at gene.com
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>>
>> --
>> *A model is a lie that helps you see the truth.*
>> *
>> *
>> Howard Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>
>>
>
>

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793