[BioC] problems with strand in predictCoding

Hervé Pagès hpages at fhcrc.org
Fri Apr 20 22:50:39 CEST 2012


Hi Jeremiah and other coding predictors,

On 04/19/2012 06:59 PM, Sean Davis wrote:
> On Thu, Apr 19, 2012 at 9:46 PM, Jeremiah Degenhardt
> <degenhardt.jeremiah at gene.com>  wrote:
>> Hello BioC,
>>
>> I am using the predictCoding function in the VariantAnnotation package
>> and have run into a few issues.
>>
>> The first issue that I found is a small one. While I can supply any
>> txdb object that I want to the predictCoding, the function seems to
>> still be looking for TxDb.Hsapiens.UCSC.hg19.knownGene. Running
>> predict coding without this package installed results in the following
>> error:
>>
>> Error in isCircular(TxDb.Hsapiens.UCSC.hg19.knownGene) :
>>    error in evaluating the argument 'x' in selecting a method for
>> function 'isCircular': Error: object
>> 'TxDb.Hsapiens.UCSC.hg19.knownGene' not found
>>
>> The second set of issues are a little more subtle. It seems that
>> predictCoding is not properly handling the strand of the variants, at
>> least in my opinion.
>>
>> If I provide an unstranded GRanges with variants that overlap a gene
>> on the negative strand, the predictCoding function will not reverse
>> complement the varAllele resulting in an incorrect functional
>> consequence prediction. Here is some code to generate an example
>>
>> library(VariantAnnotation)
>> library(BSgenome.Hsapiens.UCSC.hg19)
>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>> txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene
>> GR<- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1),
>> variant = DNAStringSet("T"))
>> predictCoding(GR, txdb, Hsapiens, values(GR)$variant)
>>
>> GRanges with 4 ranges and 13 elementMetadata cols:
>>       seqnames               ranges strand |        variant
>> varAllele
>>          <Rle>              <IRanges>    <Rle>  |<DNAStringSet>
>> <DNAStringSet>
>>   [1]    chr17 [63554271, 63554271]      * |              T
>>   T
>>   [2]    chr17 [63554271, 63554271]      * |              T
>>   T
>>   [3]    chr17 [63554271, 63554271]      * |              T
>>   T
>>   [4]    chr17 [63554271, 63554271]      * |              T
>>   T
>>           cdsLoc              proteinLoc   queryID        txID
>> cdsID
>>        <IRanges>  <CompressedIntegerList>  <integer>  <character>
>> <integer>
>>   [1] [468, 468]                     156         1       65536
>> 193561
>>   [2] [468, 468]                     156         1       65537
>> 193561
>>   [3] [468, 468]                     156         1       65538
>> 193561
>>   [4] [468, 468]                     156         1       65539
>> 193564
>>            geneID consequence       refCodon       varCodon
>> refAA
>>       <character>      <factor>  <DNAStringSet>  <DNAStringSet>
>> <AAStringSet>
>>   [1]        8313  synonymous            TAC            TAT
>>   Y
>>   [2]        8313  synonymous            TAC            TAT
>>   Y
>>   [3]        8313  synonymous            TAC            TAT
>>   Y
>>   [4]        8313  synonymous            TAC            TAT
>>   Y
>>               varAA
>>       <AAStringSet>
>>   [1]             Y
>>   [2]             Y
>>   [3]             Y
>>   [4]             Y
>>   ---
>>   seqlengths:
>>    chr17
>>       NA
>>
>> Running this the function predicts this to be TAC>  TAT, a synonymous
>> change. However, as the gene is on the negative strand the actual
>> result of this variant should be TAC>  TAA or a stop mutation.
>>
>> If I instead give predictCoding a stranded GRanges to let the function
>> know that the base reported is with respect to the positive strand I
>> get this result:
>>
>>> GR<- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), strand="+", variant = DNAStringSet("T"))
>>> predictCoding(GR, txdb, Hsapiens, values(GR)$variant)
>> GRanges with 0 ranges and 0 elementMetadata cols:
>>
>>    seqnames    ranges strand
>>       <Rle>  <IRanges>    <Rle>
>>   ---
>>   seqlengths:
>>
>>
>> Now instead of reverse complementing the varAllele, and giving me the
>> correct annotation, the function simply misses that the mutation
>> overlaps with any gene and returns an empty GRanges.
>>
>> This issue is not really a bug in predictCoding, but results from what
>> is (again in my opinion) an unfortunate treatment of the negative and
>> positive strand as separate entities in GRanges. From a biological
>> prospective this almost never makes sense and is not how I would
>> expect the functions in GRanges to behave. Just because a gene is
>> annotated on the negative strand does not mean that I don't want to
>> know it overlaps with a gene on the positive strand. Maybe there are
>> cases where this is the desired behavior, so it might make sense to
>> add this a switch you could turn on or off but in most cases this will
>> simply lead to erroneous results such as the one above. I think the
>> most useful information that comes from knowing the strand is knowing
>> when you need to reverse complement a sequence for the underlying
>> position of interest.
>>
>> A third possible option here would be to call the function twice, once
>> with for the positive strand and once for the negative strand with the
>> variants reverse complemented and then merge the results. This
>> however, seems quite inefficient and not like something that I should
>> have to do.
>>
>> > From my perspective the proper behavior of predictCoding with respect
>> to strand would be to treat unstranded GRanges as positive strand as
>> this is the reference strand for things like the genome builds and vcf
>> files. Then the variants should overlap positive and negative strand
>> genes and should be reverse complemented for consequence prediction on
>> the negative strand genes. It should also allow overlap of negative
>> and positive stranded variants with both negative and positive
>> stranded genes, but properly reverse complement the variant in each
>> case to get the proper consequence.
>
> I agree with Jeremiah that the treatment of unstranded variants should
> be to default to treating them as on the positive strand and reverse
> complement them for negative strand genes.  All other variant
> prediction softwares that I know of make this assumption and the VCF
> format itself seems to make something of an implicit assumption on
> this point.

That makes a lot of sense.

In addition I think it would be nice if the GRanges object returned
by predictCoding() was "stranded" and if the varAllele columns was
reporting the allele with respect to the strand. Just to clarify, this
would be the strand of the CDS where the translation if happening, not
the strand specified in the input GRanges object. For example in
Jeremiah's example, we would get something like:

 > predictCoding(GR, txdb, Hsapiens, DNAStringSet("T"))
GRanges with 4 ranges and 12 elementMetadata cols:
       seqnames               ranges strand |      varAllele ...
          <Rle>            <IRanges>  <Rle> | <DNAStringSet> ...
   [1]    chr17 [63554271, 63554271]      - |              A ...
   [2]    chr17 [63554271, 63554271]      - |              A ...
   [3]    chr17 [63554271, 63554271]      - |              A ...
   [4]    chr17 [63554271, 63554271]      - |              A ...

So what we see in the varAllele column would be consitent with what
we see in the refCodon and varCodon columns.

Cheers,
H.

>
> Sean
>
>> thanks in advance,
>>
>> Jeremiah
>>
>> here is my session info incase it is needed:
>>
>>> sessionInfo()
>> R Under development (unstable) (2012-04-11 r58985)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>   [7] LC_PAPER=C                 LC_NAME=C
>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>>   [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1
>>   [2] GenomicFeatures_1.8.1
>>   [3] AnnotationDbi_1.18.0
>>   [4] Biobase_2.16.0
>>   [5] BSgenome.Hsapiens.UCSC.hg19_1.3.17
>>   [6] BSgenome_1.24.0
>>   [7] VariantAnnotation_1.2.5
>>   [8] Rsamtools_1.8.1
>>   [9] Biostrings_2.24.1
>> [10] GenomicRanges_1.8.3
>> [11] IRanges_1.14.2
>> [12] BiocGenerics_0.2.0
>>
>> loaded via a namespace (and not attached):
>>   [1] biomaRt_2.12.0     bitops_1.0-4.1     compiler_2.16.0
>> DBI_0.2-5
>>   [5] grid_2.16.0        lattice_0.20-6     Matrix_1.0-6
>> RCurl_1.91-1
>>   [9] RSQLite_0.11.1     rtracklayer_1.16.1 snpStats_1.6.0
>> splines_2.16.0
>> [13] stats4_2.16.0      survival_2.36-12   tools_2.16.0
>> XML_3.9-4
>> [17] zlibbioc_1.2.0
>>
>> --
>> Jeremiah Degenhardt, Ph.D.
>> Computational Biologist
>> Bioinformatics and Computational Biology
>> Genentech, Inc.
>> degenhardt.jeremiah at gene.com
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list