[BioC] VCF predictCoding(...) providing inconsistent results?

Valerie Obenchain vobencha at fhcrc.org
Wed Jan 23 06:02:16 CET 2013


Hi Murat,

Sorry for the delayed response, I've been out of town. I can't reproduce 
this error because I don't have the alleles for the range. Can you send 
a small subset of the VCF?

The seqlevel preprocessing is still necessary to make the annotation and 
VCF file match. This can be done with renameSeqlevels() or seqlevels(). 
However removing the structural variants is no longer needed in >= 
VariantAnnotation 1.5.28 in the devel branch. From that version on 
predictCoding() will ignore structural variants and return results for 
non-structural only.

Valerie


On 01/15/13 13:57, Murat Tasan wrote:
> i should add that i actually had to pre-process the vcf file a bit to
> remedy two problems:
>
> (1) i changed the seqlevels like so:
>
>    seqlevels(vcf)<- sprintf("chr%s", seqlevels(vcf))
>
> and (2) i had to remove structural variants, but i think i did this
> without corrupting the resulting vcf structure... here's that bit of
> code:
>
>    structural_lix<- rep(FALSE, length(alt(vcf)))
>    structural_lix[sapply(alt(vcf), function(x) any(grepl("<", x)))]<- TRUE
>    table(structural_lix) ## just to see how many entries will be
> removed.
>
>    vcf<- vcf[!structural_lix,]
>    alt(vcf)<- VariantAnnotation:::.toDNAStringSetList(unlist(alt(vcf),
> use.names = FALSE))
>
> (the structural-variant removal code was inspired and closely follows
> that described here:
> https://stat.ethz.ch/pipermail/bioconductor/2012-October/048448.html)
>
> after these two steps, the vcf object seemed to be perfectly happy,
> until the predictCoding(...) strangeness described in the previous
> post.
>
> cheers,
>
> -m
>
>
> On Tue, Jan 15, 2013 at 4:34 PM, Murat Tasan<mmuurr at gmail.com>  wrote:
>> hi all - i found a strange case while examining a VCF object extracted
>> from chromosome 1 of the 1000 Genomes Project "integrated_call_set"
>> files.
>> i first extracted all variants falling in CDS regions of some of my
>> genes of interest into a vcf object 'vcf'.
>>
>> then i ran predictCoding(vcf, TXDB, Hsapiens), where TXDB and Hsapiens
>> come from TxDb.Hsapiens.UCSC.hg19.knownGene and
>> BSgenome.Hsapiens.UCSC.hg19, respectively:
>>
>> vcf_coding_info_GRanges<- predictCoding(vcf, TXDB, Hsapiens).
>>
>> now, examining a single variant i got pretty confused, from two cases:
>>
>> CASE 1
>>
>>> vcf_coding_info_GRanges[names(vcf_coding_info_GRanges) == "rs78192809",]
>> GRanges with 6 ranges and 13 metadata columns:
>>               seqnames                 ranges strand |     paramRangeID
>>       varAllele       CDSLOC              PROTEINLOC   QUERYID
>> TXID     CDSID      GENEID   CONSEQUENCE
>>                  <Rle>               <IRanges>   <Rle>  |<factor>
>> <DNAStringSet>     <IRanges>  <CompressedIntegerList>  <integer>
>> <character>  <integer>  <character>       <factor>
>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>               T [1564, 1564]                     522      9426
>> 6991     20852       89866      nonsense
>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>               T [1988, 1988]                     663      9426
>> 6992     20852       89866 nonsynonymous
>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>               T [3008, 3008]                    1003      9426
>> 6993     20852       89866 nonsynonymous
>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>               T [1985, 1985]                     662      9426
>> 6994     20852       89866 nonsynonymous
>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>               T [3011, 3011]                    1004      9426
>> 6995     20852       89866    synonymous
>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>               T [2039, 2039]                     680      9426
>> 6996     20852       89866    synonymous
>>                     REFCODON       VARCODON         REFAA         VARAA
>>               <DNAStringSet>  <DNAStringSet>  <AAStringSet>  <AAStringSet>
>>    rs78192809            CAG            TAG             Q             *
>>    rs78192809            CCA            CTA             P             L
>>    rs78192809            CCA            CTA             P             L
>>    rs78192809            CCA            CTA             P             L
>>    rs78192809            CCA            CCA             P             P
>>    rs78192809            CCA            CCA             P             P
>>    ---
>>    seqlengths:
>>     chr1
>>       NA
>>
>> CASE 2: i extracted the subset of the original vcf object and ran
>> predictCoding(...) on that sub-object:
>>
>>> predictCoding(vcf["rs78192809",], TXDB, Hsapiens)
>> predictCoding(vcf["rs78192809",], TXDB, Hsapiens)
>> GRanges with 6 ranges and 13 metadata columns:
>>               seqnames                 ranges strand |     paramRangeID
>>       varAllele       CDSLOC              PROTEINLOC   QUERYID
>> TXID     CDSID      GENEID   CONSEQUENCE
>>                  <Rle>               <IRanges>   <Rle>  |<factor>
>> <DNAStringSet>     <IRanges>  <CompressedIntegerList>  <integer>
>> <character>  <integer>  <character>       <factor>
>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>               T [1564, 1564]                     522         1
>> 6991     20852       89866      nonsense
>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>               T [1988, 1988]                     663         1
>> 6992     20852       89866 nonsynonymous
>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>               T [3008, 3008]                    1003         1
>> 6993     20852       89866 nonsynonymous
>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>               T [1985, 1985]                     662         1
>> 6994     20852       89866 nonsynonymous
>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>               T [3011, 3011]                    1004         1
>> 6995     20852       89866 nonsynonymous
>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>               T [2039, 2039]                     680         1
>> 6996     20852       89866 nonsynonymous
>>                     REFCODON       VARCODON         REFAA         VARAA
>>               <DNAStringSet>  <DNAStringSet>  <AAStringSet>  <AAStringSet>
>>    rs78192809            CAG            TAG             Q             *
>>    rs78192809            CCA            CTA             P             L
>>    rs78192809            CCA            CTA             P             L
>>    rs78192809            CCA            CTA             P             L
>>    rs78192809            CCA            CTA             P             L
>>    rs78192809            CCA            CTA             P             L
>>    ---
>>    seqlengths:
>>     chr1
>>       NA
>>
>> for nearly all fields, especially the CDSID, TXID, and the location
>> data of the variant in the sequences, the results are the same.
>> BUT, the CONSEQUENCE, VARCODON, and VARAA are inconsistent between the
>> two calls.
>>
>> have i done something wrong in the second attempt (i.e. does
>> subsetting on the VCF variants first cause a problem when submitting
>> the result to predictCoding(...))?
>>
>> thanks for any help/insight.
>>
>> -m
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list