[BioC] VariantAnnotation::readVcf is reordering the FORMAT field of my VCF file

Valerie Obenchain vobencha at fhcrc.org
Tue Mar 5 05:25:31 CET 2013


Hi Pete,

The order is taken from the header. readVcf() attempts to assign data 
type and order based on header information. If there is no header line 
then no data type is assigned and the variable is left unparsed.

 > hd <- scanVcfHeader('readVcf_problem.vcf.gz')
 > geno(hd)
DataFrame with 5 rows and 3 columns
         Number        Type
    <character> <character>
AD           .     Integer
DP           1     Integer
GQ           1     Integer
GT           1      String
PL           G     Integer
...
...

 > vcf <- readVcf("readVcf_problem.vcf.gz", "hg19")
 > geno(vcf)
List of length 5
names(5): AD DP GQ GT PL

I haven't run into 'GT' being out of order before - or at least causing 
problems downstream. Your quick fix is to move your GT header line above 
'AD'. I will look into keeping 'GT' first reguardles of header order.

Thanks for pointing this out.
Valerie


On 03/04/13 18:32, hickey at wehi.EDU.AU wrote:
> Hi Valerie,
>
> readVcf is reordering the fields in the FORMAT field of my VCF file 
> which is causing me problems downstream. I am using VariantAnnotation 
> to subset my VCF file and then write the subset to file using 
> writeVcf. However, the reordering of the FORMAT fields means by new 
> VCF does not comply with VCF spec, specifically the GT-field is not 
> the first sub-field of the FORMAT field. A reproducible example follows.
>
> Thanks,
> Pete
>
> > sessionInfo()
> R version 2.15.2 Patched (2013-02-10 r61900)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] VariantAnnotation_1.4.9 Rsamtools_1.10.2  Biostrings_2.26.3       
> GenomicRanges_1.10.7
> [5] IRanges_1.16.6          BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
>  [1] AnnotationDbi_1.20.5   Biobase_2.18.0 biomaRt_2.14.0         
> bitops_1.0-4.2
>  [5] BSgenome_1.26.1        DBI_0.2-5  GenomicFeatures_1.10.2 
> parallel_2.15.2
>  [9] RCurl_1.95-3           RSQLite_0.11.2 rtracklayer_1.18.2     
> stats4_2.15.2
> [13] tools_2.15.2           XML_3.95-0.1 zlibbioc_1.4.0
>
> #### Bug report: readVcf() reordering FORMAT column of VCF file in 
> violation of VCF spec ####
> library(VariantAnnotation)
> ## Download my example VCF
> download.file('http://dl.dropbox.com/u/17421746/readVcf_problem.vcf.gz', 
> destfile = '/tmp/readVcf_problem.vcf.gz')
> ## Cat the VCF: Note that the FORMAT column is GT:AD:DP:GQ:PL
> system(command = 'zcat /tmp/readVcf_problem.vcf.gz') # Might need to 
> alter this command depending on system; works on my linux machine but 
> need 'gzcat' on my mac
> ## Read in the VCF using readVcf()
> problem_vcf <- readVcf('/tmp/readVcf_problem.vcf.gz', genome = 'mm9')
> ## Note the order of geno(), which stores each field in the FORMAT 
> column as a list element, is "AD DP GQ GT PL"
> geno(problem_vcf)
> ## Write the file back to disk using writeVcf()
> writeVcf(problem_vcf, '/tmp/problem.vcf', index = TRUE)
> ## Cat the newly written version of the VCF: Note that the FORMAT 
> field is now AD:DP:GQ:GT:PL, which is contrary to VCF spec which says 
> of the FORMAT field: "The first sub-field must always be the genotype 
> (GT) if it is present" (see 
> http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41)
> system(command = 'zcat /tmp/problem.vcf.gz') # Might need to alter 
> this command depending on system; works on my linux machine but need 
> 'gzcat' on my mac
>
> ## No such problem with the example VCF
> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> good_vcf <- readVcf(fl, genome = 'hg19')
> ## Sub-fields of FORMAT field are correctly ordered
> geno(good_vcf)
>
> --------------------------------
> Peter Hickey,
> PhD Student/Research Assistant,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Ph: +613 9345 2324
>
> hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
> http://www.wehi.edu.au
>
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:6}}



More information about the Bioconductor mailing list