[BioC] Why is writeVcf() slow?

Valerie Obenchain vobencha at fhcrc.org
Wed Oct 24 04:30:03 CEST 2012


Hi Peter,

I'd say yes, the warning is a cause for concern. Is the vcf you're 
working with publicly available?  If not, can you send a smaller subset 
that still produces the error?

Valerie


On 10/23/12 17:29, hickey at wehi.EDU.AU wrote:
> I am using the writeVcf() function in the VariantAnnotation package, which I understand to be 'under construction' from the documentation. I am wondering why writeVcf() is taking so long to do it's job and if there is anything I can do to speed this up? For example, I have a VCF object containing some 36,000 variants that I want to write to file and this takes nearly 3 hours. It also produces a warning that I am unsure how to interpret, i.e. should I be worried by it? The VCF written to disk appears to be valid.
>
> Many thanks for your help.
>
> ## Code
>> library(VariantAnnotation)
>> vcf<- readVcf(file = 'UnifiedGenotyper.autosomal.snps.indels.vcf.gz', genome = 'mm9')
>> vcf
> class: VCF
> dim: 292155 7
> genome: mm9
> exptData(1): header
> fixed(4): REF ALT QUAL FILTER
> info(21): AC AF ... SB STR
> geno(5): AD DP GQ GT PL
> rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696
>    chr19:61340708
> rowData values names(1): paramRangeID
> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409
> colData names(1): Samples
>
> ## Some subsetting of the 'vcf' object to create vcf.2aff.constrained
>
>> vcf.2aff.constrained
> class: VCF
> dim: 35514 7
> genome: mm9
> exptData(1): header
> fixed(4): REF ALT QUAL FILTER
> info(21): AC AF ... SB STR
> geno(5): AD DP GQ GT PL
> rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449
>    chr19:61340453
> rowData values names(1): paramRangeID
> colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385 506/Bcl2#409
> colData names(1): Samples
>
>> system.time(writeVcf(obj = vcf.2aff.constrained, filename = 'UnifiedGenotyper.autosomal.snps.indels.constrained.genotypes.2aff.vcf'))
>       user    system   elapsed
> 10653.768     0.244 10659.163
> Warning message:
> In .Method(..., deparse.level = deparse.level) :
>    number of rows of result is not a multiple of vector length (arg 1)
>
>> sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1        Biostrings_2.26.2
> [4] GenomicRanges_1.10.2    IRanges_1.16.2          BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
>   [1] AnnotationDbi_1.20.1   Biobase_2.18.0         biomaRt_2.14.0
>   [4] bitops_1.0-4.1         BSgenome_1.26.1        DBI_0.2-5
>   [7] GenomicFeatures_1.10.0 parallel_2.15.1        RCurl_1.95-1.1
> [10] RSQLite_0.11.2         rtracklayer_1.18.0     stats4_2.15.1
> [13] tools_2.15.1           XML_3.95-0.1           zlibbioc_1.4.0
>
> --------------------------------
> Peter Hickey,
> PhD Student/Research Assistant,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Ph: +613 9345 2324
>
> hickey at wehi.edu.au
> http://www.wehi.edu.au
>
>
> ______________________________________________________________________
> The information in this email is confidential and intend...{{dropped:8}}
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list