[BioC] VariantAnnotation: Performance and memory issues in readVcf

Ulrich Bodenhofer bodenhofer at bioinf.jku.at
Tue May 14 17:59:55 CEST 2013


Hi,

I am currently working on a package that takes input from VCF files 
(possibly large ones) and I wanted to use readVcf() from the 
VariantAnnotation package for reading the data. However, I have run into 
severe performance and memory issues. The following test report is based 
on a bgzipped and tabix-indexed VCF file with 100 samples/columns and 
approx 1,9 millions of variants/rows. I have previously failed to load 
data of this size entirely using readVcf(), so I read the data in 
chunks. I always had the impression that this was quite slow, but now I 
compared it with tabix on the command line and scanTabix() from the 
Rsamtools packages and the results were stunning. Here are some code 
chunks. As you can see, I am trying to read a 1,400,001bp region from 
chr1. This region actually contains 8,757 variants/rows:

     > tf <- TabixFile("testFile100.vcf.gz")
     > gr <- GRanges(seqnames="1", ranges=IRanges(start=100000,
    end=1500000))
     >
     > system.time(res1 <- readVcf(tf, "hg19", param=gr))
        user  system elapsed
    540.080   0.624 541.906
    There were 50 or more warnings (use warnings() to see the first 50)

The scanTabix() function from the Rsamtools package gives the following 
result:

     > system.time(res2 <- scanTabix(tf, param=gr))
        user  system elapsed
       0.170   0.002   0.171

The tabix command line tool takes approximately the same amount of time. 
I admit that scanTabix() just reads the data and does no parsing or any 
other processing, so I digged deeper and saw that readVcf() calls 
scanTabix() inside the function .vcf_scan(). Interestingly, this is done 
in a way that all the parsing is done inside the scanTabix() function by 
supplying a function that does the parsing through the undocumented 
tbxsym argument. So I tried the same approach:

     > maps <- VariantAnnotation:::.vcf_scan_header_maps(tf,
    fixed=character(), info=NA,
    +                                                   geno="GT")
    Warning message:
    In .bcfHeaderAsSimpleList(header) :
       duplicate keys in header will be forced to unique rownames
     > tbxstate <- maps[c("samples", "fmap", "imap", "gmap")]
     > tbxsym <- getNativeSymbolInfo(".tabix_as_vcf", "VariantAnnotation")
     >
     > system.time(res3 <- scanTabix(tf, param=gr, tbxsym=tbxsym,
    tbxstate=tbxstate))
        user  system elapsed
       0.511   0.006   0.517

So, even if I include the same way of parsing VCF data as readVcf(), the 
total time is still in the range of half a second, which is still 
approx. 1000 times faster than calling readVcf(). So where is all the 
performance lost? I can hardly imagine that 539.5 of 540 seconds are 
spent on data transformations.

A similar situation can be observed in terms of memory consumption: 
after having loaded the VariantAnnotation package (and all packages it 
depends upon), my R process occupies about 185MB main memory. Reading 
and parsing the data with scanTabix() increases the memory consumption 
by about 40MB, whereas calling readVcf() increases the memory 
consumption by more than 400MB . I do not think that such an amount of 
memory is needed to accommodate the output object res3; object.size() 
says it's about 8MB, but I know that these figure need not be accurate.

Actually, I do not need the whole flexibility of readVcf(). If 
necessary, I would be satisfied with a workaround like the one based on 
scanTabix() above. However, I do not like the idea too much to use 
undocumented internals of other packages in my package. If possible, I 
would rather prefer to rely on readVcf(). So, I would be extremely 
grateful if someone could either explain the situation or, in case there 
are bugs, fix them.

Thanks and best regards,
Ulrich


------------------------------------------------------------------------
*Dr. Ulrich Bodenhofer*
Associate Professor
Institute of Bioinformatics

*Johannes Kepler University*
Altenberger Str. 69
4040 Linz, Austria

Tel. +43 732 2468 4526
Fax +43 732 2468 4539
bodenhofer at bioinf.jku.at <mailto:bodenhofer at bioinf.jku.at>
http://www.bioinf.jku.at/ <http://www.bioinf.jku.at>



More information about the Bioconductor mailing list