[BioC] VariantAnnotation: Performance and memory issues in readVcf

Valerie Obenchain vobencha at fhcrc.org
Wed May 15 18:12:35 CEST 2013


Hi Ulrich,

On 05/15/2013 08:54 AM, Ulrich Bodenhofer wrote:
> Vincent,
>
> Thanks for your fast and enlightening reply! You are absolutely right
> that the key is to focus on the absolute minimum of necessary
> information, and I admit I should have tried that before posting the
> question. The results are the following:
>
>       > gr <- GRanges(seqnames="1", ranges=IRanges(start=100000,
>      end=1500000))
>       > sp <- ScanVcfParam(which=gr, info=NA, geno="GT", fixed="ALT")
>       >
>       > system.time(res4 <- readVcf(tf, "hg19", param=sp))
>          user  system elapsed
>         5.655   0.054   5.722
>      Warning messages:
>      1: In .bcfHeaderAsSimpleList(header) :
>         duplicate keys in header will be forced to unique rownames
>      2: In .bcfHeaderAsSimpleList(header) :
>         duplicate keys in header will be forced to unique rownames
>
> So the whole thing speeds up by a factor of 100, which is indeed
> dramatic. Of the 5.7 seconds, roughly 0.5 seconds are spent on reading
> and parsing the file into a large nested list, whereas 5.2 seconds are
> spent on converting the large nested list into an S4 object. I could
> argue about whether this ratio is adequate, but I feel I don't need to,
> as the 5.7 seconds appear feasible to me. In the other case, where
> readVcf() processes all data, the ratio is 0.5 to 539.5. I think this
> is, in any case, a weird ratio, no matter how complicated the
> information in a VCF file is, as all information is already contained in
> the list that is returned by the call to scanTabix() in .vcf_scan(), in
> a well-structured manner and mostly in the necessary data types.

INFO variables with multiple values are parsed into CompressedLists. 
FORMAT variables with multiple values are parsed into arrays. We thought 
these parsed classes would be more useful to the user for further 
exploration/analysis. If you only want the list form you can use 
scanVcf() instead of readVcf(). scanVcf() also takes a 'param' argument 
so you can subset on position, variables, etc.
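For instance, a minimal sketch of the list-form approach, reusing the file name, region, and ScanVcfParam settings quoted above (the exact output structure depends on the file's header):

```r
library(VariantAnnotation)

## Assumed inputs, taken from the session quoted above:
## a bgzipped, tabix-indexed VCF and a query region on chr1.
tf <- TabixFile("testFile100.vcf.gz")
gr <- GRanges(seqnames = "1",
              ranges = IRanges(start = 100000, end = 1500000))

## scanVcf() returns the parsed-but-unconverted nested list,
## skipping the expensive list-to-S4 conversion readVcf() performs.
sp <- ScanVcfParam(which = gr, info = NA, geno = "GT", fixed = "ALT")
res <- scanVcf(tf, param = sp)
str(res, max.level = 3)
```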

>
> Different question: I have little idea what .makeMessage() does. Is it
> possible that a large proportion of the time I reported is spent by
> throwing warnings? The warnings produced by my original readVcf() call
> are the same as above, but many more of them are thrown. I suppose they
> are due to the fact that my VCF file is the result of merging multiple
> single-sample VCF files, and the header of the merged file probably
> contains a lot of redundant information. I do not know why one variant
> throws two such warnings, while the other variant throws 50+ (probably
> many more). I tried suppressWarnings(), but that just made the warnings
> disappear, yet did not accelerate readVcf().

The warning from .bcfHeaderAsSimpleList() is thrown when retrieving the 
header information from the file. You can replicate the warning with this,

hdr <- scanVcfHeader(tf)

This is not an expensive operation and is performed twice during 
readVcf(). Throwing two warnings is overkill and I will fix that. The 
warning itself is saying you have duplicate variable names in the 
header. As you mentioned, this is probably because the file is the 
result of a merge.
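A quick way to see which header keys were duplicated, assuming the same `tf` as above, is to look at the row names of the header's INFO and FORMAT tables:

```r
library(VariantAnnotation)

tf <- TabixFile("testFile100.vcf.gz")

## scanVcfHeader() emits the same duplicate-keys warning;
## the forced-unique row names then show up in these tables.
hdr <- suppressWarnings(scanVcfHeader(tf))
rownames(info(hdr))   # INFO keys, made unique if duplicated
rownames(geno(hdr))   # FORMAT keys, made unique if duplicated
```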

You mention that one call gives you 2 warnings while the other gives 
50+. Are those 50+ warnings all the same?

Valerie


>
> Thanks and best regards,
> Ulrich
>
>
> On 05/15/2013 01:29 PM, Vincent Carey wrote:
>> While I agree that achievable performance improvements should be
>> sought on their own terms, I have a feeling that you will not get very
>> far without a bit more specification of what you want readVcf to
>> produce.  My sense is that you can get much better performance if you
>> set the ScanVcfParam and control the capture of INFO fields.  There
>> may be other fields that you do not need.  I'd propose that we take a
>> public vcf file like a 1000 genomes vcf (I happen to
>> have ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz,
>> maybe you can provide a better one) and define some target of the
>> import task.
>>
>>> tf =
>> TabixFile("ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz")
>>> mysvp = ScanVcfParam(which=GRanges("17", IRanges(1,100000)),
>> info=NA, geno="GT", fixed="ALT")
>>> unix.time(r1 <- readVcf(tf, "hg19", param=mysvp))
>>     user  system elapsed
>>    1.468   0.520   1.994
>> Warning messages:
>> 1: In .bcfHeaderAsSimpleList(header) :
>>    duplicate keys in header will be forced to unique rownames
>> 2: In .bcfHeaderAsSimpleList(header) :
>>    duplicate keys in header will be forced to unique rownames
>>> mysvp2 = ScanVcfParam(which=GRanges("17", IRanges(1,100000))) # no
>> info/geno control
>>> unix.time(r2 <- readVcf(tf, "hg19", param=mysvp2))
>>     user  system elapsed
>>   36.704   1.959  38.860
>>
>> have a look at r2
>>
>> class: CollapsedVCF
>> dim: 1680 1092
>> rowData(vcf):
>>    GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
>> info(vcf):
>>    DataFrame with 22 columns: LDAF, AVGPOST, RSQ, ERATE, THETA, CIEND,
>> CIPOS,...
>> info(header(vcf)):
>>               Number Type    Description
>>     LDAF      1      Float   MLE Allele Frequency Accounting for LD
>>     AVGPOST   1      Float   Average posterior probability from
>> MaCH/Thunder
>>     RSQ       1      Float   Genotype imputation quality from MaCH/Thunder
>>     ERATE     1      Float   Per-marker Mutation rate from MaCH/Thunder
>>     THETA     1      Float   Per-marker Transition rate from MaCH/Thunder
>>     CIEND     2      Integer Confidence interval around END for
>> imprecise va...
>>     CIPOS     2      Integer Confidence interval around POS for
>> imprecise va...
>>     END       1      Integer End position of the variant described in
>> this r...
>>     HOMLEN    .      Integer Length of base pair identical
>> micro-homology at...
>>     HOMSEQ    .      String  Sequence of base pair identical
>> micro-homology ...
>>     SVLEN     1      Integer Difference in length between REF and ALT
>> alleles
>>     SVTYPE    1      String  Type of structural variant
>>     AC        .      Integer Alternate Allele Count
>>     AN        1      Integer Total Allele Count
>>     AA        1      String  Ancestral Allele,
>> ftp://ftp.1000genomes.ebi.ac....
>>     AF        1      Float   Global Allele Frequency based on AC/AN
>>     AMR_AF    1      Float   Allele Frequency for samples from AMR
>> based on ...
>>     ASN_AF    1      Float   Allele Frequency for samples from ASN
>> based on ...
>>     AFR_AF    1      Float   Allele Frequency for samples from AFR
>> based on ...
>>     EUR_AF    1      Float   Allele Frequency for samples from EUR
>> based on ...
>>     VT        1      String  indicates what type of variant the line
>> represents
>>     SNPSOURCE .      String  indicates if a snp was called when
>> analysing th...
>> geno(vcf):
>>    SimpleList of length 3: GT, DS, GL
>> geno(header(vcf)):
>>        Number Type   Description
>>     GT 1      String Genotype
>>     DS 1      Float  Genotype dosage from MaCH/Thunder
>>     GL .      Float  Genotype Likelihoods
>>
>> there's a ton of information in there, and the volume is not
>> predictable on the basis of the knowledge that we are confronting
>> VCF. So the fact that readVcf can manage this is good and maybe worth
>> a performance hit. If setting the ScanVcfParam doesn't adequately
>> beat the approach of a drop-in parser with scanTabix, we should try to
>> build out that approach a bit. But it all depends on what you want to
>> get out.
>>
>> On Tue, May 14, 2013 at 11:59 AM, Ulrich Bodenhofer
>> <bodenhofer at bioinf.jku.at <mailto:bodenhofer at bioinf.jku.at>> wrote:
>>
>>      Hi,
>>
>>      I am currently working on a package that takes input from VCF
>>      files (possibly large ones) and I wanted to use readVcf() from the
>>      VariantAnnotation package for reading the data. However, I have
>>      run into severe performance and memory issues. The following test
>>      report is based on a bgzipped and tabix-indexed VCF file with 100
>>      samples/columns and approx. 1.9 million variants/rows. I have
>>      previously failed to load data of this size entirely using
>>      readVcf(), so I read the data in chunks. I always had the
>>      impression that this was quite slow, but now I compared it with
>>      tabix on the command line and scanTabix() from the Rsamtools
>>      packages and the results were stunning. Here are some code chunks.
>>      As you can see, I am trying to read a 1,400,001bp region from
>>      chr1. This region actually contains 8,757 variants/rows:
>>
>>          > tf <- TabixFile("testFile100.vcf.gz")
>>          > gr <- GRanges(seqnames="1", ranges=IRanges(start=100000,
>>         end=1500000))
>>          >
>>          > system.time(res1 <- readVcf(tf, "hg19", param=gr))
>>             user  system elapsed
>>         540.080   0.624 541.906
>>         There were 50 or more warnings (use warnings() to see the first 50)
>>
>>      The scanTabix() function from the Rsamtools package gives the
>>      following result:
>>
>>          > system.time(res2 <- scanTabix(tf, param=gr))
>>             user  system elapsed
>>            0.170   0.002   0.171
>>
>>      The tabix command line tool takes approximately the same amount of
>>      time. I admit that scanTabix() just reads the data and does no
>>      parsing or any other processing, so I dug deeper and saw that
>>      readVcf() calls scanTabix() inside the function .vcf_scan().
>>      Interestingly, this is done in a way that all the parsing is done
>>      inside the scanTabix() function by supplying a function that does
>>      the parsing through the undocumented tbxsym argument. So I tried
>>      the same approach:
>>
>>          > maps <- VariantAnnotation:::.vcf_scan_header_maps(tf,
>>         fixed=character(), info=NA,
>>         + geno="GT")
>>         Warning message:
>>         In .bcfHeaderAsSimpleList(header) :
>>            duplicate keys in header will be forced to unique rownames
>>          > tbxstate <- maps[c("samples", "fmap", "imap", "gmap")]
>>          > tbxsym <- getNativeSymbolInfo(".tabix_as_vcf",
>>      "VariantAnnotation")
>>          >
>>          > system.time(res3 <- scanTabix(tf, param=gr, tbxsym=tbxsym,
>>         tbxstate=tbxstate))
>>             user  system elapsed
>>            0.511   0.006   0.517
>>
>>      So, even if I include the same way of parsing VCF data as
>>      readVcf(), the total time is still in the range of half a second,
>>      which is still approx. 1000 times faster than calling readVcf().
>>      So where is all the performance lost? I can hardly imagine that
>>      539.5 of 540 seconds are spent on data transformations.
>>
>>      A similar situation can be observed in terms of memory
>>      consumption: after having loaded the VariantAnnotation package
>>      (and all packages it depends upon), my R process occupies about
>>      185MB main memory. Reading and parsing the data with scanTabix()
>>      increases the memory consumption by about 40MB, whereas calling
>>      readVcf() increases the memory consumption by more than 400MB. I
>>      do not think that such an amount of memory is needed to
>>      accommodate the output object res3; object.size() says it's about
>>      8MB, but I know that these figures need not be accurate.
>>
>>      Actually, I do not need the whole flexibility of readVcf(). If
>>      necessary, I would be satisfied with a workaround like the one
>>      based on scanTabix() above. However, I do not much like the idea
>>      of using undocumented internals of other packages in my own
>>      package. If possible, I would prefer to rely on readVcf().
>>      So, I would be extremely grateful if someone could either explain
>>      the situation or, in case there are bugs, fix them.
>>
>>      Thanks and best regards,
>>      Ulrich
>>
>>
>>      ------------------------------------------------------------------------
>>      *Dr. Ulrich Bodenhofer*
>>      Associate Professor
>>      Institute of Bioinformatics
>>
>>      *Johannes Kepler University*
>>      Altenberger Str. 69
>>      4040 Linz, Austria
>>
>>      Tel. +43 732 2468 4526
>>      Fax +43 732 2468 4539
>>      bodenhofer at bioinf.jku.at
>>      http://www.bioinf.jku.at/
>>
>>      _______________________________________________
>>      Bioconductor mailing list
>>      Bioconductor at r-project.org
>>      https://stat.ethz.ch/mailman/listinfo/bioconductor
>>      Search the archives:
>>      http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
>
>
>


