[BioC] Problem reading VCF file using readVcf from package VariantAnnotation

Ulrich Bodenhofer bodenhofer at bioinf.jku.at
Thu Apr 25 10:56:11 CEST 2013


Hi Valerie,

First of all, thanks a lot for your very quick and helpful reply! Your 
hint to the function from GenomicRanges helped me to find the problem. 
So, finally, I can safely say that there is no problem in readVcf that 
caused this issue. The problem was that I supplied a single character 
string "b37" as genome argument to readVcf(). I did not pay attention to 
whether this string has a name attribute (which was, unfortunately, the 
case). Obviously, readVcf checks whether the names of the 'genome' 
argument match the sequence names in the VCF file. Knowing that, the 
error message makes perfect sense to me! I now stripped off the names 
from the string I am passing as 'genome' argument, and it works great! 
Interestingly, however, if I pass a whole vector to the 'genome' 
argument, the error persists even if the sequence names match perfectly. 
Anyway, I don't care, my solution to pass a single string without a name 
is absolutely acceptable for me.

Thanks again and best regards,
Ulrich

On 04/24/2013 06:06 PM, Valerie Obenchain wrote:
> Hi Ulrich,
>
> In addition to reading genomic regions with 'which' in ScanVcfParam(), 
> you can also use 'yieldSize' in TabixFile() to iterate through a VCF 
> file in chunks. See example at the bottom of ?readVcf man page or 
> ?TabixFile.
>
> The error you are seeing is thrown from a function in GenomicRanges, 
> specifically .normargSeqlengths(). This function is called when 
> checking compatibility of the Seqinfo object(s).
>
> How was the GRanges 'param' created? Did it come from a different 
> genome? I'm not clear on why the GRanges has potentially different 
> chromosome names, lengths and genome? If the VCF file is 'b37', all 
> data (names, lengths, genome if present) in the GRanges must be 
> compatible with that.
>
> Please provide a sample of the code you've tried and the output error. 
> Specifically, it would be helpful to see seqinfo(GRanges) and 
> meta(header(vcf)).
>
> Valerie
>
> On 04/24/2013 04:53 AM, Ulrich Bodenhofer wrote:
>> Hi,
>>
>> I am trying to read genotype data from a large VCF file using the
>> readVcf() function from the VariantAnnotation package. I am not reading
>> the entire file (which would crash my R session because of a lack of
>> memory). Instead, I am reading bunches of SNV data located in 200kbp
>> regions which I specify by passing a GRanges object to ScanVcfParams()
>> first. No matter what I do, I get the following error message:
>>
>>     when the supplied 'genome' vector is named, the names must match the
>> seqnames
>>
>> As far as I can make sense of this message, it seems that there is some
>> mismatch between the genome characteristics in my GRanges object and the
>> genome characteristics in the VCF file. I dissected the R object
>> returned by scanVcfHeader() and indeed found some interesting
>> mismatches: The genome in the VCF file is denoted as "b37" and the
>> sequence names are not 100% compatible with hg19. The lengths of
>> chromosomes 1-22, X, and Y do match, but the lengths of mitochondrial
>> DNA (denoted "M" in gh19 and "MT" in b37) differ by 2. So I forced my
>> GRanges object to be 100% compatible with the information stored in the
>> VCF file (by copying seqlevels, genome, and seqlengths) and restricted
>> my analysis to chromosomes 1-22 and X. However, I still get the same
>> error message.
>>
>> I also tried to locate the error message in the source code of the
>> VariantAnnotation package to understand better what the problem is, but
>> I could not find it. It seems the message is produced by a function that
>> VariantAnnotation calls from another package.
>>
>> Any idea?
>>
>> Thanks in advance and best regards,
>> Ulrich



More information about the Bioconductor mailing list