[BioC] VariantAnnotation: Specifying 'seqinfo' at import with 'readVcf'

Hervé Pagès hpages at fhcrc.org
Sat Sep 28 00:57:26 CEST 2013


Hi Julian,

With the latest devel version of GenomicRanges (1.13.48, should become
available via biocLite() in the next 24 hours or so), setting the
seqinfo of a big VCF object should be much faster but your mileage
may vary. Let us know if you still find it slow enough to justify a
mechanism for letting the user specify the seqinfo upfront when using
readVcf() (e.g. the 'genome' arg could be renamed 'seqinfo' and accept
everything it supports now plus a Seqinfo object).

Cheers,
H.

On 09/24/2013 10:00 AM, Julian Gehring wrote:
> Hi Valerie,
>
>> Thanks for clarifying. No, there is not currently a way to do this.
>>
>> The 'seqinfo' on the rowData(vcf) should not be difficult to change. Can
>> you provide more detail as to (1) why you are changing it (did readVcf()
>> import something incorrectly or ?) and (2) what operations on the
>> 'seqinfo' are taking a long time.
>
> 'readVcf' did everything in a correct way.  I needed to add the
> information about the genome, circularity, and seqlengths based on
> external annotitation, since it is not part of the VCF file.
>
> I can't tell you at the moment where 'seqinfo <-' spends most of its
> time.  I'll profile it and get back to you.
>
> Thank your for your quick answer and your help!
>
> Best wishes
> Julian
>
>
>> Thanks.
>> Valerie
>>
>>>
>>> Best wishes
>>> Julian
>>>
>>>
>>> On 09/24/2013 06:31 PM, Valerie Obenchain wrote:
>>>> Hi Julian,
>>>>
>>>> On 09/24/2013 02:29 AM, Julian Gehring wrote:
>>>>> Hi,
>>>>>
>>>>> Is there a direct way to specifiy the 'seqinfo' of a genome for the
>>>>> import of a VCF file using 'readVcf'?
>>>>
>>>> I think the question is how to read in a subset of
>>>> chromosomes/positions
>>>> from a vcf file without an accompanying tabix index. You can't.
>>>> readVcf() requires an index when subsets are defined by
>>>> chromosome/position. However you can read in subsets defined by INFO
>>>> and/or GENO fields without an index.
>>>>
>>>> Approaches:
>>>> (1) create index with ?indexTabix and specify 'which' in ScanVcfParam
>>>> (2) use ?filterVcf to write out a new file of records of interest
>>>>
>>>>> I'm aware that one can change it
>>>>> with the 'seqinfo' method afterwards, but for large VCF files this can
>>>>> take a significant amount of time.
>>>>
>>>> What operation is taking along time? Subsetting the VCF object by
>>>> chromosome?
>>>>
>>>> Valerie
>>>>
>>>>>
>>>>> An alternative would be to sneak it in by the 'which' arguments, such
>>>>> as:
>>>>>
>>>>> readVcf(file, genome, ScanVcfParam(which = as(seq_info, "GRanges")))
>>>>>
>>>>> but this requires the file to be indexed beforehand.
>>>>>
>>>>> Best wishes
>>>>> Julian
>>>>>
>>>>
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list