[BioC] VariantAnnotation: Specifying 'seqinfo' at import with 'readVcf'

Tue Oct 1 17:39:06 CEST 2013

Hi,

On 09/30/2013 04:53 AM, Julian Gehring wrote:
> Hi Valerie and Herve,
>
> Your latest changes improve the speed of this greatly, thanks for your
> work!
>
> Thinking more generally, do you think it would be worth generalizing the
> readVcf in the way you described?  I'm not just thinking about speed
> here, but also about a more flexible way of interacting with the
> functions (given that the user has to specify the genome name anyway).

Yes. I'll aim to implement this before the next release.

Valerie

>
> Best wishes
> Julian
>
>
> On 09/28/2013 12:57 AM, Hervé Pagès wrote:
>> Hi Julian,
>>
>> With the latest devel version of GenomicRanges (1.13.48, should become
>> available via biocLite() in the next 24 hours or so), setting the
>> seqinfo of a big VCF object should be much faster but your mileage
>> may vary. Let us know if you still find it slow enough to justify a
>> mechanism for letting the user specify the seqinfo upfront when using
>> readVcf() (e.g. the 'genome' arg could be renamed 'seqinfo' and accept
>> everything it supports now plus a Seqinfo object).
>>
>> Cheers,
>> H.
>>
>> On 09/24/2013 10:00 AM, Julian Gehring wrote:
>>> Hi Valerie,
>>>
>>>> Thanks for clarifying. No, there is not currently a way to do this.
>>>>
>>>> The 'seqinfo' on the rowData(vcf) should not be difficult to change.
>>>> Can
>>>> you provide more detail as to (1) why you are changing it (did
>>>> readVcf()
>>>> import something incorrectly or ?) and (2) what operations on the
>>>> 'seqinfo' are taking a long time.
>>>
>>> 'readVcf' did everything in a correct way.  I needed to add the
>>> information about the genome, circularity, and seqlengths based on
>>> external annotitation, since it is not part of the VCF file.
>>>
>>> I can't tell you at the moment where 'seqinfo <-' spends most of its
>>> time.  I'll profile it and get back to you.
>>>
>>> Thank your for your quick answer and your help!
>>>
>>> Best wishes
>>> Julian
>>>
>>>
>>>> Thanks.
>>>> Valerie
>>>>
>>>>>
>>>>> Best wishes
>>>>> Julian
>>>>>
>>>>>
>>>>> On 09/24/2013 06:31 PM, Valerie Obenchain wrote:
>>>>>> Hi Julian,
>>>>>>
>>>>>> On 09/24/2013 02:29 AM, Julian Gehring wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Is there a direct way to specifiy the 'seqinfo' of a genome for the
>>>>>>> import of a VCF file using 'readVcf'?
>>>>>>
>>>>>> I think the question is how to read in a subset of
>>>>>> chromosomes/positions
>>>>>> from a vcf file without an accompanying tabix index. You can't.
>>>>>> readVcf() requires an index when subsets are defined by
>>>>>> chromosome/position. However you can read in subsets defined by INFO
>>>>>> and/or GENO fields without an index.
>>>>>>
>>>>>> Approaches:
>>>>>> (1) create index with ?indexTabix and specify 'which' in ScanVcfParam
>>>>>> (2) use ?filterVcf to write out a new file of records of interest
>>>>>>
>>>>>>> I'm aware that one can change it
>>>>>>> with the 'seqinfo' method afterwards, but for large VCF files this
>>>>>>> can
>>>>>>> take a significant amount of time.
>>>>>>
>>>>>> What operation is taking along time? Subsetting the VCF object by
>>>>>> chromosome?
>>>>>>
>>>>>> Valerie
>>>>>>
>>>>>>>
>>>>>>> An alternative would be to sneak it in by the 'which' arguments,
>>>>>>> such
>>>>>>> as:
>>>>>>>
>>>>>>> readVcf(file, genome, ScanVcfParam(which = as(seq_info, "GRanges")))
>>>>>>>
>>>>>>> but this requires the file to be indexed beforehand.
>>>>>>>
>>>>>>> Best wishes
>>>>>>> Julian
>>>>>>>
>>>>>>
>>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>