[BioC] VariantAnnotation: Specifying 'seqinfo' at import with 'readVcf'

Sun Nov 3 21:41:13 CET 2013

Hi Michael,

As a minimal example, I'll try to import the non-coding COSMIC VCF 
(ftp://ngs.sanger.ac.uk/production/cosmic/CosmicNonCodingVariants_v67_20131024.vcf.gz) 
with a NCBI 37 seqinfo object.

#+BEGIN_SRC R

 > seq_info
Seqinfo of length 25
seqnames seqlengths isCircular genome
1         249250621      FALSE GRCh37
2         243199373      FALSE GRCh37
3         198022430      FALSE GRCh37
4         191154276      FALSE GRCh37
5         180915260      FALSE GRCh37
...             ...        ...    ...
21         48129895      FALSE GRCh37
22         51304566      FALSE GRCh37
X         155270560      FALSE GRCh37
Y          59373566      FALSE GRCh37
MT            16569       TRUE GRCh37

 > x = readVcf("CosmicNonCodingVariants_v67_20131024.vcf.gz", seq_info)
Warning in .bcfHeaderAsSimpleList(header) :
   duplicate keys in header will be forced to unique rownames
Error in makeNewSeqnames(x, new2old = new2old, seqlevels(value)) :
   when 'new2old' is NULL, the first elements in the
   supplied 'seqlevels' must be identical to 'seqlevels(x)'

 > traceback()

8: stop("when 'new2old' is NULL, the first elements in the\n", " 
supplied 'seqlevels' must be identical to 'seqlevels(x)'")
7: makeNewSeqnames(x, new2old = new2old, seqlevels(value))
6: `seqinfo<-`(`*tmp*`, value = <S4 object of class "Seqinfo">)
5: `seqinfo<-`(`*tmp*`, value = <S4 object of class "Seqinfo">)
4: .scanVcfToVCF(scanVcf(file, param = param), file, genome, param)
3: .readVcf(file, genome, param = ScanVcfParam())
2: readVcf("CosmicNonCodingVariants_v67_20131024.vcf.gz", seq_info)
1: readVcf("CosmicNonCodingVariants_v67_20131024.vcf.gz", seq_info)

#+END_SRC

Please let me know if you need any more information.

Best wishes
Julian

On 11/03/2013 07:12 PM, Michael Lawrence wrote:
> This was my doing. I guess merge,Seqinfo,Seqinfo does not support
> different orderings? Would help to see the actual error message and
> stack trace. Any advice Herve?
>
> Michael
>
>
> On Sun, Nov 3, 2013 at 6:55 AM, Julian Gehring <julian.gehring at embl.de
> <mailto:julian.gehring at embl.de>> wrote:
>
>     Hi Valerie and Herve,
>
>     When providing a 'Seqinfo' object as the 'genome' argument to
>     'readVcf', would it be possible to not have the requirement of the
>     VCF header and the seqinfo object to have the same ordering?  As the
>     moment (VariantAnnotation_1.8.2), the import will fail if this is
>     not the case.  However, this requires the user to know the ordering
>     of the seqnames in the VCF file beforehand.  In the ideal case, the
>     readVcf function could reorder the seqnames in the way as provided
>     by the 'genome' argument.
>
>
>     Best wishes
>     Julian
>
>
>     On 09/28/2013 12:57 AM, Hervé Pagès wrote:
>
>         Hi Julian,
>
>         With the latest devel version of GenomicRanges (1.13.48, should
>         become
>         available via biocLite() in the next 24 hours or so), setting the
>         seqinfo of a big VCF object should be much faster but your mileage
>         may vary. Let us know if you still find it slow enough to justify a
>         mechanism for letting the user specify the seqinfo upfront when
>         using
>         readVcf() (e.g. the 'genome' arg could be renamed 'seqinfo' and
>         accept
>         everything it supports now plus a Seqinfo object).
>
>         Cheers,
>         H.
>
>         On 09/24/2013 10:00 AM, Julian Gehring wrote:
>
>             Hi Valerie,
>
>                 Thanks for clarifying. No, there is not currently a way
>                 to do this.
>
>                 The 'seqinfo' on the rowData(vcf) should not be
>                 difficult to change. Can
>                 you provide more detail as to (1) why you are changing
>                 it (did readVcf()
>                 import something incorrectly or ?) and (2) what
>                 operations on the
>                 'seqinfo' are taking a long time.
>
>
>             'readVcf' did everything in a correct way.  I needed to add the
>             information about the genome, circularity, and seqlengths
>             based on
>             external annotitation, since it is not part of the VCF file.
>
>             I can't tell you at the moment where 'seqinfo <-' spends
>             most of its
>             time.  I'll profile it and get back to you.
>
>             Thank your for your quick answer and your help!
>
>             Best wishes
>             Julian
>
>
>                 Thanks.
>                 Valerie
>
>
>                     Best wishes
>                     Julian
>
>
>                     On 09/24/2013 06:31 PM, Valerie Obenchain wrote:
>
>                         Hi Julian,
>
>                         On 09/24/2013 02:29 AM, Julian Gehring wrote:
>
>                             Hi,
>
>                             Is there a direct way to specifiy the
>                             'seqinfo' of a genome for the
>                             import of a VCF file using 'readVcf'?
>
>
>                         I think the question is how to read in a subset of
>                         chromosomes/positions
>                         from a vcf file without an accompanying tabix
>                         index. You can't.
>                         readVcf() requires an index when subsets are
>                         defined by
>                         chromosome/position. However you can read in
>                         subsets defined by INFO
>                         and/or GENO fields without an index.
>
>                         Approaches:
>                         (1) create index with ?indexTabix and specify
>                         'which' in ScanVcfParam
>                         (2) use ?filterVcf to write out a new file of
>                         records of interest
>
>                             I'm aware that one can change it
>                             with the 'seqinfo' method afterwards, but
>                             for large VCF files this
>                             can
>                             take a significant amount of time.
>
>
>                         What operation is taking along time? Subsetting
>                         the VCF object by
>                         chromosome?
>
>                         Valerie
>
>
>                             An alternative would be to sneak it in by
>                             the 'which' arguments, such
>                             as:
>
>                             readVcf(file, genome, ScanVcfParam(which =
>                             as(seq_info, "GRanges")))
>
>                             but this requires the file to be indexed
>                             beforehand.
>
>                             Best wishes
>                             Julian
>
>
>
>
>             _________________________________________________
>             Bioconductor mailing list
>             Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>             https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>             Search the archives:
>             http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>             <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>     _________________________________________________
>     Bioconductor mailing list
>     Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>     https://stat.ethz.ch/mailman/__listinfo/bioconductor
>     <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>     Search the archives:
>     http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>