[BioC] writeVcf bug

Richard Pearson rpearson at well.ox.ac.uk
Fri Apr 19 10:59:49 CEST 2013


Hi Valerie

That's great, thanks. Will check out the new code when available.

Sorry, I worded this poorly. I simply meant that the problem I had 
previously seen with the example code ("Error in row(genoMat)") had gone 
away, but I had noticed a new problem with this same code. There are no 
other problems that I am aware of that haven't been addressed.

Best wishes

Richard


On 18/04/2013 23:03, Valerie Obenchain wrote:
> Hi Richard,
>
> Thanks for reporting this. I've gone back and completely rewritten the 
> code that parses/packages the geno data for writeVcf(). I didn't do a 
> good job with this the first time around and it's now much cleaner and 
> streamlined. Patched versions are 1.6.3 and 1.7.8 and should be 
> available via biocLite() tomorrow (Friday) after ~9am PST.
>
> You mentioned you're 'still' having problems with writeVcf(). Are 
> there other problems that haven't been addressed (fixed) yet?
>
> Valerie
>
> On 04/18/2013 03:55 AM, Richard Pearson wrote:
>> Hi
>>
>> I've recently upgraded to R 3.0.0, but am still having a (now slightly
>> different) problem with writeVcf. When there is only one element in
>> geno(vcf), each variant gets a separate row for each sample. For
>> example, in the following code, out1.vcf has 5 variants, each with
>> genotypes for 3 samples, but out2.vcf has 15 variants, each with
>> genotypes for just one sample.
>>
>>      library(VariantAnnotation)
>>      fl <- system.file("extdata", "ex2.vcf", 
>> package="VariantAnnotation")
>>      in1 <- readVcf(fl, "hg19")
>>      writeVcf(in1, "out1.vcf")
>>      in2 <- readVcf(fl, "hg19", ScanVcfParam(geno="GT"))
>>      writeVcf(in2, "out2.vcf")
>>
>>  > sessionInfo()
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>   [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C LC_TIME=en_GB.UTF-8
>> LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
>> LC_PAPER=C                 LC_NAME=C
>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets 
>> methods base
>>
>> other attached packages:
>> [1] VariantAnnotation_1.6.2 Rsamtools_1.12.1 Biostrings_2.28.0
>> GenomicRanges_1.12.2 IRanges_1.18.0          BiocGenerics_0.6.0
>> devtools_1.2            BiocInstaller_1.10.0
>>
>> loaded via a namespace (and not attached):
>>   [1] AnnotationDbi_1.22.1   Biobase_2.20.0 biomaRt_2.16.0
>> bitops_1.0-5           BSgenome_1.28.0 DBI_0.2-5 digest_0.6.3
>> evaluate_0.4.3 GenomicFeatures_1.12.0 httr_0.2
>> [11] memoise_0.1            RCurl_1.95-4.1 RSQLite_0.11.3
>> rtracklayer_1.20.1     stats4_3.0.0 stringr_0.6.2 tools_3.0.0
>> whisker_0.1 XML_3.96-1.1           zlibbioc_1.6.0
>>
>> Cheers
>>
>> Richard
>>
>>
>> On 29/11/2012 17:36, Richard Pearson wrote:
>>> Hi Michael
>>>
>>> My devel sessionInfo is below. It seems biocLite has given me version
>>> 1.5.17of the package, but I see that .makeVcfGenolooks different in the
>>> latest from SVN (1.5.19) - so I'm sure you're right and I just need to
>>> wait for things to filter through. Apologies for the noise.
>>>
>>> Every time I upgrade VariantAnnotation it is an even better package 
>>> than
>>> before - many thanks for all the hard work!
>>>
>>> Richard
>>>
>>>
>>>   > sessionInfo()
>>> R Under development (unstable) (2012-11-27 r61172)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>    [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C LC_TIME=en_GB.UTF-8
>>> LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8
>>> LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C                 LC_NAME=C
>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils datasets methods
>>> base
>>>
>>> other attached packages:
>>> [1] VariantAnnotation_1.5.17 Rsamtools_1.11.11 Biostrings_2.27.7
>>> GenomicRanges_1.11.15 IRanges_1.17.15 BiocGenerics_0.5.1
>>> devtools_0.8             BiocInstaller_1.9.4
>>>
>>> loaded via a namespace (and not attached):
>>>    [1] AnnotationDbi_1.21.7   Biobase_2.19.1 biomaRt_2.15.0
>>> bitops_1.0-5 BSgenome_1.27.1        DBI_0.2-5 digest_0.6.0
>>> evaluate_0.4.2 GenomicFeatures_1.11.5 httr_0.2
>>> [11] memoise_0.1            plyr_1.7.1 RCurl_1.95-3
>>> RSQLite_0.11.2 rtracklayer_1.19.6     stats4_2.16.0
>>> stringr_0.6.1          tools_2.16.0 whisker_0.1 XML_3.95-0.1
>>> [21] zlibbioc_1.5.0
>>>
>>>
>>> On 29/11/2012 17:22, Michael Lawrence wrote:
>>>> I am pretty sure this bug has been fixed in devel. Would you mind
>>>> pasting your devel session info?
>>>>
>>>> Thanks,
>>>> Michael
>>>>
>>>>
>>>> On Thu, Nov 29, 2012 at 8:44 AM, Richard Pearson
>>>> <rpearson at well.ox.ac.uk <mailto:rpearson at well.ox.ac.uk>> wrote:
>>>>
>>>>      Hi
>>>>
>>>>      I've noticed a bug with writeVcf when trying to write out
>>>>      relatively simple vcf files, e.g. a file containing only GT in 
>>>> the
>>>>      FORMAT fields. The following is a (hopefully) reproducible 
>>>> example
>>>>      based on examples in ?writeVcf
>>>>
>>>>      library(VariantAnnotation)
>>>>      fl <- system.file("extdata", "ex2.vcf",
>>>> package="VariantAnnotation")
>>>>      out1.vcf <- tempfile()
>>>>      out2.vcf <- tempfile()
>>>>      in1 <- readVcf(fl, "hg19")
>>>>      writeVcf(in1, out1.vcf)
>>>>      in2 <- readVcf(fl, "hg19", ScanVcfParam(geno="GT"))
>>>>      writeVcf(in2, out2.vcf)
>>>>
>>>>      This last line gives me:
>>>>
>>>>      Error in row(genoMat) :
>>>>        a matrix-like object is required as argument to 'row'
>>>>
>>>>      Looking at the following code snippet from .makeVcfGeno, it seems
>>>>      genoMat is first getting created as a matrix, but then changed to
>>>>      something that is not a matrix by "genoMat <- genoMatFlat". The
>>>>      error then comes from the "row(genoMat)" on the last line 
>>>> shown. I
>>>>      don't really understand what is going on in this code so can't
>>>>      offer a patch I'm afraid.
>>>>
>>>>          genoMat <- matrix(unlist(as.list(geno), use.names = FALSE,
>>>>              recursive = FALSE), nsub * nrec, length(geno))
>>>>          genoMatFlat <- as.character(unlist(genoMat))
>>>>          genoMatFlat[is.na <http://is.na>(genoMatFlat)] <- "."
>>>>          if (is.list(genoMat)) {
>>>>              genoMatList <- relist(genoMatFlat,
>>>> PartitioningByEnd(genoMat))
>>>>              genoMatFlat <- .pasteCollapse(genoMatList, ",")
>>>>              genoMat <- matrix(genoMatFlat, nrow(genoMat),
>>>> ncol(genoMat))
>>>>          }
>>>>          else genoMat <- genoMatFlat
>>>>          formatMatPerSub <- matrix(rep(t(formatMat), nsub), nsub *
>>>>              nrec, length(geno), byrow = TRUE)
>>>>          keep <- !is.na <http://is.na>(formatMatPerSub)
>>>>          genoListBySub <- seqsplit(genoMat[keep], row(genoMat)[keep])
>>>>
>>>>      FWIW, as a temporary fix I made an extra, slightly more complex
>>>>      FORMAT field using the following:
>>>>
>>>>      geno(in2)[["DUMMY"]] <-
>>>>      matrix(sapply(as.vector(geno(in2)[["GT"]]), function(x) list(c(x,
>>>>      x))), ncol=ncol(geno(in2)[["GT"]]),
>>>>      dimnames=dimnames(geno(in2)[["GT"]]))
>>>>      writeVcf(in2, out2.vcf)
>>>>
>>>>      I've checked and it seems this bug is also present in latest
>>>>      devel. Any chance of a fix?
>>>>
>>>>      Thanks
>>>>
>>>>      Richard
>>>>
>>>>      > sessionInfo()
>>>>      R version 2.15.1 (2012-06-22)
>>>>      Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>>      locale:
>>>>       [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C LC_TIME=en_GB.UTF-8
>>>>           LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8
>>>>       LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C
>>>>       [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>      LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>>      attached base packages:
>>>>      [1] stats     graphics  grDevices utils     datasets methods base
>>>>
>>>>      other attached packages:
>>>>      [1] VariantAnnotation_1.4.5 Rsamtools_1.10.2 Biostrings_2.26.2
>>>>        GenomicRanges_1.10.5 IRanges_1.16.4 BiocGenerics_0.4.0
>>>>      devtools_0.8  BiocInstaller_1.8.3
>>>>
>>>>      loaded via a namespace (and not attached):
>>>>       [1] AnnotationDbi_1.20.3   Biobase_2.18.0 biomaRt_2.14.0
>>>>      bitops_1.0-5 BSgenome_1.26.1        DBI_0.2-5 digest_0.6.0
>>>>        evaluate_0.4.2 GenomicFeatures_1.10.1 httr_0.2
>>>>      [11] memoise_0.1            parallel_2.15.1 plyr_1.7.1
>>>>      RCurl_1.95-3 RSQLite_0.11.2         rtracklayer_1.18.1
>>>>      stats4_2.15.1          stringr_0.6.1 tools_2.15.1 whisker_0.1
>>>>      [21] XML_3.95-0.1           zlibbioc_1.4.0
>>>>
>>>>      _______________________________________________
>>>>      Bioconductor mailing list
>>>>      Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>>>      https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>      Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>
>>>     [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>



More information about the Bioconductor mailing list