[BioC] readVcf - ALT allele as CompressedCharacterList instead of DNAStringSetList

Fri Oct 5 19:06:50 CEST 2012

We don't currently have a function that does this. Maybe this should be 
a feature request since files with a mixture of structural and 
non-structural variants are becoming more common. In the meantime you 
can use some internal functions in VariantAnnotation. This toy example 
isn't great because it only has one non-structural variant but the same 
idea would apply to a file with many.

fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
 > dim(vcf)
[1] 7 1
 > alt(vcf)
CompressedCharacterList of length 7
[[1]] C
[[2]] <DEL>
[[3]] <DEL>
[[4]] <DEL:ME:ALU>
[[5]] <INS:ME:L1>
[[6]] <DUP>
[[7]] <DUP:TANDEM>

## Identify the structural variants and remove them:

structural <- logical(length(alt(vcf)))
structural[grep("<", unlist(alt(vcf)), fixed=TRUE)] <- TRUE
vcf2 <- vcf[!structural, ]
 > alt(vcf2)
CompressedCharacterList of length 1
[[1]] C

## Convert the alleles to a DNAStringSetList using the internal function 
.toDNAStringSetList().

alt(vcf2) <- VariantAnnotation:::.toDNAStringSetList(unlist(alt(vcf2), 
use.names=FALSE))
 > alt(vcf2)
DNAStringSetList of length 1
[[1]] C

## Rename seqlevels and use the new vcf in predictCoding().

library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
newnames <- paste("chr", seqlevels(vcf2), sep="")
names(newnames) <- seqlevels(vcf2)
vcf2 <- renameSeqlevels(vcf2, newnames)
 > intersect(seqlevels(vcf2), seqlevels(txdb))
[1] "chr1" "chr2" "chr3" "chr4"

## Again, not an exciting example because this variant does not fall in 
a coding region so our result is empty.
predictCoding(vcf2, txdb, Hsapiens)
GRanges with 0 ranges and 1 metadata column:
    seqnames    ranges strand |      GENEID
<Rle> <IRanges> <Rle> | <character>
   ---
   seqlengths:

If you think this would be a worthy feature/function request let me know.

Valerie

On 10/04/2012 03:36 AM, Lescai, Francesco wrote:
> Oh yes,
> [66991] "<UNASSEMBLED_EVENT>"
> [71698] "<UNASSEMBLED_EVENT>"
> [98080] "<UNASSEMBLED_EVENT>"
>
> what can I do in these cases to translate the variants? maybe excluding SVs and reverting the ALT to a DNAStringSet?
> in the next future will will have more and more SVs encoded into VCFs because that's one of the most important features of the new HaplotypeCaller within GATK.
>
> thanks very much,
> Francesco
>
>
> On 3 Oct 2012, at 17:54, Valerie Obenchain<vobencha at fhcrc.org<mailto:vobencha at fhcrc.org>>  wrote:
>
> Hi Francesco,
>
> Yes, the DNAStringSetList is intended to handle multiple alternate alleles. ALT should only be made a CompressedList when structural variants are present.
>
> Try the following with your file,
>
> debug(VariantAnnotation:::.formatALT)
> fl<- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> readVcf(fl, "hg19")
> debug: {
>     if (is.null(x))
>         return(NULL)
>     structural<- grep("<", x, fixed = TRUE)
>     if (!identical(integer(0), structural))
>         seqsplit(x, seq_len(length(x)))
>     else .toDNAStringSetList(x)
> }
>
> Here 'x' is the alternate allele values,
>
> Browse[2]>  x
> [1] "A"      "A"      "G,T"    "."      "G,GTCT"
>
> My guess is that if you inspect these you'll see at least one structural variant in there.
>
> Valerie
>
> On 10/03/2012 07:23 AM, Lescai, Francesco wrote:
> Hi there,
> I was importing a new VCF file generated with the GATK v2 HaplotypeCaller, to compare several parameters with a previous one on the same samples generated with UnifiedGenotyper.
>
> "thanks" to an error in predictCoding, I discovered the new VCF file formatted the ALT allele field as a CompressedCharacterList instead of a DNAStringSetList.
>
> see here
>
> head(fixed(haplo),n=2L)
> GRanges with 2 ranges and 5 metadata columns:
>                seqnames           ranges strand | paramRangeID            REF
>                   <Rle>          <IRanges>    <Rle>   |<factor>   <DNAStringSet>
>    rs139570490        1 [865738, 865738]      * |<NA>                A
>      rs4372192        1 [876499, 876499]      * |<NA>                A
>                                      ALT      QUAL      FILTER
>                <CompressedCharacterList>   <numeric>   <character>
>    rs139570490                         G   4280.02        PASS
>      rs4372192                         G  11733.02        PASS
>    ---
>    seqlengths:
>      1 10 11 12 13 14 15 16 17 18 19  2 20 21 22  3  4  5  6  7  8  9  X  Y
>     NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
>
> head(fixed(uni),n=2L)
> GRanges with 2 ranges and 5 metadata columns:
>              seqnames           ranges strand | paramRangeID            REF
>                 <Rle>          <IRanges>    <Rle>   |<factor>   <DNAStringSet>
>     1:865738        1 [865738, 865738]      * |<NA>                A
>    rs4372192        1 [876499, 876499]      * |<NA>                A
>                             ALT      QUAL      FILTER
>              <DNAStringSetList>   <numeric>   <character>
>     1:865738           ########   5055.97        PASS
>    rs4372192           ########  13526.97        PASS
>    ---
>    seqlengths:
>      1 10 11 12 13 14 15 16 17 18 19  2 20 21 22  3  4  5  6  7  8  9  X  Y
>     NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
>
> The only very important difference I could find between the two files, is that in the one generated with UnifiedGenotyper there is one single allele in the ALT field, while in the new VCF file, there are several variants with multiple alternative alleles
>
> [me at wilder test]$ grep -v "##" union.filtered.vcf | cut -f 1,2,3,4,5 | grep ","
> [me at wilder test]$ grep -v "##" haploC.filtered.both.vcf | cut -f 1,2,3,4,5 | grep ","
> 1 1247578 . T G,TGG
> 1 1920434 rs144487103 TAA TA,TAAA
> 1 2303347 rs60972860 CAA CA,CAAA
> 1 3753032 rs36051675 GTT GTTTTT,GTTTTTTTT
> 1 6291918 rs148639379 TA TAA,T
> 1 7913184 rs34962249 CAAAA CAAA,CAAAAA
> 1 10357206 rs58344165 ATTT ATT,ATTTT,ATTTTT
> 1 14057425 rs35643856 GA GAA,G
> 1 15654737 rs71000392 CT CTT,C
> [...cut...]
>
> but I thought a DNAStringSetList was meant to accommodate precisely the case where you have multiple alternative alleles, wasn't it?
>
> thanks for your help,
> Francesco
>
>
>
> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
>   [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
>   [5] LC_MONETARY=en_US.iso885915    LC_MESSAGES=en_US.iso885915
>   [7] LC_PAPER=C                     LC_NAME=C
>   [9] LC_ADDRESS=C                   LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] VariantAnnotation_1.3.32 Rsamtools_1.9.31         Biostrings_2.25.12
> [4] GenomicRanges_1.9.66     IRanges_1.15.48          BiocGenerics_0.3.2
>
> loaded via a namespace (and not attached):
>   [1] AnnotationDbi_1.19.46  Biobase_2.17.8         biomaRt_2.13.2
>   [4] bitops_1.0-4.1         BSgenome_1.25.9        colorspace_1.1-1
>   [7] DBI_0.2-5              dichromat_1.2-4        digest_0.5.2
> [10] GenomicFeatures_1.9.44 ggplot2_0.9.2.1        grid_2.15.0
> [13] gtable_0.1.1           labeling_0.1           MASS_7.3-21
> [16] memoise_0.1            munsell_0.4            parallel_2.15.0
> [19] plyr_1.7.1             proto_0.3-9.2          RColorBrewer_1.0-5
> [22] RCurl_1.91-1           reshape2_1.2.1         RSQLite_0.11.2
> [25] rtracklayer_1.17.21    scales_0.2.2           stats4_2.15.0
> [28] stringr_0.6.1          tools_2.15.0           XML_3.95-0.1
> [31] zlibbioc_1.3.0
>
>
> ---------------------------------------------------------------------------------
> Francesco Lescai, PhD, EDBT
> Senior Research Associate in Genome Analysis
> University College London
> Faculty of Population Health Sciences
> Dept. Genes, Development&   Disease
> ICH - Molecular Medicine Unit, GOSgene team
> 30 Guilford Street
> WC1N 1EH London UK
>
> email: f.lescai at ucl.ac.uk<mailto:f.lescai at ucl.ac.uk><mailto:f.lescai at ucl.ac.uk>
> phone: +44.(0)207.905.2274
> [ext: 2274]
> --------------------------------------------------------------------------------
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
> ---------------------------------------------------------------------------------
> Francesco Lescai, PhD, EDBT
> Senior Research Associate in Genome Analysis
> University College London
> Faculty of Population Health Sciences
> Dept. Genes, Development&  Disease
> ICH - Molecular Medicine Unit, GOSgene team
> 30 Guilford Street
> WC1N 1EH London UK
>
> email: f.lescai at ucl.ac.uk<mailto:f.lescai at ucl.ac.uk>
> phone: +44.(0)207.905.2274
> [ext: 2274]
> --------------------------------------------------------------------------------
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor