[BioC] VCF class: different length when unlisting INFO CompressedCharacterList

Valerie Obenchain vobencha at fhcrc.org
Tue May 14 17:20:40 CEST 2013


Hi Francesco,

The expand,VCF-method was written for this purpose. Using expand() on a 
VCF will produce an object that is 'flattened' in the sense that the 
variant rows are repeated to match the unlisted ALT column. expand() 
will unlist ALT and any INFO or FORMAT variables that have one value per 
alternate allele, which is indicated by 'Number=A' in the header. For 
example,

##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
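
A minimal sketch of the VCF case follows. It uses the ex2.vcf example file that ships with VariantAnnotation rather than your data, and it assumes AF is declared with 'Number=A' in that file's header; the object name evcf is just illustrative. I have left out the printed output since it depends on the file and package version.

```r
library(VariantAnnotation)

## Example file shipped with the package (not the poster's data).
fl  <- system.file("extdata", "ex2.vcf", package = "VariantAnnotation")
vcf <- readVcf(fl, "hg19")

## Before: AF is list-like with one element per variant, so unlisting
## can give more values than there are rows.
nrow(vcf)
length(unlist(info(vcf)$AF))

## After: one row per ALT allele. Fields declared 'Number=A' are
## unlisted by expand() so they line up with the rows again.
evcf <- expand(vcf)
nrow(evcf)
length(info(evcf)$AF)
```

After expand(), each INFO column has exactly one entry per row of the expanded object, so values can be matched to variants by position.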


If you are working with a DataFrame, you can use expand() to specify 
exactly which columns you want 'flattened'.

 > DF <- DataFrame(one=IntegerList(1:3, 4, 5),
                   two=letters[1:3],
                   three=CharacterList("A", c("B", "C"), "D"))
 > expand(DF, colnames="three", keepEmptyRows=FALSE)
DataFrame with 4 rows and 3 columns
             one         two       three
   <IntegerList> <character> <character>
1         1,2,3           a           A
2             4           b           B
3             4           b           C
4             5           c           D
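
For intuition, the bookkeeping behind this flattening can be imitated in base R. This is only a sketch of the idea, not how expand() is implemented; the names n, row_id and flat are made up for illustration. The point is that unlist() alone drops the link between a value and its row, while repeating the row index by each element's length keeps it.

```r
## Base-R sketch (no Bioconductor needed): a list column in which the
## second row holds two values, like an INFO field at a multi-allelic site.
three <- list("A", c("B", "C"), "D")
two   <- letters[1:3]

## Number of values stored for each row.
n <- vapply(three, length, integer(1))

## Row index of every unlisted value; this is the mapping that is lost
## when only unlist() is called.
row_id <- rep(seq_along(three), n)

## Repeat the ordinary column to match the unlisted one.
flat <- data.frame(row   = row_id,
                   two   = two[row_id],
                   three = unlist(three),
                   stringsAsFactors = FALSE)
print(flat)
```

Here row_id is 1, 2, 2, 3, so "B" and "C" both point back to the second row. The same index can be used to look up which variant an unlisted INFO value came from.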

Details and examples are at:
?'VCF-class'  ## VCF method
?'expand'     ## DataFrame method

I think this is what you were after ... let me know if this doesn't 
answer your question.

Valerie



On 05/14/13 01:09, Francesco Lescai wrote:
> Hi all and Hi Valerie (I suppose),
> I was extracting a field of the INFO column from a VCF, but when I unlist it I get a different length compared to the number of variants, so I no longer know which value refers to which variant.
>
> Here's what I'm doing
>
>> vcf
> class: VCF
> dim: 50273 30
> genome: hg19
> exptData(1): header
> fixed(4): REF ALT QUAL FILTER
> info(28): AC AF ... culprit set
> geno(5): AD DP GQ GT PL
> rownames(50273):
> [.. cut for clarity ..]
>
> genotypes<-as.data.frame(geno(vcf)$GT)
> dim(genotypes)
> [1] 50273    30
>
> list.va<-info(vcf)$VA
>> length(info(vcf)$VA)
> [1] 50273
>
>> list.va
> CompressedCharacterList of length 50273
>
> info.va<-unlist(info(vcf)$VA)
>> length(info.va)
> [1] 53391
>
> This is an annotation from Variant Annotation Tool, which modifies the VCF Info.
> But if I do the same for other more "standard" fields, some of them have the same length as the number of variants when unlisted, and others don't
>
>> length(unlist(info(vcf)$HaplotypeScore))
> [1] 50273
>> length(unlist(info(vcf)$AC))
> [1] 50489
>> length(unlist(info(vcf)$AF))
> [1] 50489
>
> am I doing something wrong? or is the unlist method on the CompressedCharacterList splitting on some field delimiter?
>
> below my sessionInfo.
> thanks for any help you might provide,
> cheers,
> Francesco
>
>
>> sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] reshape_0.8.4            plyr_1.8                 ggbio_1.6.6              ggplot2_0.9.3.1          VariantAnnotation_1.4.12 Rsamtools_1.10.2
>   [7] Biostrings_2.26.3        GenomicRanges_1.10.7     IRanges_1.16.6           BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
>   [1] AnnotationDbi_1.20.7   Biobase_2.18.0         biomaRt_2.14.0         biovizBase_1.6.2       bitops_1.0-4.2         BSgenome_1.26.1        cluster_1.14.4
>   [8] colorspace_1.2-1       DBI_0.2-5              dichromat_2.0-0        digest_0.6.3           GenomicFeatures_1.10.2 grid_2.15.1            gridExtra_0.9.1
> [15] gtable_0.1.2           Hmisc_3.10-1           labeling_0.1           lattice_0.20-15        MASS_7.3-23            munsell_0.4            parallel_2.15.1
> [22] proto_0.3-10           RColorBrewer_1.0-5     RCurl_1.95-4.1         reshape2_1.2.2         RSQLite_0.11.2         rtracklayer_1.18.2     scales_0.2.3
> [29] stats4_2.15.1          stringr_0.6.2          tools_2.15.1           XML_3.96-1.1           zlibbioc_1.4.0