[BioC] VCF class: different length when unlisting INFO CompressedCharacterList

Martin Morgan mtmorgan at fhcrc.org
Tue May 14 19:17:29 CEST 2013


Hi Francesco -- to add to what Valerie said...

On 05/14/2013 08:20 AM, Valerie Obenchain wrote:
> Hi Francesco,
>
> The expand,VCF-method was written for this purpose. Using expand() on a VCF will

I'm not sure, but I think expand,VCF-method was introduced more recently than 
the version of R / Bioconductor you have installed; maybe it's time to update?

...

> produce an object that is 'flattened' in the sense that the variant rows are
> repeated to match the unlisted ALT column. expand() will unlist ALT and any INFO
> or FORMAT variables that have one value per alternate allele which is indicated
> by 'Number=A' in the header. For example,
>
> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
>
>
> If you are working with a DataFrame, you can use expand() to specify exactly
> which columns you want 'flattened'.
>
>  > DF <- DataFrame(one=IntegerList(1:3, 4, 5),
>                    two=letters[1:3],
>                    three=CharacterList("A", c("B", "C"), "D"))
>  > expand(DF, colnames="three", keepEmptyRows=FALSE)
> DataFrame with 4 rows and 3 columns
>              one         two       three
>    <IntegerList> <character> <character>
> 1         1,2,3           a           A
> 2             4           b           B
> 3             4           b           C
> 4             5           c           D
>
> Details and examples are at,
> ?'VCF-class'  ## VCF method
> ?'expand'     ## DataFrame method
>
> I think this is what you were after ... let me know if this doesn't answer your
> question.
>
> Valerie
>
>
>
> On 05/14/13 01:09, Francesco Lescai wrote:
>> Hi all and Hi Valerie (I suppose),
>> I was extracting a field of the INFO column from a VCF, but when I unlist it I
>> get a different length compared the number of variants, so I don't know
>> anymore which refers to each variant.
>>
>> Here's what I'm doing
>>
>>> vcf
>> class: VCF
>> dim: 50273 30
>> genome: hg19
>> exptData(1): header
>> fixed(4): REF ALT QUAL FILTER
>> info(28): AC AF ... culprit set
>> geno(5): AD DP GQ GT PL
>> rownames(50273):
>> [.. cut for clarity ..]
>>
>> genotypes<-as.data.frame(geno(vcf)$GT)

Hmm, not sure that this is a good idea -- geno(vcf)$GT is a matrix, operations 
on a matrix can be much more efficient than on a data.frame, and the matrix 
conveys valuable information about data types, in particular that all elements 
are the same class.

>> dim(genotypes)
>> [1] 50273    30
>>
>> list.va<-info(vcf)$VA
>>> length(info(vcf)$VA)
>> [1] 50273
>>
>>> list.va
>> CompressedCharacterList of length 50273
>>
>> info.va<-unlist(info(vcf)$VA)
>>> length(info.va)
>> [1] 53391

it's worth understanding why this column is represented as a CharacterList (the 
listed form) rather than a simple character vector (the unlisted form). 
According to the INFO line from the VCF header Val mentions, some members of 
this field have more than one entry, specifically

   info(vcf)$VA[elementLengths(info(vcf)$VA != 1]

Here's a simple example

 > x = CharacterList(list(letters[1:2], LETTERS[1:3], character(0)))
 > x
CharacterList of length 2
[[1]] a b
[[2]] A B C
[[3]] character(0)
 > unlist(x)
[1] "a" "b" "A" "B" "C"

I guess you are perhaps hoping for

 > sapply(x, paste, collapse=",")
[1] "a,b"   "A,B,C" ""

but in some ways this isn't so convenient as it first appears -- now you'll have 
to split() these back out to work on them, and you've translated 'no data' 
(character(0)) to 'no characters' (""). One pattern that simplifies working with 
these CharacterList and similar objects is that they can be unlist'ed and 
relist'ed cheaply, so long as the total number of elements and their order don't 
change.

 >   y = toupper(unlist(x))
 >   relist(y, x)
CharacterList of length 2
[[1]] A B
[[2]] A B C
[[3]] character(0)

(actually, toupper(x) works without unlisting so this isn't a perfect example ...)

Martin

>>
>> This is an annotation from Variant Annotation Tool, which modifies the VCF Info.
>> But if I do the same for other more "standard" fields, some of them have the
>> same length of the variants, others don't when unlisted
>>
>>> length(unlist(info(vcf)$HaplotypeScore))
>> [1] 50273
>>> length(unlist(info(vcf)$AC))
>> [1] 50489
>>> length(unlist(info(vcf)$AF))
>> [1] 50489
>>
>> am I doing something wrong? or is the unlist method on the
>> CompressedCharacterList splitting on some field delimiter?
>>
>> below my sessionInfo.
>> thanks for any help you might provide,
>> cheers,
>> Francesco
>>
>>
>>> sessionInfo()
>> R version 2.15.1 (2012-06-22)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>> locale:
>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>>   [1] reshape_0.8.4            plyr_1.8
>> ggbio_1.6.6              ggplot2_0.9.3.1          VariantAnnotation_1.4.12
>> Rsamtools_1.10.2
>>   [7] Biostrings_2.26.3        GenomicRanges_1.10.7
>> IRanges_1.16.6           BiocGenerics_0.4.0
>>
>> loaded via a namespace (and not attached):
>>   [1] AnnotationDbi_1.20.7   Biobase_2.18.0         biomaRt_2.14.0
>> biovizBase_1.6.2       bitops_1.0-4.2         BSgenome_1.26.1
>> cluster_1.14.4
>>   [8] colorspace_1.2-1       DBI_0.2-5              dichromat_2.0-0
>> digest_0.6.3           GenomicFeatures_1.10.2 grid_2.15.1
>> gridExtra_0.9.1
>> [15] gtable_0.1.2           Hmisc_3.10-1           labeling_0.1
>> lattice_0.20-15        MASS_7.3-23            munsell_0.4
>> parallel_2.15.1
>> [22] proto_0.3-10           RColorBrewer_1.0-5     RCurl_1.95-4.1
>> reshape2_1.2.2         RSQLite_0.11.2         rtracklayer_1.18.2     scales_0.2.3
>> [29] stats4_2.15.1          stringr_0.6.2          tools_2.15.1
>> XML_3.96-1.1           zlibbioc_1.4.0
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list