[BioC] VCF class: different length when unlisting INFO CompressedCharacterList

Valerie Obenchain vobencha at fhcrc.org
Tue May 27 23:06:30 CEST 2014


Hi,

VariantAnnotation doesn't have a function that expands the data 'by 
sample' as you've described but there are several helpers that can make 
the process more straight forward.

(1) filterVcf()

This function filters records from a vcf and outputs another vcf on 
disk. You can use this to create a subset file of records with 
FILTER=PASS. The function can't be used to remove records with GT='.' 
because of the structure of the vcf file. The filtering removes variants 
(rows) that don't meet the criteria. In the case of the GT filtering you 
are removing select variants per sample so this is better done at a 
later stage.

Filtering by the presence/absence of a value can be done with a pre-filter:

## sample data
fl <- system.file("extdata", "chr7-sub.vcf.gz",
                   package="VariantAnnotation")

## pre-filter for 'PASS':
pass_filter = function(x) grepl("PASS", x, fixed=TRUE)
PF <- FilterRules(list(pass_filter))

## here the filtered vcf is sent to a temp file:
filtered_vcf <- filterVcf(fl, "hg19", tempfile(), prefilters=PF)

(2) ScanVcfParam

You may have already discovered that a ScanVcfParam object specifies 
subsets of data to be read in. Using a 'param' can speed up readVcf() 
and makes for a lighter VCF object in R.

## read in specific genotype (FORMAT) fields and no INFO data:
param <- ScanVcfParam(info=NA, geno=c("GT", "CGA_OAD", "CGA_ORDP"))
vcf <- readVcf(filtered_vcf, "hg19", param=param)

>> geno(vcf)
> List of length 3
> names(3): GT CGA_OAD CGA_ORDP

The file has 2 samples:
>> geno(vcf)$GT[1:4,]
>                 HCC1187-H-200-37-ASM-N1 HCC1187-H-200-37-ASM-T1
> 7:55003988_A/G  "1/1"                   "1/1"
> 7:55010562_GA/G "1/1"                   "1/1"
> 7:55012097_T/C  "1/0"                   "1/0"
> 7:55015505_C/T  "1/0"                   "1/0"

More below.

On 05/26/2014 06:15 AM, Sigve Nakken wrote:
> Hi,
>
> I’ve had similar challenges as Francesco, and have unsuccessfully tried to use the data structures and functions provided by VariantAnnotation. My experience is that I need the ‘expand' functionality more often with respect to the samples rather than the annotation tags. And as my experience with R is somewhat limited, I tend to like to work with simple data.frames when I want to summarise and characterise the data.
>
> Here is how I generated a simple data frame with all sample variants (variants and samples are dummy encoded):
>
> ## read VCF (100 samples)
>> all_vcf <- readVcf(‘cancer.exome.project.vcf.gz','hg19')
>> seqinfo(all_vcf) <- seqinfo(BSgenome.Hsapiens.UCSC.hg19)
>> rowData(all_vcf)
> GRanges with 50383 ranges and 5 metadata columns:
>>
> ## get variants with ‘PASS’ filter
>> all_vcf_PASS <- all_vcf[str_detect(elementMetadata(all_vcf)$FILTER,"PASS"),]
>> rowData(all_vcf_PASS)
> GRanges with 4252 ranges and 5 metadata columns:
> ...
>
>
> ## get genotype information for all samples that have called genotypes (GT != ‘.’)
> ## From my VCF:
> ## FORMAT=<ID=ADT,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed (tumor)">
> ## FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
> ## FORMAT=<ID=ADC,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed (control)">
> ##
>> tumor_ref_support <- t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[1,])
>> normal_ref_support <- t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[1,])
>> tumor_alt_support <- t(as.data.frame(geno(all_vcf_PASS)$ADT[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[2,])
>> normal_alt_support <- t(as.data.frame(geno(all_vcf_PASS)$ADC[which(geno(all_vcf_PASS)$GT != '.', arr.ind=T)],row.names=NULL)[2,])

I don't think it's necessary (or efficient) to subset the data on GT = 
'.'. The ADC and ADT values should be NA if GT was '.' and the 
subsequent math operations (i.e., '+') will produce NAs but no errors. 
If you really want to remove them you can use is.na(geno(all_vcf_PASS)$ADT.

>
> ## get sample ids and variant ids for the ‘expanded set of variants'
>> b <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$col
>> c <- as.data.frame(which(geno(all_vcf_PASS)$GT != '.', arr.ind=T))$row
>> sample_ids <- as.data.frame(rownames(colData(all_vcf_PASS))[b])
>> variant_ids <- as.data.frame(rownames(geno(all_vcf_PASS)$GT)[c])
>
>
> ## make simple data.frame of all samples with genotype information
>> row.names(tumor_ref_support) <- NULL
>> row.names(normal_ref_support) <- NULL
>> row.names(tumor_alt_support) <- NULL
>> row.names(external_passed) <- NULL
>> row.names(normal_alt_support) <- NULL
>
>> tmp <- as.data.frame(cbind(variant_ids,sample_ids,tumor_ref_support,tumor_alt_support,normal_ref_support,normal_alt_support))
>> colnames(tmp) <- c('variant_id','sample_id','tumor_ref_support','tumor_alt_support','normal_ref_support','normal_alt_support')
>> tmp$allele_frac_tumor <- rep(0,nrow(tmp))
>> tmp$tumor_depth <- tmp$tumor_ref_support + tmp$tumor_alt_support
>> tmp$normal_depth <- tmp$normal_ref_support + tmp$normal_alt_support
>> tmp[tmp$tumor_depth > 0,]$allele_frac_tumor <- round(tmp[tmp$tumor_depth > 0,]$tumor_alt_support / tmp[tmp$tumor_depth > 0,]$tumor_depth,2)

Because the geno variables are stored in matrices you can add them 'as 
is'. Underneath arrays are just vectors with a dimension attached. This 
operation on my sample variable doesn't make biological sense but it's 
the idea that you can just add the arrays of 'tumor_ref_support' and 
'tumor_alt_support'.

## bogus new variable
new_var <- geno(vcf)$CGA_OAD + geno(vcf)$CGA_OAD

The result is another array with the same dimensions:

>> dim(geno(vcf)$CGA_OAD)
> [1] 1184    2    2

>> dim(new_var)
> [1] 1184    2    2

Isolate the samples to separate array elements by rearranging the array 
with aperm().

ss <- aperm(new_var, c(1, 3, 2))
array_dims <- dim(ss)
>> array_dims
> [1] 1184    2    2

Create vectors of variant and sample names:
variant_id <- rep.int(dimnames(vcf)[[1]], array_dims[2] * array_dims[3])
sample_id <- rep(dimnames(vcf)[[2]], each=array_dims[1] * array_dims[2])

Create the data.frame:
df <- data.frame(variant_id, sample_id, new_var=as.vector(ss))
>> head(df)
>        variant_id               sample_id new_var
> 1  7:55003988_A/G HCC1187-H-200-37-ASM-N1     168
> 2 7:55010562_GA/G HCC1187-H-200-37-ASM-N1     100
> 3  7:55012097_T/C HCC1187-H-200-37-ASM-N1      64
> 4  7:55015505_C/T HCC1187-H-200-37-ASM-N1      38
> 5  7:55015541_A/. HCC1187-H-200-37-ASM-N1      NA
> 6  7:55015699_C/T HCC1187-H-200-37-ASM-N1      20

You might also take a look at the VRanges class. It's a flat/light form 
of vcf data necessary for variant calling and filtering. It doesn't 
expand the genotype data as per this example but it may be useful to you 
in other applications. See ?VRanges for details.


Valerie

>
> ##
>> head(tmp)
>            variant_id sample_id tumor_ref_support tumor_alt_support normal_ref_support normal_alt_support allele_frac_tumor tumor_depth normal_depth
> 1 chr5_10000000_T_C 001B_001T                17                 7                 21                  0              0.29          24           21
> 2  chr11_10000000_C_T 001B_001T                31                 4                 33                  0              0.11          35           33
> 3 chr18_10000000_C_T 001B_001T                21                 7                 37                  1              0.25          28           38
> 4   chrY_1000000_A_G 001B_001T                10                 2                 11                  0              0.17          12           11
> 5  chr1_1000000_G_C 011B_011T                17                 3                 48                  0              0.15          20           48
> 6  chr1_1000000_A_G 011B_011T                77                13                114                  0              0.14          90          114
>
>> str(tmp)
> 'data.frame':	4418 obs. of  9 variables:
>   $ variant_id        : Factor w/ 4252 levels "chr1_1000000_G_T",..: 3254 620 1644 4251 359 381 391 401 246 1875 ...
>   $ sample_id         : Factor w/ 99 levels "001B_001T","011B_011T",..: 1 1 1 1 2 2 2 2 2 2 ...
>   $ tumor_ref_support : int  17 31 21 10 17 77 12 40 75 53 ...
>   $ tumor_alt_support : int  7 4 7 2 3 13 3 6 8 9 ...
>   $ normal_ref_support: int  21 33 37 11 48 114 19 52 88 89 ...
>   $ normal_alt_support: int  0 0 1 0 0 0 0 0 0 0 ...
>   $ allele_frac_tumor : num  0.29 0.11 0.25 0.17 0.15 0.14 0.2 0.13 0.1 0.15 ...
>   $ tumor_depth       : int  24 35 28 12 20 90 15 46 83 62 ...
>   $ normal_depth      : int  21 33 38 11 48 114 19 52 88 89 ...
>
> Next, I plan to do a merge with my functional annotations (info) using the variant_id as the key, which I think would be straightforward.
>
> If there is a more convenient way to get here using the VariantAnnotation package, I would be grateful to hear about this.
>
> ---
> Sigve Nakken, PhD
> Postdoctoral Fellow, Dept. of Tumor Biology
> Institute for Cancer Research
> Oslo University Hospital, Norway
> phone: +4795753022
> email: sigven at ifi.uio.no
>
>
>
>
>
>
> 	[[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>


-- 
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fhcrc.org
Phone: (206) 667-3158



More information about the Bioconductor mailing list