[BioC] merging VCF files

Stephanie M. Gogarten sdmorris at u.washington.edu
Fri Feb 15 22:56:25 CET 2013


I don't think that this qualifies as a "problem," but I found that an 
extra step is needed (beyond my initial simple use case) to harmonize 
colData if the colnames of the VCFs have to be re-ordered.  From the man 
page for "VCF-class" it is not clear that "colData(x) <- value" is 
possible, but happily it works.

Here is what I did:

svp <- ScanVcfParam(geno="GT")
vcf1 <- readVcf(vcffile1, "hg19", svp)
vcf2 <- readVcf(vcffile2, "hg19", svp)

## vcf1 and vcf2 have the same samples, but in different order
vcf2 <- vcf2[,colnames(vcf1)]
vcf <- rbind(vcf1, vcf2)
## Error in FUN("Samples"[[1L]], ...) :
##   column(s) 'Samples' in ‘colData’ are duplicated and the data do not 
match

## colData is automatically created by readVcf with "Samples" ordered 1:N
## after re-ordering columns, have to change it in one of the VCFs to 
use rbind
colData(vcf2) <- colData(vcf1)
vcf <- rbind(vcf1, vcf2)

seqnames(vcf)
## factor-Rle of length 226660 with 48 runs
##   Lengths: 19000  7086 11905  9249  3070 ...  2050  1568  1855   824 
     5
##   Values :     1    10    11    12    13 ...     7     8     9     X 
     Y
## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y

## re-order rows so chromosomes are in blocks
vcf <- vcf[order(rownames(vcf)),]
seqnames(vcf)
## factor-Rle of length 226660 with 24 runs
##   Lengths: 23538  8847 14865 11463  3835 ... 10391  8244  9637  4695 
    33
##   Values :     1    10    11    12    13 ...     7     8     9     X 
     Y
## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y

## we can also order by chromosome and position
chrom <- as.character(seqnames(vcf))
chrom[chrom == "X"] <- 23
chrom[chrom == "Y"] <- 24
chrom <- as.integer(chrom)
pos <- start(ranges(rowData(vcf)))
vcf <- vcf[order(chrom, pos),]
seqnames(vcf)
## factor-Rle of length 226660 with 24 runs
##   Lengths: 23538 16118 13559  9365 10466 ...  6120  2642  4982  4695 
    33
##   Values :     1     2     3     4     5 ...    20    21    22     X 
     Y
## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y


Thanks for the helpful additions!

Stephanie

On 2/5/13 1:43 PM, Valerie Obenchain wrote:
> rbind and cbind are now implemented for SummarizedExperiment
> (GenomicRanges 1.11.29) and VCF (VariantAnnotation 1.5.34).
>
> rbind is appropriate for the case of different ranges (variants) and the
> same samples. cbind is appropriate for the same ranges and different
> samples.
>
> Let me know if you run into problems/bugs.
>
> Valerie
>
>
> On 01/11/2013 02:22 PM, Stephanie M. Gogarten wrote:
>> Hi all,
>>
>> Does VariantAnnotation currently have a method to merge VCF objects?
>> I've been looking through the documentation and code and haven't found
>> anything like this. If not, I think it would be a useful feature to add.
>>
>> My use case: I have two VCF files, with the same samples (but in
>> different order in each file). The two files have non-overlapping
>> variants. I would love to have an rbind(VCF, VCF) method; then I could
>> do something like:
>>
>> vcf2 <- vcf2[,colnames(vcf1)]
>> vcf <- rbind(vcf1, vcf2)
>>
>> cbind() would also be useful, for combining files with the same variants
>> but different samples.
>>
>> thanks,
>> Stephanie
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list