[BioC] merging VCF files

Valerie Obenchain vobencha at fhcrc.org
Mon Feb 25 22:27:06 CET 2013


Hi Stephanie,

In the case of same samples, different order I've renamed 'Samples' to 
'Samples.*' and kept them all. cbind() also keeps all columns so at 
least this is consistent between the two. An error will still be thrown 
for all other columns in colData with the same name but different data.

Here is the behavior in 1.5.40.

VCF file 1:
 > colData(vcf)
DataFrame with 3 rows and 1 column
           Samples
         <integer>
NA00001         1
NA00002         2
NA00003         3

VCF file 2 with samples in different order:
 > colData(vcf2)
DataFrame with 3 rows and 1 column
           Samples
         <integer>
NA00003         1
NA00002         2
NA00001         3

Reorder the second file so we can rbind with the first.
 > vcf2 <- vcf2[,colnames(vcf)]
 > colData(vcf2)
DataFrame with 3 rows and 1 column
           Samples
         <integer>
NA00001         3
NA00002         2
NA00003         1

The result keeps all columns. The .* extension corresponds to the order 
the files were input to rbind().

 > res <- rbind(vcf, vcf2)
 > colData(res)
DataFrame with 3 rows and 2 columns
   Samples.1 Samples.2
   <integer> <integer>
1         1         3
2         2         2
3         3         1

I missed the point of reordering by chromosome blocks in your example 
below. Was there trouble with the ordering of other slots after rbind() 
or ... ?

Thanks,
Valerie

On 02/15/2013 01:56 PM, Stephanie M. Gogarten wrote:
> I don't think that this qualifies as a "problem," but I found that an
> extra step is needed (beyond my initial simple use case) to harmonize
> colData if the colnames of the VCFs have to be re-ordered.  From the man
> page for "VCF-class" it is not clear that "colData(x) <- value" is
> possible, but happily it works.
>
> Here is what I did:
>
> svp <- ScanVcfParam(geno="GT")
> vcf1 <- readVcf(vcffile1, "hg19", svp)
> vcf2 <- readVcf(vcffile2, "hg19", svp)
>
> ## vcf1 and vcf2 have the same samples, but in different order
> vcf2 <- vcf2[,colnames(vcf1)]
> vcf <- rbind(vcf1, vcf2)
> ## Error in FUN("Samples"[[1L]], ...) :
> ##   column(s) 'Samples' in ‘colData’ are duplicated and the data do not
> match
>
> ## colData is automatically created by readVcf with "Samples" ordered 1:N
> ## after re-ordering columns, have to change it in one of the VCFs to
> use rbind
> colData(vcf2) <- colData(vcf1)
> vcf <- rbind(vcf1, vcf2)
>
> seqnames(vcf)
> ## factor-Rle of length 226660 with 48 runs
> ##   Lengths: 19000  7086 11905  9249  3070 ...  2050  1568  1855   824
>      5
> ##   Values :     1    10    11    12    13 ...     7     8     9     X
>      Y
> ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y
>
> ## re-order rows so chromosomes are in blocks
> vcf <- vcf[order(rownames(vcf)),]
> seqnames(vcf)
> ## factor-Rle of length 226660 with 24 runs
> ##   Lengths: 23538  8847 14865 11463  3835 ... 10391  8244  9637  4695
>     33
> ##   Values :     1    10    11    12    13 ...     7     8     9     X
>      Y
> ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y
>
> ## we can also order by chromosome and position
> chrom <- as.character(seqnames(vcf))
> chrom[chrom == "X"] <- 23
> chrom[chrom == "Y"] <- 24
> chrom <- as.integer(chrom)
> pos <- start(ranges(rowData(vcf)))
> vcf <- vcf[order(chrom, pos),]
> seqnames(vcf)
> ## factor-Rle of length 226660 with 24 runs
> ##   Lengths: 23538 16118 13559  9365 10466 ...  6120  2642  4982  4695
>     33
> ##   Values :     1     2     3     4     5 ...    20    21    22     X
>      Y
> ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X Y
>
>
> Thanks for the helpful additions!
>
> Stephanie
>
> On 2/5/13 1:43 PM, Valerie Obenchain wrote:
>> rbind and cbind are now implemented for SummarizedExperiment
>> (GenomicRanges 1.11.29) and VCF (VariantAnnotation 1.5.34).
>>
>> rbind is appropriate for the case of different ranges (variants) and the
>> same samples. cbind is appropriate for the same ranges and different
>> samples.
>>
>> Let me know if you run into problems/bugs.
>>
>> Valerie
>>
>>
>> On 01/11/2013 02:22 PM, Stephanie M. Gogarten wrote:
>>> Hi all,
>>>
>>> Does VariantAnnotation currently have a method to merge VCF objects?
>>> I've been looking through the documentation and code and haven't found
>>> anything like this. If not, I think it would be a useful feature to add.
>>>
>>> My use case: I have two VCF files, with the same samples (but in
>>> different order in each file). The two files have non-overlapping
>>> variants. I would love to have an rbind(VCF, VCF) method; then I could
>>> do something like:
>>>
>>> vcf2 <- vcf2[,colnames(vcf1)]
>>> vcf <- rbind(vcf1, vcf2)
>>>
>>> cbind() would also be useful, for combining files with the same variants
>>> but different samples.
>>>
>>> thanks,
>>> Stephanie
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>



More information about the Bioconductor mailing list