[BioC] merging VCF files

Stephanie M. Gogarten sdmorris at u.washington.edu
Mon Feb 25 22:47:05 CET 2013


No problem with ordering of rbind, just (perhaps) a common use case to 
put in an example.

thanks,
Stephanie

On 2/25/13 1:27 PM, Valerie Obenchain wrote:
> Hi Stephanie,
>
> In the case of same samples, different order I've renamed 'Samples' to
> 'Samples.*' and kept them all. cbind() also keeps all columns so at
> least this is consistent between the two. An error will still be thrown
> for all other columns in colData with the same name but different data.
>
> Here is the behavior in 1.5.40.
>
> VCF file 1:
>  > colData(vcf)
> DataFrame with 3 rows and 1 column
>            Samples
>          <integer>
> NA00001         1
> NA00002         2
> NA00003         3
>
> VCF file 2 with samples in different order:
>  > colData(vcf2)
> DataFrame with 3 rows and 1 column
>            Samples
>          <integer>
> NA00003         1
> NA00002         2
> NA00001         3
>
> Reorder the second file so we can rbind with the first.
>  > vcf2 <- vcf2[,colnames(vcf)]
>  > colData(vcf2)
> DataFrame with 3 rows and 1 column
>            Samples
>          <integer>
> NA00001         3
> NA00002         2
> NA00003         1
>
> The result keeps all columns. The .* extension corresponds to the order
> the files were input to rbind().
>
>  > res <- rbind(vcf, vcf2)
>  > colData(res)
> DataFrame with 3 rows and 2 columns
>    Samples.1 Samples.2
>    <integer> <integer>
> 1         1         3
> 2         2         2
> 3         3         1
>
> I missed the point of reordering by chromosome blocks in your example
> below. Was there trouble with the ordering of other slots after rbind()
> or ... ?
>
> Thanks,
> Valerie
>
> On 02/15/2013 01:56 PM, Stephanie M. Gogarten wrote:
>> I don't think that this qualifies as a "problem," but I found that an
>> extra step is needed (beyond my initial simple use case) to harmonize
>> colData if the colnames of the VCFs have to be re-ordered.  From the man
>> page for "VCF-class" it is not clear that "colData(x) <- value" is
>> possible, but happily it works.
>>
>> Here is what I did:
>>
>> svp <- ScanVcfParam(geno="GT")
>> vcf1 <- readVcf(vcffile1, "hg19", svp)
>> vcf2 <- readVcf(vcffile2, "hg19", svp)
>>
>> ## vcf1 and vcf2 have the same samples, but in different order
>> vcf2 <- vcf2[,colnames(vcf1)]
>> vcf <- rbind(vcf1, vcf2)
>> ## Error in FUN("Samples"[[1L]], ...) :
>> ##   column(s) 'Samples' in ‘colData’ are duplicated and the data do not
>> match
>>
>> ## colData is automatically created by readVcf with "Samples" ordered 1:N
>> ## after re-ordering columns, have to change it in one of the VCFs to
>> use rbind
>> colData(vcf2) <- colData(vcf1)
>> vcf <- rbind(vcf1, vcf2)
>>
>> seqnames(vcf)
>> ## factor-Rle of length 226660 with 48 runs
>> ##   Lengths: 19000  7086 11905  9249  3070 ...  2050  1568  1855   824
>>      5
>> ##   Values :     1    10    11    12    13 ...     7     8     9     X
>>      Y
>> ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8
>> 9 X Y
>>
>> ## re-order rows so chromosomes are in blocks
>> vcf <- vcf[order(rownames(vcf)),]
>> seqnames(vcf)
>> ## factor-Rle of length 226660 with 24 runs
>> ##   Lengths: 23538  8847 14865 11463  3835 ... 10391  8244  9637  4695
>>     33
>> ##   Values :     1    10    11    12    13 ...     7     8     9     X
>>      Y
>> ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8
>> 9 X Y
>>
>> ## we can also order by chromosome and position
>> chrom <- as.character(seqnames(vcf))
>> chrom[chrom == "X"] <- 23
>> chrom[chrom == "Y"] <- 24
>> chrom <- as.integer(chrom)
>> pos <- start(ranges(rowData(vcf)))
>> vcf <- vcf[order(chrom, pos),]
>> seqnames(vcf)
>> ## factor-Rle of length 226660 with 24 runs
>> ##   Lengths: 23538 16118 13559  9365 10466 ...  6120  2642  4982  4695
>>     33
>> ##   Values :     1     2     3     4     5 ...    20    21    22     X
>>      Y
>> ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8
>> 9 X Y
>>
>>
>> Thanks for the helpful additions!
>>
>> Stephanie
>>
>> On 2/5/13 1:43 PM, Valerie Obenchain wrote:
>>> rbind and cbind are now implemented for SummarizedExperiment
>>> (GenomicRanges 1.11.29) and VCF (VariantAnnotation 1.5.34).
>>>
>>> rbind is appropriate for the case of different ranges (variants) and the
>>> same samples. cbind is appropriate for the same ranges and different
>>> samples.
>>>
>>> Let me know if you run into problems/bugs.
>>>
>>> Valerie
>>>
>>>
>>> On 01/11/2013 02:22 PM, Stephanie M. Gogarten wrote:
>>>> Hi all,
>>>>
>>>> Does VariantAnnotation currently have a method to merge VCF objects?
>>>> I've been looking through the documentation and code and haven't found
>>>> anything like this. If not, I think it would be a useful feature to
>>>> add.
>>>>
>>>> My use case: I have two VCF files, with the same samples (but in
>>>> different order in each file). The two files have non-overlapping
>>>> variants. I would love to have an rbind(VCF, VCF) method; then I could
>>>> do something like:
>>>>
>>>> vcf2 <- vcf2[,colnames(vcf1)]
>>>> vcf <- rbind(vcf1, vcf2)
>>>>
>>>> cbind() would also be useful, for combining files with the same
>>>> variants
>>>> but different samples.
>>>>
>>>> thanks,
>>>> Stephanie
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>



More information about the Bioconductor mailing list