[BioC] merging VCF files
Stephanie M. Gogarten
sdmorris at u.washington.edu
Mon Feb 25 22:47:05 CET 2013
No problem with ordering of rbind, just (perhaps) a common use case to
put in an example.
thanks,
Stephanie
On 2/25/13 1:27 PM, Valerie Obenchain wrote:
> Hi Stephanie,
>
> In the case of same samples, different order I've renamed 'Samples' to
> 'Samples.*' and kept them all. cbind() also keeps all columns so at
> least this is consistent between the two. An error will still be thrown
> for all other columns in colData with the same name but different data.
>
> Here is the behavior in 1.5.40.
>
> VCF file 1:
> > colData(vcf)
> DataFrame with 3 rows and 1 column
> Samples
> <integer>
> NA00001 1
> NA00002 2
> NA00003 3
>
> VCF file 2 with samples in different order:
> > colData(vcf2)
> DataFrame with 3 rows and 1 column
> Samples
> <integer>
> NA00003 1
> NA00002 2
> NA00001 3
>
> Reorder the second file so we can rbind with the first.
> > vcf2 <- vcf2[,colnames(vcf)]
> > colData(vcf2)
> DataFrame with 3 rows and 1 column
> Samples
> <integer>
> NA00001 3
> NA00002 2
> NA00003 1
>
> The result keeps all columns. The .* extension corresponds to the order
> the files were input to rbind().
>
> > res <- rbind(vcf, vcf2)
> > colData(res)
> DataFrame with 3 rows and 2 columns
> Samples.1 Samples.2
> <integer> <integer>
> 1 1 3
> 2 2 2
> 3 3 1
>
> I missed the point of reordering by chromosome blocks in your example
> below. Was there trouble with the ordering of other slots after rbind()
> or ... ?
>
> Thanks,
> Valerie
>
> On 02/15/2013 01:56 PM, Stephanie M. Gogarten wrote:
>> I don't think that this qualifies as a "problem," but I found that an
>> extra step is needed (beyond my initial simple use case) to harmonize
>> colData if the colnames of the VCFs have to be re-ordered. From the man
>> page for "VCF-class" it is not clear that "colData(x) <- value" is
>> possible, but happily it works.
>>
>> Here is what I did:
>>
>> svp <- ScanVcfParam(geno="GT")
>> vcf1 <- readVcf(vcffile1, "hg19", svp)
>> vcf2 <- readVcf(vcffile2, "hg19", svp)
>>
>> ## vcf1 and vcf2 have the same samples, but in different order
>> vcf2 <- vcf2[,colnames(vcf1)]
>> vcf <- rbind(vcf1, vcf2)
>> ## Error in FUN("Samples"[[1L]], ...) :
>> ## column(s) 'Samples' in ‘colData’ are duplicated and the data do not
>> match
>>
>> ## colData is automatically created by readVcf with "Samples" ordered 1:N
>> ## after re-ordering columns, have to change it in one of the VCFs to
>> use rbind
>> colData(vcf2) <- colData(vcf1)
>> vcf <- rbind(vcf1, vcf2)
>>
>> seqnames(vcf)
>> ## factor-Rle of length 226660 with 48 runs
>> ## Lengths: 19000 7086 11905 9249 3070 ... 2050 1568 1855 824
>> 5
>> ## Values : 1 10 11 12 13 ... 7 8 9 X
>> Y
>> ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8
>> 9 X Y
>>
>> ## re-order rows so chromosomes are in blocks
>> vcf <- vcf[order(rownames(vcf)),]
>> seqnames(vcf)
>> ## factor-Rle of length 226660 with 24 runs
>> ## Lengths: 23538 8847 14865 11463 3835 ... 10391 8244 9637 4695
>> 33
>> ## Values : 1 10 11 12 13 ... 7 8 9 X
>> Y
>> ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8
>> 9 X Y
>>
>> ## we can also order by chromosome and position
>> chrom <- as.character(seqnames(vcf))
>> chrom[chrom == "X"] <- 23
>> chrom[chrom == "Y"] <- 24
>> chrom <- as.integer(chrom)
>> pos <- start(ranges(rowData(vcf)))
>> vcf <- vcf[order(chrom, pos),]
>> seqnames(vcf)
>> ## factor-Rle of length 226660 with 24 runs
>> ## Lengths: 23538 16118 13559 9365 10466 ... 6120 2642 4982 4695
>> 33
>> ## Values : 1 2 3 4 5 ... 20 21 22 X
>> Y
>> ## Levels(24): 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8
>> 9 X Y
>>
>>
>> Thanks for the helpful additions!
>>
>> Stephanie
>>
>> On 2/5/13 1:43 PM, Valerie Obenchain wrote:
>>> rbind and cbind are now implemented for SummarizedExperiment
>>> (GenomicRanges 1.11.29) and VCF (VariantAnnotation 1.5.34).
>>>
>>> rbind is appropriate for the case of different ranges (variants) and the
>>> same samples. cbind is appropriate for the same ranges and different
>>> samples.
>>>
>>> Let me know if you run into problems/bugs.
>>>
>>> Valerie
>>>
>>>
>>> On 01/11/2013 02:22 PM, Stephanie M. Gogarten wrote:
>>>> Hi all,
>>>>
>>>> Does VariantAnnotation currently have a method to merge VCF objects?
>>>> I've been looking through the documentation and code and haven't found
>>>> anything like this. If not, I think it would be a useful feature to
>>>> add.
>>>>
>>>> My use case: I have two VCF files, with the same samples (but in
>>>> different order in each file). The two files have non-overlapping
>>>> variants. I would love to have an rbind(VCF, VCF) method; then I could
>>>> do something like:
>>>>
>>>> vcf2 <- vcf2[,colnames(vcf1)]
>>>> vcf <- rbind(vcf1, vcf2)
>>>>
>>>> cbind() would also be useful, for combining files with the same
>>>> variants
>>>> but different samples.
>>>>
>>>> thanks,
>>>> Stephanie
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>
More information about the Bioconductor
mailing list