[BioC] Best practices to find intersection among variants

Blanchette, Marco MAB at stowers.org
Tue Aug 26 16:19:28 CEST 2014


All right! Thank you guys.

Here is my little pipeline
1) slurping in the vcf files in a directory
2) Filtering out the one that didn’t passed the GATK filtering step (This
could be probably be done during reading from file, couldn’t figure out
how yet…)
3) finding the common one among all the files
4) writing back the file to disk

 Feel free to comment if you think there would be better ways to do that
(It’s already better than pairwise comparison of file using java command
line :-})


library(VariantAnnotation)

vcfDir <- "./filtSNPs"

vcfFiles <- list.files(vcfDir,"\\.vcf$",full=TRUE)

vcfs <- lapply(vcfFiles,readVcf,"spombe")

filtered.vcfs <- lapply(vcfs,function(vcf) vcf[filt(vcf) == 'PASS'])

intersectVCF <- function(v1,v2){
    m <- match(as(v1,'VRanges'),as(v2,'VRanges'))
    v2[m[!is.na(m)]]
}

commonVCFs <- Reduce(intersectVCF,filtered.vcfs)

writeVcf(commonVCFs,file.path(vcfDir,"commonSNP.vcf"),index=TRUE)


Thanks



--  Marco Blanchette, Ph.D.
Genomic Scientist
Stowers Institute for Medical Research
1000 East 50th Street
Kansas City MO 64110
www.stowers.org


Tel: 816-926-4071 
Cell: 816-726-8419 
Fax: 816-926-2018





On 8/25/14, 8:01 PM, "Michael Lawrence" <lawrence.michael at gene.com> wrote:

>Just to clarify, you'll need to get a VRanges via
>
>vr <- as(vcf, "VRanges")
>
>Michael
>
>
>On Mon, Aug 25, 2014 at 5:13 PM, Julian Gehring <julian.gehring at embl.de>
>wrote:
>
>> Hi Marco,
>>
>> Essentially, you want to find a intersection with both the locus and the
>> ref/alt alleles matching.  The 'VariantAnnotation' has the 'match'
>>function
>> which does this (matching the alt allele).  Here is some of my code for
>> getting the common variants from two 'VRanges' objects:
>>
>>   intersectVRanges <- function(x, y) {
>>       m = match(x, y)
>>       s1 = x[!is.na(m)]
>>       s2 = y[m[!is.na(m)]]
>>       res = list(s1, s2)
>>       return(res)
>>   }
>>
>> Best
>> Julian
>>
>>
>> On 25.08.2014 16:35, Blanchette, Marco wrote:
>>
>>> I am very very new to working with variants, maybe this question is
>>>very
>>> basic but I need to get kickstarted a bit�
>>>
>>> Just ran an analysis to find the common variation in a set of lab
>>>strains
>>> used in house in the haploid genomes of S. pombe. I used GATK best
>>> practices and I am at the stage where I have filtered variants and I
>>>would
>>> like to find the common ones. My first intuition for this was to turn
>>>to R
>>> (tired to running Java command line�) and I fumbled on
>>>VariantAnnotation
>>> which uses my favorite object, GRanges, under the hood. So I should be
>>>good.
>>>
>>> However, I can�t seem to figure out how to create intersect and I am a
>>> bit nervous as I want to find the common variants, not just the
>>>location (I
>>> can see that I could extract the Granges and do overlap operation on
>>>them,
>>> but then, I run into the danger of losing the variant information�).
>>>
>>> Could anyone provide a simple workflow to get me started? I could
>>>provide
>>> a basic starting code but it would only be restricted to loading the
>>> VariantAnnotation package and reading two vcf files� so I figured that
>>>I
>>> would not add it�
>>>
>>> Thanks a bunch and sorry for the very vcf newby question.
>>>
>>>
>>> --  Marco Blanchette, Ph.D.
>>> Genomic Scientist
>>> Stowers Institute for Medical Research
>>> 1000 East 50th Street
>>> Kansas City MO 64110
>>> www.stowers.org
>>>
>>> Tel: 816-926-4071
>>> Cell: 816-726-8419
>>> Fax: 816-926-2018
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.
>>> science.biology.informatics.conductor
>>>
>>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.
>> science.biology.informatics.conductor
>>
>
>	[[alternative HTML version deleted]]
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at r-project.org
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives:
>http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list