[BioC] Best practices to find intersection among variants

Blanchette, Marco MAB at stowers.org
Wed Aug 27 15:52:39 CEST 2014


Thanks Michael, 

Just as a clarification, one needs to first re-cast the VCF as VRanges,
right? I just tried it, only work if:

 intersectVCF <- function(v1,v2){
                   
            
    ## In the real life, I would make that a one operation but for the
sake of clarity
    m <- as(v1,'VRanges’) %in% (v2,'VRanges')
    v1[[m]]        
                   
  
    } 

Also, I could not find where the %in% operator is described in the
VariantAnnotation documentation. I understand that VRanges are derived
from GRanges and one would expect most of the method to work but I guess
that if the %in% operator also does comparison on the variant alleles to
also be identical (didn’t test it though), it might be important to point
it out.

Thanks a bunch, again, great package! Easy to use.


--  Marco Blanchette, Ph.D.
Genomic Scientist
Stowers Institute for Medical Research
1000 East 50th Street
Kansas City MO 64110
www.stowers.org


Tel: 816-926-4071 
Cell: 816-726-8419 
Fax: 816-926-2018


From:  Michael Lawrence <lawrence.michael at gene.com>
Date:  Tuesday, August 26, 2014 at 9:58 AM
To:  "... ..." <mab at stowers.org>
Cc:  Michael Lawrence <lawrence.michael at gene.com>, Julian Gehring
<julian.gehring at embl.de>, BioConductor <bioconductor at stat.math.ethz.ch>
Subject:  Re: [BioC] Best practices to find intersection among variants





On Tue, Aug 26, 2014 at 7:19 AM, Blanchette, Marco
<MAB at stowers.org> wrote:

All right! Thank you guys.

Here is my little pipeline
1) slurping in the vcf files in a directory
2) Filtering out the one that didn’t passed the GATK filtering step (This
could be probably be done during reading from file, couldn’t figure out
how yet…)
3) finding the common one among all the files
4) writing back the file to disk

 Feel free to comment if you think there would be better ways to do that
(It’s already better than pairwise comparison of file using java command
line :-})


library(VariantAnnotation)

vcfDir <- "./filtSNPs"

vcfFiles <- list.files(vcfDir,"\\.vcf$",full=TRUE)




Consider tools::list_files_with_exts() instead of the above.

 

vcfs <- lapply(vcfFiles,readVcf,"spombe")

filtered.vcfs <- lapply(vcfs,function(vcf) vcf[filt(vcf) == 'PASS'])


intersectVCF <- function(v1,v2){
    m <- match(as(v1,'VRanges'),as(v2,'VRanges'))
    v2[m[!is.na <http://is.na>(m)]]
}




Since you're just looking for v2 that are in v1, use:

intersectVCF <- function(v1,v2) {

  v2[v2 %in% v1]
}




commonVCFs <- Reduce(intersectVCF,filtered.vcfs)

writeVcf(commonVCFs,file.path(vcfDir,"commonSNP.vcf"),index=TRUE)


Thanks



--  Marco Blanchette, Ph.D.
Genomic Scientist
Stowers Institute for Medical Research
1000 East 50th Street
Kansas City MO 64110
www.stowers.org <http://www.stowers.org>


Tel: 816-926-4071 <tel:816-926-4071>
Cell: 816-726-8419 <tel:816-726-8419>
Fax: 816-926-2018 <tel:816-926-2018>






On 8/25/14, 8:01 PM, "Michael Lawrence" <lawrence.michael at gene.com> wrote:


>Just to clarify, you'll need to get a VRanges via
>
>vr <- as(vcf, "VRanges")
>
>Michael
>
>
>On Mon, Aug 25, 2014 at 5:13 PM, Julian Gehring <julian.gehring at embl.de>
>wrote:
>
>> Hi Marco,
>>
>> Essentially, you want to find a intersection with both the locus and the
>> ref/alt alleles matching.  The 'VariantAnnotation' has the 'match'
>>function
>> which does this (matching the alt allele).  Here is some of my code for
>> getting the common variants from two 'VRanges' objects:
>>
>>   intersectVRanges <- function(x, y) {
>>       m = match(x, y)
>>       s1 = x[!is.na <http://is.na>(m)]
>>       s2 = y[m[!is.na <http://is.na>(m)]]
>>       res = list(s1, s2)
>>       return(res)
>>   }
>>
>> Best
>> Julian
>>
>>
>> On 25.08.2014 16 <tel:25.08.2014%2016>:35, Blanchette, Marco wrote:
>>
>>> I am very very new to working with variants, maybe this question is
>>>very
>>> basic but I need to get kickstarted a bit�
>>>
>>> Just ran an analysis to find the common variation in a set of lab
>>>strains
>>> used in house in the haploid genomes of S. pombe. I used GATK best
>>> practices and I am at the stage where I have filtered variants and I
>>>would
>>> like to find the common ones. My first intuition for this was to turn
>>>to R
>>> (tired to running Java command line�) and I fumbled on
>>>VariantAnnotation
>>> which uses my favorite object, GRanges, under the hood. So I should be
>>>good.
>>>
>>> However, I can�t seem to figure out how to create intersect and I am a
>>> bit nervous as I want to find the common variants, not just the
>>>location (I
>>> can see that I could extract the Granges and do overlap operation on
>>>them,
>>> but then, I run into the danger of losing the variant information�).
>>>
>>> Could anyone provide a simple workflow to get me started? I could
>>>provide
>>> a basic starting code but it would only be restricted to loading the
>>> VariantAnnotation package and reading two vcf files� so I figured that
>>>I
>>> would not add it�
>>>
>>> Thanks a bunch and sorry for the very vcf newby question.
>>>
>>>
>>> --  Marco Blanchette, Ph.D.
>>> Genomic Scientist
>>> Stowers Institute for Medical Research
>>> 1000 East 50th Street
>>> Kansas City MO 64110
>>> www.stowers.org <http://www.stowers.org>
>>>
>>> Tel: 816-926-4071 <tel:816-926-4071>
>>> Cell: 816-726-8419 <tel:816-726-8419>
>>> Fax: 816-926-2018 <tel:816-926-2018>
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> 
https://stat.ethz.ch/mailman/listinfo/bioconductor
<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives: http://news.gmane.org/gmane.
>>> science.biology.informatics.conductor
>>>
>>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.
>> science.biology.informatics.conductor
>>
>
>       [[alternative HTML version deleted]]
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at r-project.org
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives:
>http://news.gmane.org/gmane.science.biology.informatics.conductor










More information about the Bioconductor mailing list