[BioC] Best practices to find intersection among variants

Michael Lawrence lawrence.michael at gene.com
Tue Aug 26 16:58:08 CEST 2014


On Tue, Aug 26, 2014 at 7:19 AM, Blanchette, Marco <MAB at stowers.org> wrote:

> All right! Thank you guys.
>
> Here is my little pipeline
> 1) slurping in the vcf files in a directory
> 2) Filtering out the one that didn’t passed the GATK filtering step (This
> could be probably be done during reading from file, couldn’t figure out
> how yet…)
> 3) finding the common one among all the files
> 4) writing back the file to disk
>
>  Feel free to comment if you think there would be better ways to do that
> (It’s already better than pairwise comparison of file using java command
> line :-})
>
>
> library(VariantAnnotation)
>
> vcfDir <- "./filtSNPs"
>
> vcfFiles <- list.files(vcfDir,"\\.vcf$",full=TRUE)
>
>
Consider tools::list_files_with_exts() instead of the above.


> vcfs <- lapply(vcfFiles,readVcf,"spombe")
>
> filtered.vcfs <- lapply(vcfs,function(vcf) vcf[filt(vcf) == 'PASS'])
>
> intersectVCF <- function(v1,v2){
>     m <- match(as(v1,'VRanges'),as(v2,'VRanges'))
>     v2[m[!is.na(m)]]
> }
>
>
Since you're just looking for v2 that are in v1, use:

intersectVCF <- function(v1,v2) {
  v2[v2 %in% v1]
}


commonVCFs <- Reduce(intersectVCF,filtered.vcfs)
>
> writeVcf(commonVCFs,file.path(vcfDir,"commonSNP.vcf"),index=TRUE)
>
>
> Thanks
>
>
>
> --  Marco Blanchette, Ph.D.
> Genomic Scientist
> Stowers Institute for Medical Research
> 1000 East 50th Street
> Kansas City MO 64110
> www.stowers.org
>
>
> Tel: 816-926-4071
> Cell: 816-726-8419
> Fax: 816-926-2018
>
>
>
>
>
> On 8/25/14, 8:01 PM, "Michael Lawrence" <lawrence.michael at gene.com> wrote:
>
> >Just to clarify, you'll need to get a VRanges via
> >
> >vr <- as(vcf, "VRanges")
> >
> >Michael
> >
> >
> >On Mon, Aug 25, 2014 at 5:13 PM, Julian Gehring <julian.gehring at embl.de>
> >wrote:
> >
> >> Hi Marco,
> >>
> >> Essentially, you want to find a intersection with both the locus and the
> >> ref/alt alleles matching.  The 'VariantAnnotation' has the 'match'
> >>function
> >> which does this (matching the alt allele).  Here is some of my code for
> >> getting the common variants from two 'VRanges' objects:
> >>
> >>   intersectVRanges <- function(x, y) {
> >>       m = match(x, y)
> >>       s1 = x[!is.na(m)]
> >>       s2 = y[m[!is.na(m)]]
> >>       res = list(s1, s2)
> >>       return(res)
> >>   }
> >>
> >> Best
> >> Julian
> >>
> >>
> >> On 25.08.2014 16:35, Blanchette, Marco wrote:
> >>
> >>> I am very very new to working with variants, maybe this question is
> >>>very
> >>> basic but I need to get kickstarted a bit�
> >>>
> >>> Just ran an analysis to find the common variation in a set of lab
> >>>strains
> >>> used in house in the haploid genomes of S. pombe. I used GATK best
> >>> practices and I am at the stage where I have filtered variants and I
> >>>would
> >>> like to find the common ones. My first intuition for this was to turn
> >>>to R
> >>> (tired to running Java command line�) and I fumbled on
> >>>VariantAnnotation
> >>> which uses my favorite object, GRanges, under the hood. So I should be
> >>>good.
> >>>
> >>> However, I can�t seem to figure out how to create intersect and I am a
> >>> bit nervous as I want to find the common variants, not just the
> >>>location (I
> >>> can see that I could extract the Granges and do overlap operation on
> >>>them,
> >>> but then, I run into the danger of losing the variant information�).
> >>>
> >>> Could anyone provide a simple workflow to get me started? I could
> >>>provide
> >>> a basic starting code but it would only be restricted to loading the
> >>> VariantAnnotation package and reading two vcf files� so I figured that
> >>>I
> >>> would not add it�
> >>>
> >>> Thanks a bunch and sorry for the very vcf newby question.
> >>>
> >>>
> >>> --  Marco Blanchette, Ph.D.
> >>> Genomic Scientist
> >>> Stowers Institute for Medical Research
> >>> 1000 East 50th Street
> >>> Kansas City MO 64110
> >>> www.stowers.org
> >>>
> >>> Tel: 816-926-4071
> >>> Cell: 816-726-8419
> >>> Fax: 816-926-2018
> >>>
> >>>         [[alternative HTML version deleted]]
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives: http://news.gmane.org/gmane.
> >>> science.biology.informatics.conductor
> >>>
> >>>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives: http://news.gmane.org/gmane.
> >> science.biology.informatics.conductor
> >>
> >
> >       [[alternative HTML version deleted]]
> >
> >_______________________________________________
> >Bioconductor mailing list
> >Bioconductor at r-project.org
> >https://stat.ethz.ch/mailman/listinfo/bioconductor
> >Search the archives:
> >http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>

	[[alternative HTML version deleted]]



More information about the Bioconductor mailing list