[BioC] Best practices to find intersection among variants

Julian Gehring julian.gehring at embl.de
Tue Aug 26 18:58:27 CEST 2014


Hi Marco,

Getting a bit away from Bioconductor: You could also genotype all 
samples together with the GATK UGT which will give you the information 
about variant alleles over in each samples (as described in the 
documentation and best practices).  This is helpful for eliminating 
false positive/negative calls.

Best
Julian


On 26.08.2014 07:19, Blanchette, Marco wrote:
> All right! Thank you guys.
>
> Here is my little pipeline
> 1) slurping in the vcf files in a directory
> 2) Filtering out the one that didn’t passed the GATK filtering step (This
> could be probably be done during reading from file, couldn’t figure out
> how yet…)
> 3) finding the common one among all the files
> 4) writing back the file to disk
>
>   Feel free to comment if you think there would be better ways to do that
> (It’s already better than pairwise comparison of file using java command
> line :-})
>
>
> library(VariantAnnotation)
>
> vcfDir <- "./filtSNPs"
>
> vcfFiles <- list.files(vcfDir,"\\.vcf$",full=TRUE)
>
> vcfs <- lapply(vcfFiles,readVcf,"spombe")
>
> filtered.vcfs <- lapply(vcfs,function(vcf) vcf[filt(vcf) == 'PASS'])
>
> intersectVCF <- function(v1,v2){
>      m <- match(as(v1,'VRanges'),as(v2,'VRanges'))
>      v2[m[!is.na(m)]]
> }
>
> commonVCFs <- Reduce(intersectVCF,filtered.vcfs)
>
> writeVcf(commonVCFs,file.path(vcfDir,"commonSNP.vcf"),index=TRUE)
>
>
> Thanks
>



More information about the Bioconductor mailing list