[BioC] faster way to get differential calls from pileup?

Vincent Carey stvjc at channing.harvard.edu
Sun Oct 17 00:23:13 CEST 2010


On Sat, Oct 16, 2010 at 5:14 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> On Oct 16, 2010 2:55 PM, "Hollis Wright" <wrighth at ohsu.edu> wrote:
>> Hi, all; I've got a pair of lanes of exome sequencing data; we've
>> generated pileup files from samtools and we're interested in looking
>> for discordant calls for quality control or SNP discovery. As best I
>> can figure out, the way to do this involves running findOverlaps and
>> then programmatically iterating through the match matrix to get the
>> matching positions and check for differences. However, the overlap
>> finding takes several hours, and since we anticipate there being many
>> lanes in the future I'm curious if there's a faster or better way to
>> go about this sort of process. Thanks...
>>
>
> Hi, Hollis.  Have you considered converting to VCF format and using some of
> the VCF tools for this type of thing?  With VCF, you get one row per locus
> with the genotypes for all your samples in that row.  Conversion to
> tab-delimited text is also possible for processing in R.  I think Vince
> Carey was looking into R tools for working with VCF, but I don't know where
> that work stands.

There is a vcf2sm function in the GGtools "devel" branch (with luck it
should reach release this Monday). The intention is to take a
compressed, tabix-indexed VCF file (as distributed by 1000 Genomes,
specifically) and create a snpMatrix snp.matrix instance holding the
genotype calls on a chromosome for all individuals archived in the file.
The current code was written a while ago and emphasizes a small
footprint, with a naive piping interface that assumes tabix is installed
and trivially accessible. There is much more information available in
VCF that this code makes no effort to extract. Some discussion of VCF
harvesting would be in order at the Heidelberg developer meeting, and
comments from interested developers/users are welcome.
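
To make the idea concrete, here is a minimal sketch of turning VCF text
lines into a table of genotype calls. It is written in Python purely for
illustration and is not the vcf2sm code; the function names gt_to_count
and vcf_to_matrix are hypothetical. It assumes uncompressed VCF lines
(for example, already piped out of tabix) that include the #CHROM header
line, and it reads only the GT field, coding each call as the number of
non-reference alleles.

```python
def gt_to_count(gt):
    """Convert a GT string such as '0/1' or '1|1' to a count of
    non-reference alleles; return None when any allele is missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")


def vcf_to_matrix(lines):
    """Return (samples, snp_ids, rows): one row per locus, one column
    per sample. Hypothetical helper, for illustration only."""
    samples, snp_ids, rows = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        gt_index = fields[8].split(":").index("GT")
        # Fall back to chrom:pos when the ID column is empty (".")
        snp_ids.append(fields[2] if fields[2] != "." else fields[0] + ":" + fields[1])
        rows.append([gt_to_count(s.split(":")[gt_index]) for s in fields[9:]])
    return samples, snp_ids, rows


vcf = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\t.\tGT:DP\t0|0:1\t1|0:8",
    "20\t17330\t.\tT\tA\t3\tq10\t.\tGT:DP\t0/1:3\t./.:4",
]
samples, snp_ids, rows = vcf_to_matrix(vcf)
print(samples, snp_ids, rows)
```

Everything else in the record (QUAL, FILTER, INFO, the other FORMAT
fields) is ignored here, which mirrors the "no effort to extract"
limitation described above.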

>
> All that said, several hours for finding overlaps sounds like a long time
> for a couple of pileup outputs from exome sequencing.
>
> Sean
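
Since both pileups report single positions, an exact match on
(chromosome, position) may be enough, with no interval overlap search.
The sketch below shows that keyed join in Python for illustration only;
the same idea could be expressed in R with merge() on two data frames.
It assumes simplified tab-delimited rows whose first four columns are
chromosome, position, reference base and consensus call, which may not
match the actual samtools output in use. The names read_calls and
discordant_calls are hypothetical.

```python
def read_calls(lines):
    """Map (chrom, pos) -> consensus call from pileup-like lines.
    Assumes columns: chrom, pos, ref base, consensus call."""
    calls = {}
    for line in lines:
        if not line.strip():
            continue
        chrom, pos, _ref, call = line.rstrip("\n").split("\t")[:4]
        calls[(chrom, int(pos))] = call.upper()
    return calls


def discordant_calls(lane_a, lane_b):
    """Return sorted (chrom, pos, call_a, call_b) tuples for positions
    present in both lanes whose consensus calls differ."""
    a = read_calls(lane_a)
    b = read_calls(lane_b)
    shared = a.keys() & b.keys()  # exact key match, no overlap search
    return sorted(
        (chrom, pos, a[(chrom, pos)], b[(chrom, pos)])
        for (chrom, pos) in shared
        if a[(chrom, pos)] != b[(chrom, pos)]
    )


lane1 = ["chr1\t100\tA\tA", "chr1\t101\tC\tT", "chr2\t50\tG\tG"]
lane2 = ["chr1\t100\tA\tA", "chr1\t101\tC\tC", "chr2\t60\tT\tT"]
print(discordant_calls(lane1, lane2))
```

Each lane is read once into a dictionary, so the work grows roughly
linearly with the number of positions rather than with the number of
pairwise comparisons.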
>
>> Hollis Wright
>>
>> Sent from my iPhone
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


