[BioC] faster way to get differential calls from pileup?

Vincent Carey stvjc at channing.harvard.edu
Sun Oct 17 01:57:26 CEST 2010


I forgot to mention that vcf2sm was used to construct the data
elements of the ceu1kg experimental data package -- approximately 8
million SNP calls on each of 60 individuals, with expression data on 41
individuals from the GENEVAR archive.  ceu1kg also includes GRanges
instances annotating the locations and names of the SNPs.

On Sat, Oct 16, 2010 at 6:23 PM, Vincent Carey
<stvjc at channing.harvard.edu> wrote:
> On Sat, Oct 16, 2010 at 5:14 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>> On Oct 16, 2010 2:55 PM, "Hollis Wright" <wrighth at ohsu.edu> wrote:
>>> Hi, all; I've got a pair of lanes of exome sequencing data; we've
>>> generated pileup files from samtools and we're interested in looking
>>> for discordant calls for quality control or SNP discovery. As best I
>>> can figure out, the way to do this involves doing a findOverlaps and
>>> then programmatically iterating through the match matrix to get the
>>> matching positions and check for differences. However, the overlap
>>> finding takes several hours, and since we anticipate there being many
>>> lanes in the future, I'm curious if there's a faster or better way to
>>> go about this sort of process. Thanks...
>>>
>>
>> Hi, Hollis.  Have you considered converting to VCF format and using some of
>> the VCF tools for this type of thing?  With VCF, you get one row per locus
>> with the genotypes for all your samples in that row.  Conversion to
>> tab-delimited text is also possible for processing in R.  I think Vince
>> Carey was looking into R tools for working with VCF, but I don't know where
>> that work stands.
>
> There is a vcf2sm function in the GGtools "devel" branch (it should get
> to release, with luck, this Monday).  The intention is to take a
> compressed, tabix-indexed VCF file (as distributed by 1000 Genomes,
> specifically) and create a snpMatrix snp.matrix instance for genotype
> calls on a chromosome for all individuals archived in a file.  The
> current code was written a while ago and emphasizes a small footprint,
> with a naive piping interface that assumes tabix is installed and
> trivially accessible.  There is much more information available in VCF
> that this code makes no effort to extract.  Some discussion of VCF
> harvesting would be in order at the Heidelberg developer meeting, and
> comments from interested developers/users are welcome.
>
>>
>> All that said, several hours for finding overlaps sounds like a long time
>> for a couple of pileup outputs from exome sequencing.
>>
>> Sean
>>
>>> Hollis Wright
>>>
>>> Sent from my iPhone
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
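For the discordant-call question in the quoted thread, one alternative to
an overlap search is to key each lane's calls by (chromosome, position)
and compare only the shared keys.  What follows is a minimal sketch in
Python (not R, and not part of GGtools), assuming tab-delimited pileup
lines with the chromosome in column 1, the 1-based position in column 2
and the called base in column 4, as in samtools consensus pileup output.
The helper names read_calls and discordant_calls and the toy input lines
are hypothetical, for illustration only.

```python
import io


def read_calls(lines, call_col=3):
    """Map (chrom, pos) -> called base from tab-delimited pileup-like lines.

    call_col is the 0-based column holding the call; 3 is an assumption
    that matches samtools consensus pileup output.
    """
    calls = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        calls[(fields[0], int(fields[1]))] = fields[call_col].upper()
    return calls


def discordant_calls(calls_a, calls_b):
    """Return sorted (chrom, pos, call_a, call_b) where both lanes have a
    call at the same position and the two calls differ."""
    shared = calls_a.keys() & calls_b.keys()
    return sorted(
        (chrom, pos, calls_a[(chrom, pos)], calls_b[(chrom, pos)])
        for chrom, pos in shared
        if calls_a[(chrom, pos)] != calls_b[(chrom, pos)]
    )


# Toy input standing in for two pileup files (made-up lines).
lane1 = io.StringIO("chr1\t100\tA\tA\nchr1\t101\tC\tY\nchr1\t102\tG\tG\n")
lane2 = io.StringIO("chr1\t100\tA\tA\nchr1\t101\tC\tC\nchr1\t103\tT\tT\n")

result = discordant_calls(read_calls(lane1), read_calls(lane2))
print(result)
```

With real files, the StringIO objects would be replaced by open file
handles.  Each lane is read once and the comparison is a dictionary
lookup per shared position, at the cost of holding both lanes' calls in
memory.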


