[BioC] comparing multiple individual genomes in bioc?

Wed Jan 28 00:25:07 CET 2009

Kasper Daniel Hansen wrote:
> This is a rather difficult question to answer because no-one has really 
> done what you are proposing to do. It is true that many people hope to 
> be able to do something like this, but it is unclear how much data we 
> are talking about, what form the data is in and finally what kind of 
> stuff we want to do with the data. Without a clearer specification it 
> is pretty hard to answer this.
> 
> There is for example a difference in whether the data is just a list of 
> SNPs, a collection of genomes in FASTA format or a (big) collection of 
> short reads that needs to be assembled.
> 
> Given that no-one has done this, it is clear that the first attempts 
> will involve a lot of custom code, so don't expect any off-the-shelf 
> method for any suite of software. I will say that Bioconductor is a 
> priori neither more nor less suitable for this than any other piece of 
> software.

Hmm. Some of the Bioc infrastructure might make for great building 
blocks. The BSgenome package has facilities for dealing with 
genome-scale data, especially reference genomes. The Biostrings tools 
for custom and comparatively fast pattern matching seem very suitable 
for exploratory analysis. The conceptual foundation of the IRanges and 
Rle classes seems well suited to efficient representation of 
genome-scale 'features of interest', coupled with the flexibility to 
investigate genome-scale questions. For instance, if the SNPs of each 
genome were represented as Rle objects, it would be straightforward to 
summarize site-specific SNP abundance (just add the Rle objects using 
'+') in memory-efficient ways. Likewise, the overlap function of 
IRanges might provide a very useful tool for rapidly filtering 
per-genome annotations to identify features that are shared across 
samples. Plus the usual arguments for R, viz. established statistical 
and visualization tools and a ready interface to databases, web 
resources, etc.
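
As a rough illustration of the two ideas above (summing run-length 
encoded SNP indicators, and filtering ranges by overlap), here is a 
minimal sketch in plain Python. It is not the Rle or IRanges API; the 
function names, the 0/1 encoding of SNP presence per position, and the 
toy coordinates are all assumptions made up for the example.

```python
from functools import reduce
from itertools import groupby


def rle_encode(values):
    """Compress a sequence into a list of (value, run_length) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]


def rle_add(a, b):
    """Add two run-length encodings run by run, without expanding them.

    Both inputs must describe sequences of the same total length.
    """
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0
    rem_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)          # longest stretch where both are constant
        total = a[i][0] + b[j][0]
        if out and out[-1][0] == total:   # merge with the previous run if equal
            out[-1] = (total, out[-1][1] + step)
        else:
            out.append((total, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    if i < len(a) or j < len(b):
        raise ValueError("encodings describe sequences of different lengths")
    return out


def overlaps(query, subject):
    """Return ranges in `query` that overlap any range in `subject`.

    Ranges are (start, end) tuples with both ends inclusive. This is a
    simple quadratic scan; an indexed structure would scale better.
    """
    return [q for q in query
            if any(q[0] <= s[1] and s[0] <= q[1] for s in subject)]


# Hypothetical toy data: 1 marks a SNP relative to the reference.
genome_a = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
genome_b = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
genome_c = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# Site-specific SNP abundance across the three genomes.
abundance = reduce(rle_add, [rle_encode(g) for g in (genome_a, genome_b, genome_c)])
print(abundance)
# [(0, 2), (3, 1), (0, 3), (2, 1), (0, 1), (1, 1), (0, 1)]

# Hypothetical per-genome feature annotations as inclusive ranges.
features_a = [(100, 200), (500, 650), (900, 950)]
features_b = [(150, 180), (640, 700)]
shared = overlaps(features_a, features_b)
print(shared)
# [(100, 200), (500, 650)]
```

The point of the sketch is only that the sum is computed over runs 
rather than over every position, so the cost tracks the number of runs 
and stays small when SNPs are sparse.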

Agreed, though, that the question is open-ended and therefore hard to 
answer. It would be really exciting to hear use cases sketched out, here 
or on the bioc-sig-sequencing mailing list.

Martin

> Kasper
> 
> On Jan 27, 2009, at 9:21 , Paul Shannon wrote:
> 
>> I am becoming acquainted with the bioc packages helpful in DNA 
>> sequence analysis.  There is lots of nice stuff.
>>
>> We (like the rest of the world...) hope to soon have many individual 
>> human genomes.  We wish to compare them, looking for fine-grained 
>> variations in intragenic and extragenic regions, for clues to 
>> phenotypic variety.
>>
>> Is there support for this kind of analysis in bioc?  If not, is this 
>> planned or hoped for?
>>
>> Thanks -
>>
>> - Paul Shannon
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793