[BioC] Group millions of the same DNA sequences?

Wed Nov 17 02:09:22 CET 2010

On 11/16/2010 02:46 AM, Xiaohui Wu wrote:
> Hi all,
> 
> I have about 100 million DNA reads, each ~150 nt long, and some of them are duplicates. Is there a way to collapse identical sequences into one and count how often each occurs, like the unique() function in R but with the number of occurrences of each read, and more efficiently?

dna = Biostrings::DNAStringSet(<reads>)
ShortRead::tables(dna, Inf)[["top"]]

for this; also, selectMethod(tables, "DNAStringSet") shows the code, in
case the format (a named list) is not to your liking.
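
As a rough illustration of what that call returns, here is a minimal sketch on a toy input. The six-read character vector and the names `reads`, `tbl`, `top` and `counts` are made up for the example; real reads would be loaded from a file instead. It assumes the Biostrings and ShortRead packages are installed.

```r
# Toy stand-in for the real reads (assumption: the reads are available as a
# character vector; the actual 100M reads would be read from a file).
reads <- c("ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC")

dna <- Biostrings::DNAStringSet(reads)

# n = Inf asks for counts of every distinct sequence, not only the most
# common ones (the default reports just the top 50).
tbl <- ShortRead::tables(dna, n = Inf)

# "top" is a named integer vector: the names are the distinct sequences,
# the values are the number of times each one occurs.
top <- tbl[["top"]]
print(top)

# Reshape into a data frame if that is easier to work with.
counts <- data.frame(sequence = names(top),
                     count = as.integer(top),
                     stringsAsFactors = FALSE)
print(counts)
```

The data frame step is only one possible way to repackage the result; the counts themselves are already in `top`.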

Martin

> Also, if I want to cluster these 100M reads based on their similarity, e.g. an edit distance or some other distance <=2, is there a function or package that can be used?
> Thank you!

> 
> Xiaohui

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793