[BioC] Group millions of the same DNA sequences?

Wed Nov 17 02:44:59 CET 2010

Hi Wei,

Thank you for your reply! I would greatly appreciate it if you could send me your C code for reference.

Xiaohui 

-------------------------------------------------------------
From: Wei Shi
Sent: 2010-11-17 05:55:04
To: Wu, Xiaohui Ms.
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Group millions of the same DNA sequences?

Dear Xiaohui:

	The Unix command sort (which is also available on Mac) can group your sequences, but I guess you will have to write some code to count the occurrences of each distinct sequence. I have written some C code that does a similar thing and I will be happy to share it, although I am not sure whether your data format is the same as mine.
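
	As a rough illustration of the sort-then-count idea (this is only a sketch in Python, not the C code mentioned above; the function name count_sorted_reads and the sample reads are made up for the example, and it assumes the reads fit in memory):

```python
from itertools import groupby


def count_sorted_reads(reads):
    """Sort the reads so identical ones become adjacent, then count each run.

    Hypothetical helper for illustration; returns (sequence, count) pairs
    in sorted order of sequence.
    """
    return [(seq, sum(1 for _ in run)) for seq, run in groupby(sorted(reads))]


# Made-up sample data; real input would be read from a file of reads.
reads = ["ACGT", "TTGA", "ACGT", "ACGT", "TTGA", "GGCC"]
counts = count_sorted_reads(reads)
for seq, n in counts:
    print(seq, n)
```

	For 100M reads an in-memory sort like this may well be too large, in which case sorting on disk first and then counting adjacent runs follows the same logic.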

	I am not aware of any R functions that can cluster strings. I imagine such functions, if they exist, would be extremely slow on this amount of data.
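
	To make the "distance <= 2" criterion from the question concrete, here is a minimal Python sketch that checks whether two reads are within a given edit (Levenshtein) distance. The function name within_edit_distance and the max_dist parameter are assumptions for illustration. It only compares one pair of reads and is not a clustering method; comparing all pairs among 100M reads this way would not be practical.

```python
def within_edit_distance(a, b, max_dist=2):
    """Return True if the Levenshtein distance between a and b is <= max_dist.

    Hypothetical helper for illustration. Uses the standard row-by-row
    dynamic programme and stops early once every entry in the current
    row already exceeds max_dist.
    """
    # The length difference is a lower bound on the edit distance.
    if abs(len(a) - len(b)) > max_dist:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca from a
                            curr[j - 1] + 1,      # insert cb into a
                            prev[j - 1] + cost))  # match or substitute
        # Row minima never decrease, so the final distance cannot be smaller.
        if min(curr) > max_dist:
            return False
        prev = curr
    return prev[-1] <= max_dist


print(within_edit_distance("ACGT", "ACGA"))  # one substitution apart
print(within_edit_distance("AAAA", "TTTT"))  # four substitutions apart
```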

	Hope this helps.

Cheers,
Wei

On Nov 16, 2010, at 9:46 PM, Xiaohui Wu wrote:

> Hi all,
> 
> I have about 100M DNA reads, each ~150 nt long, and some of them are duplicates. Is there a way to group identical sequences into one and count them, like the unique() function in R but also reporting the number of occurrences of each read, and more efficiently?
> Also, if I want to cluster these 100M reads by similarity, e.g. an edit distance (or some other distance) <= 2, is there a function or package that can be used?
> Thank you!
> 
> Xiaohui
