[BioC] Group millions of the same DNA sequences?

Moshe Olshansky m_olshansky at yahoo.com
Tue Nov 16 23:32:28 CET 2010


You can use the aggregate function: if X is a character vector of your sequences, so that X[i] is your i-th sequence (i = 1, 2, ..., 100M), then do
y <- aggregate(X, list(X), length)
y is then a two-column data.frame, with column 1 containing the unique sequences and column 2 containing their counts.
If this is too slow in R, use sort -u on Unix (as Wei suggested) to get the sequences sorted and unique, then write a simple C program that runs over all your sequences and adds 1 to count[i], where i is the sequence's index in the unique, sorted list (binary search will find that i quickly).


--- On Tue, 16/11/10, Xiaohui Wu <wux3 at muohio.edu> wrote:

> From: Xiaohui Wu <wux3 at muohio.edu>
> Subject: [BioC] Group millions of the same DNA sequences?
> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
> Received: Tuesday, 16 November, 2010, 9:46 PM
> Hi all,
> 
> I have millions (~100M) of DNA reads, each ~150 nt, and some of
> them are duplicates. Is there any way to group identical sequences
> into one and count their occurrences, like the unique() function in
> R but keeping the counts, and also more efficient?
> Also, if I want to cluster these 100M reads based on their
> similarity, e.g. edit distance <= 2, is there a function or package
> that can be used?
> Thank you!
> 
> Xiaohui
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
