[BioC] Group millions of the same DNA sequences?

Harris A. Jaffee hj at jhu.edu
Tue Nov 16 23:38:42 CET 2010


Sorry if I'm missing the point, but ...

For the easier question with edit distance zero, see ?rle, as in

 > reads <- c("ACG", "TTT", "ACG")
 > rle(sort(reads))
Run Length Encoding
   lengths: int [1:2] 2 1
   values : chr [1:2] "ACG" "TTT"

Beware that rle only collapses runs of adjacent equal values, so
without the sort it won't group the duplicates.
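
To make the point about sorting concrete, here is a minimal base-R sketch on the same toy reads. The names unsorted_runs, runs, counts and tab are only illustrative, and table() is offered as one possible alternative, not as something suggested in the thread.

```r
reads <- c("ACG", "TTT", "ACG")

# Without sorting, the two "ACG" reads are not adjacent, so every run has length 1.
unsorted_runs <- rle(reads)

# After sorting, identical reads are adjacent and collapse into one run each.
runs <- rle(sort(reads))
counts <- data.frame(sequence = runs$values,
                     count = runs$lengths,
                     stringsAsFactors = FALSE)

# table() tallies the same counts without an explicit sort.
tab <- table(reads)
```

Whether either approach is fast enough for 100M reads would need to be measured; both hold all reads in memory.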

A cheap alternative, from IRanges, that doesn't yield quite as much
information:

 > high2low(reads)
[1] NA NA  1
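
As a rough illustration of what that output encodes, the base-R sketch below reproduces it with match(), assuming that high2low maps each duplicated element to the index of its first occurrence and gives NA otherwise. That reading is inferred from the output above, and the names first, h2l, rep_idx and counts are hypothetical.

```r
reads <- c("ACG", "TTT", "ACG")

# Index of the first occurrence of each read.
first <- match(reads, reads)

# NA for a first occurrence, otherwise the index of the earlier identical read.
h2l <- ifelse(first == seq_along(reads), NA_integer_, first)

# Counts can be recovered by pointing every read at its representative.
rep_idx <- ifelse(is.na(h2l), seq_along(h2l), h2l)
counts <- tabulate(rep_idx, nbins = length(reads))
```

Here counts is nonzero only at the positions of first occurrences, which is the extra step needed to get the tallies that rle reports directly.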

I think of high2low as a gross abstraction of PDict, from Biostrings,
as in

 > D <- PDict(c("ACG", "TTT", "ACG"))

 > D@dups0@high2low
[1] NA NA  1

For the harder question, you might explore this:
	
	PDict(your_reads, max.mismatch=2)
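
For a feel of what grouping at distance <= 2 means, here is a toy base-R sketch. It is not the PDict approach: it swaps in pairwise edit distances from utils::adist plus single-linkage clustering, which is quadratic in the number of reads and so only an illustration for a handful of sequences, not something that would scale to 100M. The reads and the names d, hc and cluster are made up for the example.

```r
# Five toy reads: two near-identical pairs and one outlier.
reads <- c("ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "CCCCCCCC")

# Pairwise generalized Levenshtein (edit) distances.
d <- adist(reads)

# Single linkage: two reads share a cluster if a chain of reads links them,
# with each consecutive pair in the chain at distance <= 2.
hc <- hclust(as.dist(d), method = "single")
cluster <- cutree(hc, h = 2)
```

Note that single linkage can put two reads in the same cluster even when they are more than 2 edits apart, as long as intermediate reads connect them; a stricter definition would need a different linkage.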

-Harris

On Nov 16, 2010, at 4:54 PM, Wei Shi wrote:

> Dear Xiaohui:
>
> 	The unix command sort (which is also available on Mac) can group  
> your sequences. But I guess you will have to write some code to  
> count the occurrences of each distinct sequence. I have written some C
> code to do a similar thing and I will be happy to share it. But
> I am not sure if your data format is the same as mine.
>
> 	I am not aware of any R functions which can cluster strings. I  
> imagine it would be extremely slow if there were such functions.
>
> 	Hope this helps.
>
> Cheers,
> Wei
>
> On Nov 16, 2010, at 9:46 PM, Xiaohui Wu wrote:
>
>> Hi all,
>>
>> I have millions of DNA reads (about 100M), each ~150 nt long, and
>> some of them are duplicates. Is there any way to group identical
>> sequences into one and count them, like the unique() function in
>> R, but also reporting the number of occurrences of each read, and
>> more efficient?
>> Also, if I want to cluster these 100M reads based on their
>> similarity, e.g. edit distance or some other distance <= 2, is
>> there a function or package that can be used?
>> Thank you!
>>
>> Xiaohui
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/ 
>> gmane.science.biology.informatics.conductor
>
>


