[BioC] Biostrings - vcountPattern optimization

Thu Jul 22 19:32:39 CEST 2010

Hi Patrick,

Thanks, this looks promising.  Three possible complications are:
(1)  The fragments are not all the same width.  Is this possible with Pdict?
(2)  I allow a variable number of mismatches based on each individual fragment's width.
(3)  The fragments sometimes include ambiguity letters (IUPAC extended letters).

A more accurate example would be:

#### start ####
fragments <- DNAStringSet(c("ACS","NCCAGAA")) # no indels
sequence_set <- DNAStringSet(c("ATAGCATACKACCA","GATTACGTACCADADATTACA") # variable widths
for (i in 1:length(fragments)) {
	counts <- vcountPattern(fragments[[i]],
		sequence_set,
		max.mismatch=floor(length(fragments[[i]])/5)) # variable mis-matches
	hits <- length(which(counts > 0))
	print(hits)
}
#### end ####

Do think it is possible to make this work Pdict for a speed improvement?

Thanks again!,
Erik

On Jul 22, 2010, at 12:11 PM, Patrick Aboyoun wrote:

> Erik,
> Have you tried vcountPDict? It will use an Aho - Corasick matching algorithm (http://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm) that is pretty fast, albeit memory intensive.
> 
> library(Biostrings)
> fragments<- DNAStringSet(c("ACTG","AAAA"))
> sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC"))
> pdict<- PDict(fragments)
> counts<- vcountPDict(pdict, sequence_set)
> 
>> counts
>     [,1] [,2]
> [1,]    0    0
> [2,]    0    0
> 
>> sessionInfo()
> R version 2.12.0 Under development (unstable) (2010-07-18 r52554)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
> 
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] Biostrings_2.17.26 IRanges_1.7.13
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.9.0 tools_2.12.0
> 
> 
> 
> 
> Patrick
> 
> 
> On 7/22/10 8:54 AM, Erik Wright wrote:
>> Hello,
>> 
>> Lately I have been working on counting sequence fragments in larger sets of sequences.  I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases.  Currently I am using the following method to count the number of "hits":
>> 
>> #### start ####
>> library(Biostrings)
>> fragments<- DNAStringSet(c("ACTG","AAAA"))
>> sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC"))
>> 
>> for (i in 1:length(fragments)) {
>> 	counts<- vcountPattern(fragments[[i]],
>> 		sequence_set,
>> 		max.mismatch=1)
>> 	hits<- length(which(counts>  0))
>> 	print(hits)
>> }
>> #### end ####
>> 
>> This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner?  Does anyone have a suggestion for how I can accomplish the same task more efficiently?
>> 
>> Thanks!,
>> Erik
>> 
>> 
>> 
>> 
>>   
>>> sessionInfo()
>>>     
>> R version 2.11.0 (2010-04-22)
>> x86_64-apple-darwin9.8.0
>> 
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> other attached packages:
>> [1] Biostrings_2.16.0 IRanges_1.6.0
>> 
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.8.0
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>   
>