[BioC] Biostrings - vcountPattern optimization

Thu Jul 22 17:54:28 CEST 2010

Hello,

Lately I have been working on counting sequence fragments in larger sets of sequences.  I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases.  Currently I am using the following method to count the number of "hits":

#### start ####
library(Biostrings)
fragments <- DNAStringSet(c("ACTG","AAAA"))
sequence_set <- DNAStringSet(c("TAGACATGAC","ACCAAATGAC"))

for (i in 1:length(fragments)) {
	counts <- vcountPattern(fragments[[i]],
		sequence_set,
		max.mismatch=1)
	hits <- length(which(counts > 0))
	print(hits)
}
#### end ####

This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner?  Does anyone have a suggestion for how I can accomplish the same task more efficiently?

Thanks!,
Erik

> sessionInfo()
R version 2.11.0 (2010-04-22) 
x86_64-apple-darwin9.8.0 

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.16.0 IRanges_1.6.0    

loaded via a namespace (and not attached):
[1] Biobase_2.8.0