[BioC] Biostrings - vcountPattern optimization
Patrick Aboyoun
paboyoun at fhcrc.org
Thu Jul 22 19:11:28 CEST 2010
Erik,
Have you tried vcountPDict? It will use an Aho - Corasick matching
algorithm
(http://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm)
that is pretty fast, albeit memory intensive.
library(Biostrings)
fragments<- DNAStringSet(c("ACTG","AAAA"))
sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC"))
pdict<- PDict(fragments)
counts<- vcountPDict(pdict, sequence_set)
> counts
[,1] [,2]
[1,] 0 0
[2,] 0 0
> sessionInfo()
R version 2.12.0 Under development (unstable) (2010-07-18 r52554)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.17.26 IRanges_1.7.13
loaded via a namespace (and not attached):
[1] Biobase_2.9.0 tools_2.12.0
Patrick
On 7/22/10 8:54 AM, Erik Wright wrote:
> Hello,
>
> Lately I have been working on counting sequence fragments in larger sets of sequences. I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases. Currently I am using the following method to count the number of "hits":
>
> #### start ####
> library(Biostrings)
> fragments<- DNAStringSet(c("ACTG","AAAA"))
> sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC"))
>
> for (i in 1:length(fragments)) {
> counts<- vcountPattern(fragments[[i]],
> sequence_set,
> max.mismatch=1)
> hits<- length(which(counts> 0))
> print(hits)
> }
> #### end ####
>
> This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner? Does anyone have a suggestion for how I can accomplish the same task more efficiently?
>
> Thanks!,
> Erik
>
>
>
>
>
>> sessionInfo()
>>
> R version 2.11.0 (2010-04-22)
> x86_64-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Biostrings_2.16.0 IRanges_1.6.0
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.8.0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list