[BioC] advice on Biostrings

Tue Feb 21 22:19:20 CET 2006

Hi, I'm using Biostrings to count base content as well as the content of 
pairs of bases. I'm using the following snippet of code:

###pmseq is a vector of character strings (not of the same nchar).
tmp <- sapply(pmseq,function(x){
  y = DNAString(x)
  c(alphabetFrequency(y)[2:5], ##count A,T,G,C
    ##count GC or CG
    length(matchDNAPattern("GC",y))+length(matchDNAPattern("CG",y)))
})

It is painfully slow. strsplit and grep were much faster for the first 
part (counting bases), but using grep for the second part (counting 
pairs) was not straightforward.
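
For reference, here is a minimal base-R sketch of the strsplit-style approach mentioned above, with one possible way to handle the pair counts: compare each character with its right-hand neighbour instead of using grep. The function name count_bases_and_pairs and the two sample sequences are hypothetical, and the sketch assumes upper-case input containing only A, C, G and T. Whether it is actually faster than the Biostrings version would need timing on real data.

```r
## Hypothetical helper: count single bases and GC/CG pairs in one sequence.
## Assumes upper-case input made up of A, C, G, T only.
count_bases_and_pairs <- function(x) {
  ch <- strsplit(x, "")[[1]]          # split into single characters
  n <- length(ch)
  ## one count per base; bases that do not occur are reported as 0
  base_counts <- vapply(c("A", "C", "G", "T"),
                        function(b) sum(ch == b), integer(1))
  ## adjacent pairs (overlaps included), e.g. "GCG" gives "GC" and "CG"
  pairs <- if (n > 1) paste0(ch[-n], ch[-1]) else character(0)
  c(base_counts, GCorCG = sum(pairs %in% c("GC", "CG")))
}

## Made-up example input, not real probe sequences
pmseq <- c("ACGCGT", "GGCA")
tmp <- sapply(pmseq, count_bases_and_pairs)
```

As in the sapply call above, tmp has one column per sequence, here with rows A, C, G, T and GCorCG. Overlapping pairs are counted separately, which mirrors adding the "GC" and "CG" match counts.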

Any suggestions?

-r