[BioC] advice on Biostrings

Wed Feb 22 11:06:11 CET 2006

Rafael A Irizarry wrote:
> Hi, I'm using Biostrings to count base content as well as the content of
> pairs of bases. I'm using the following snippet of code:
> 

Hi Rafa,

to count symbols in character vectors, matchprobes::basecontent is fast:

library(matchprobes)
v   = c("AAACT", "GGGTT", "ggAtT")
bc  = basecontent(v)     # one row per sequence, one column per base
print.default(bc)
bc[,"C"]+bc[,"G"]        # C plus G count for each sequence

and if there is interest I'd be happy to amend the C code to also count 
pairs of bases (or you could; it is not terribly complicated).
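
In the meantime, the pair counts can probably be had in plain R without
any C code. The following is only a rough sketch under a few assumptions:
count_fixed is a made-up helper name (it is not part of matchprobes or
Biostrings), the input is upper-cased first so that "gc" and "GC" are
treated alike, and the example vector is invented for illustration.

```r
# Hypothetical helper: number of occurrences of a fixed pattern in each
# element of a character vector. gregexpr() reports -1 for "no match",
# so only positive start positions are counted.
count_fixed = function(pattern, x) {
  hits = gregexpr(pattern, x, fixed = TRUE)
  vapply(hits, function(h) sum(h > 0), integer(1))
}

v = c("AAACT", "GGGTT", "ggAtT", "GCGCG")
u = toupper(v)   # assumption: case should be ignored

# gregexpr() finds non-overlapping matches. That is fine for "GC" and "CG",
# which cannot overlap themselves, but it would undercount a pair such as
# "AA" in "AAA".
gc_pairs = count_fixed("GC", u) + count_fixed("CG", u)
gc_pairs
```

Because gregexpr is vectorised over the whole character vector, this
avoids building one DNAString per sequence inside sapply.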

  Cheers
  Wolfgang

> 
> ###pmseq is a vector of character strings (not of the same nchar).
> tmp <- sapply(pmseq,function(x){
>   y = DNAString(x)
>   c(alphabetFrequency(y)[2:5], ##count A,T,G,C
>     length(matchDNAPattern("GC",y))+length(matchDNAPattern("CG",y))) 
> ##count GC or CG
> })
> 
> It is painfully slow. strsplit and grep were much faster for the first 
> part (counting bases), but using grep for the second part was not 
> straightforward.
> 
> any suggestions?

-------------------------------------
Wolfgang Huber
European Bioinformatics Institute
European Molecular Biology Laboratory
Cambridge CB10 1SD
England
Phone: +44 1223 494642
Fax:   +44 1223 494486
Http:  www.ebi.ac.uk/huber