[BioC] advice on Biostrings

Wed Feb 22 01:58:54 CET 2006

Hi Rafael,

Comparing the speed of

     > grep("GC", y, fixed=TRUE)

with the speed of:

     > matchDNAPattern("GC", DNAString(y))

is a little unfair since the former will stop searching after it founds
the first occurence of "GC" (which is likely to happen very early since
nchar("GC") is only 2, probably in the first 10 or 20 letters of a random
TGCA string), while the latter will process the entire string in order
to count the total number of occurences of "GC".

So yes, grep is faster and is all what you need as long as you are not
interested in counting the number of matches (or retrieving their offsets).
On my system:

     > library(Biostrings)
     > y <- scan(file="bigrandomTGCA.txt", what="")
    Read 1 item
     > nchar(y)
    [1] 10000000
     > system.time(grep("GC", y, fixed=TRUE))
    [1] 0.01 0.00 0.00 0.00 0.00
     > dy <- DNAString(y)
     > system.time(length(matchDNAPattern("GC", dy)))
    [1] 0.07 0.01 0.08 0.00 0.00

Now if you need to count the number of matches, using length(strsplit()) 
might
be faster than matchDNAPattern() on small strings (nchar < 5000) but it will
definetly be __much__ slower on big strings:

     > nchar(y2)
    [1] 18314
     > system.time(length(strsplit(y2, "A", fixed=TRUE)[[1]]))
    [1] 0.08 0.00 0.09 0.00 0.00
     > dy2 = DNAString(y2)
     > system.time(length(matchDNAPattern("A", dy2)))
    [1] 0.02 0.00 0.01 0.00 0.00

Don't even try strsplit() on a 10 millions character string: it will 
take forever
and you won't be able to interrupt with CTRL C...

Regards,

H.

Rafael A Irizarry wrote:

> hi im using biostrings to count base content as well as pair of bases
> content. im using the following sniped of code:
>
>
> ###pmseq is a vector of character strings (not of the same nchar).
> tmp <- sapply(pmseq,function(x){
> y = DNAString(x)
> c(alphabetFrequency(y)[2:5], ##count A,T,G,C
> length(matchDNAPattern("GC",y))+length(matchDNAPattern("CG",y)))
> ##count GC or CG
> })
>
> it is painfully slow. strsplit and grep were much faster for the first
> part (counting bases) but the using grep for the second part was not
> straight forward.
>
> any suggestions?
>
> -r
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor

-- 
------------------------
Hervé Pagès
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319