[BioC] advice on Biostrings

Wed Feb 22 18:20:49 CET 2006

Actually, this function in Biostrings uses one of those fiendishly
clever algorithms coded in C, and as a consequence is *very* fast: on
my puny windows laptop, searching a random sequence of 10 million bp
for "GC" returns more than 1/2 million matches in about a 10th of a
second:

> seq <- paste( dna[ runif( 10000000, min=1, max=5 )], collapse="" )
> system.time( res <- length(matchDNAPattern( "GC", seq ))); res
[1] 0.13 0.00 0.12   NA   NA
[1] 626393

Maybe your speed issues are related to R, or to memory management, or
'user error' ;) ?. Here are suggestions:

* If the match pattern "GC" is a character, it is converted to a
DNAString by matchDNAPattern. Short-circuit this step by

GC <- DNAString("GC")

outside the sapply, and then

matchDNAPattern( GC, y )

This seems to actually make quite a bit of difference:

> seqs <- lapply( 1:1000, function(i) paste( dna[ runif( 100, min=1, max=5 )], collapse="" ))
> system.time( res <- sapply( seqs, function(s) length(matchDNAPattern( "GC", s ))))
[1] 27.52  0.01 27.61    NA    NA
> GC <- DNAString("GC")
> system.time( res <- sapply( seqs, function(s) length(matchDNAPattern( GC, s ))))
[1] 16.53  0.00 16.56    NA    NA
> 

* Make sure your function is actually doing what you want. My mistake
in exploring this was to write

> seq <- paste( dna[ runif( 10000000, min=1, max=5 )])

(i.e., without the 'collapse' argument). this (tried to) create a
vector with 10 million single characters. The sapply would then have
looped over these, rather than the single string I intended!

* Do the calculations in parallel

> library(snow)
> cl <- makeCluster(4, "MPI")
> clusterEvalQ(cl, library(Biostrings))
> system.time(res <- parSapply( cl, seqs, function(s) length(matchDNAPattern( GC, s ))))
[1] 0.01 0.00 4.56 0.00 0.00

!

* Perhaps your R is spending a lot of time swapping virtual memory to
disk. Try garbage collection

gc()

before performing the analysis. If this doesn't halp, and it seems
like memory really is an issue, then perhaps reading the strings in a
chunk at a time would be necessary. You might alos try separating the
sapply into three separate calls, one for each function. I doubt this
is really a problem, unless you're dealing with *very* large
collections of sequence.

* I found that analyzing some number of bp in a few strings is much
faster than analyzing the same total number of bp in many strings,
i.e., the bottleneck is actually in sapply. There are apparently
tricks to enhancing sapply speed, though my exploration in the present
context didn't provide any dramatic results.

Hope these suggestions help...

Martin

Wolfgang Huber <huber at ebi.ac.uk> writes:

> Rafael A Irizarry wrote:
>> hi im using biostrings to count base content as well as pair of bases 
>> content. im using the following sniped of code:
>> 
>
> Hi Rafa,
>
> to count symbols in character vectors, matchprobes:basecontent is fast:
>
> library(matchprobes)
> v   = c("AAACT", "GGGTT", "ggAtT")
> bc  = basecontent(v)
> print.default(bc)
> bc[,"C"]+bc[,"G"]
>
> and if there is interest I'd be happy amend the C code to also count 
> pairs of bases (or you could, it is not terribly complicated).
>
>   Cheers
>   Wolfgang
>
>> 
>> ###pmseq is a vector of character strings (not of the same nchar).
>> tmp <- sapply(pmseq,function(x){
>>   y = DNAString(x)
>>   c(alphabetFrequency(y)[2:5], ##count A,T,G,C
>>     length(matchDNAPattern("GC",y))+length(matchDNAPattern("CG",y))) 
>> ##count GC or CG
>> })
>> 
>> it is painfully slow. strsplit and grep were much faster for the first 
>> part (counting bases) but the using grep for the second part was not 
>> straight forward.
>> 
>> any suggestions?
>
>
> -------------------------------------
> Wolfgang Huber
> European Bioinformatics Institute
> European Molecular Biology Laboratory
> Cambridge CB10 1SD
> England
> Phone: +44 1223 494642
> Fax:   +44 1223 494486
> Http:  www.ebi.ac.uk/huber
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor