[BioC] IRanges: Need help speeding up sliding window analysis

Martin Morgan mtmorgan at fhcrc.org
Fri Sep 24 14:56:24 CEST 2010


On 09/24/2010 03:11 AM, Michael Dondrup wrote:
> Hi,
> 
> I need some help with speeding up a sliding window analysis on an Rle object of length > 1 million.
> I am using functions 'successiveViews' with negative gap width and 'viewApply' vs. viewMeans.
> My goal is to apply a discrete differential operator that computes the  difference between the 'left'
> half of the window and the 'right', aka. a cheap discrete numeric first order differentiation.
> 
> What I found is: viewMeans(x) << viewApply(x, mean) << viewApply(x, diff.op) in terms of time, example below.
> Is there a way to pimp this code to make it work on the genome scale? I appreciate your input, I am confident there 
> is a better way to do it.

maybe

  win <- 30
  diff(cumsum(rle), win) / win

for numeric (not integer) rle, though there might be rounding problems
if cumsum gets large. A strategy might be to break the Rle into regions
separated by islands of at least 'win' 0's (using runLength / runValue
to identify candidate break points), which allows one to reset the
cumsum. Some inspiration might come from
http://www.mail-archive.com/r-help@r-project.org/msg75280.html.

Also the end points might need fiddling (e.g., by padding rle with 'win'
trailing zeros, which is I think in effect what successiveViews does.

Martin

> 
> Thank you very much 
> Michael
> 
> Code example:
> 
> diff.op <- function(x, lrprop=1/2) { 
>   len = length(x)
>   i = ceiling(len*lrprop)  
>   (sum(x[i:len]) - sum(x[1:i])) / len
> }
> 
> sliding.window.apply <-  function(object, width, fun, ...) {
>   x <- trim(successiveViews(subject=object, width=rep(width,  ceiling(length(object))  ), gap=-width+1))
>   return (Rle(viewApply(x, fun, ...)))
> }
> 
> sliding.window.mean <-  function(object, width) {
>   x <- trim(successiveViews(subject=object, width=rep(width,  ceiling(length(object))  ), gap=-width+1))
>   return (viewMeans(x))
> }
> 
> rle <- Rle(1:10000)
> 
>> system.time(sliding.window.mean(rle, 30))
>    user  system elapsed 
>   0.036   0.004   0.098 
>> system.time(sliding.window.apply(rle,  30, mean))
>    user  system elapsed 
>   4.380   0.065   6.010 
>> system.time(sliding.window.apply(rle,  30, diff.op))
>    user  system elapsed 
>  38.857   0.204  39.127 
> 
>> sessionInfo()
> R version 2.11.1 (2010-05-31) 
> x86_64-apple-darwin9.8.0 
> 
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> 
> other attached packages:
> [1] IRanges_1.6.6
> 
> loaded via a namespace (and not attached):
>  [1] annotate_1.26.1      AnnotationDbi_1.10.2 Biobase_2.8.0        DBI_0.2-5            DESeq_1.0.4         
>  [6] genefilter_1.30.0    geneplotter_1.26.0   grid_2.11.1          RColorBrewer_1.0-2   RSQLite_0.9-1       
> [11] splines_2.11.1       survival_2.35-8      xtable_1.5-6        
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list