[BioC] IRanges: Need help speeding up sliding window analysis

Michael Dondrup Michael.Dondrup at uni.no
Wed Sep 29 13:00:03 CEST 2010


Hi Patrick,

just wanted to say thank you, that was exactly what I needed and it's much
faster and more robust than my initial solution.
It can be used f.e. to detect locations of strong changes in a
numeric Rle vector, eg. a coverage, I didn't know that 'diff' could be used on Rle's like this.

Best
Michael



Am Sep 24, 2010 um 5:51 PM schrieb Patrick Aboyoun:

> Michael,
> The IRanges package contains a number of built-in running window functions (runsum, runmean, runwtsum, runq) that you might want to consider for this operation. Also, if I am understanding what you are trying to do correctly, something like
> 
> width <- 30
> halfWidth <- width/2
> halfWidthSums <- runsum(rle, halfWidth)
> diff(halfWidthSums, halfWidth)/width
> 
> should fit the bill.
> 
> 
> Cheers,
> Patrick
> 
> 
> Quoting Martin Morgan <mtmorgan at fhcrc.org>:
> 
>> On 09/24/2010 03:11 AM, Michael Dondrup wrote:
>>> Hi,
>>> 
>>> I need some help with speeding up a sliding window analysis on an  Rle object of length > 1 million.
>>> I am using functions 'successiveViews' with negative gap width and  'viewApply' vs. viewMeans.
>>> My goal is to apply a discrete differential operator that computes  the  difference between the 'left'
>>> half of the window and the 'right', aka. a cheap discrete numeric  first order differentiation.
>>> 
>>> What I found is: viewMeans(x) << viewApply(x, mean) << viewApply(x,  diff.op) in terms of time, example below.
>>> Is there a way to pimp this code to make it work on the genome  scale? I appreciate your input, I am confident there
>>> is a better way to do it.
>> 
>> maybe
>> 
>>  win <- 30
>>  diff(cumsum(rle), win) / win
>> 
>> for numeric (not integer) rle, though there might be rounding problems
>> if cumsum gets large. A strategy might be to break the Rle into regions
>> separated by islands of at least 'win' 0's (using runLength / runValue
>> to identify candidate break points), which allows one to reset the
>> cumsum. Some inspiration might come from
>> http://www.mail-archive.com/r-help@r-project.org/msg75280.html.
>> 
>> Also the end points might need fiddling (e.g., by padding rle with 'win'
>> trailing zeros, which is I think in effect what successiveViews does.
>> 
>> Martin
>> 
>>> 
>>> Thank you very much
>>> Michael
>>> 
>>> Code example:
>>> 
>>> diff.op <- function(x, lrprop=1/2) {
>>>  len = length(x)
>>>  i = ceiling(len*lrprop)
>>>  (sum(x[i:len]) - sum(x[1:i])) / len
>>> }
>>> 
>>> sliding.window.apply <-  function(object, width, fun, ...) {
>>>  x <- trim(successiveViews(subject=object, width=rep(width,   ceiling(length(object))  ), gap=-width+1))
>>>  return (Rle(viewApply(x, fun, ...)))
>>> }
>>> 
>>> sliding.window.mean <-  function(object, width) {
>>>  x <- trim(successiveViews(subject=object, width=rep(width,   ceiling(length(object))  ), gap=-width+1))
>>>  return (viewMeans(x))
>>> }
>>> 
>>> rle <- Rle(1:10000)
>>> 
>>>> system.time(sliding.window.mean(rle, 30))
>>>   user  system elapsed
>>>  0.036   0.004   0.098
>>>> system.time(sliding.window.apply(rle,  30, mean))
>>>   user  system elapsed
>>>  4.380   0.065   6.010
>>>> system.time(sliding.window.apply(rle,  30, diff.op))
>>>   user  system elapsed
>>> 38.857   0.204  39.127
>>> 
>>>> sessionInfo()
>>> R version 2.11.1 (2010-05-31)
>>> x86_64-apple-darwin9.8.0
>>> 
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>> 
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>> 
>>> other attached packages:
>>> [1] IRanges_1.6.6
>>> 
>>> loaded via a namespace (and not attached):
>>> [1] annotate_1.26.1      AnnotationDbi_1.10.2 Biobase_2.8.0         DBI_0.2-5            DESeq_1.0.4
>>> [6] genefilter_1.30.0    geneplotter_1.26.0   grid_2.11.1           RColorBrewer_1.0-2   RSQLite_0.9-1
>>> [11] splines_2.11.1       survival_2.35-8      xtable_1.5-6
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:  http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:  http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
> 
> 
> 



More information about the Bioconductor mailing list