[BioC] speed of runmed on SimpleRleList

Wed Feb 20 00:51:21 CET 2013

Hi Janet,

The culprit seems to be the call to smoothEnds() at the end of the
"runmed" method for Rle objects:

   > selectMethod("runmed", "Rle")
   Method Definition:

   function (x, k, endrule = c("median", "keep", "drop", "constant"),
       algorithm = NULL, print.level = 0)
   {
       if (!all(is.finite(as.vector(x))))
           stop("NA/NaN/Inf not supported in runmed,Rle-method")
       endrule <- match.arg(endrule)
       n <- length(x)
       k <- normargRunK(k = k, n = n, endrule = endrule)
       i <- (k + 1L)%/%2L
       ans <- runq(x, k = k, i = i)
       if (endrule == "constant") {
           runLength(ans)[1L] <- runLength(ans)[1L] + (i - 1L)
           runLength(ans)[nrun(ans)] <- runLength(ans)[nrun(ans)] +
               (i - 1L)
       }
       else if (endrule != "drop") {
           ans <- c(head(x, i - 1L), ans, tail(x, i - 1L))
           if (endrule == "median") {
               ans <- smoothEnds(ans, k = k)
           }
       }
       ans
   }

Not too surprisingly, the complexity of smoothEnds() doesn't depend on
the length of its first arg:

   > system.time(smoothEnds(Rle(0L, 10), k=5))
      user  system elapsed
     0.028   0.000   0.026
   > system.time(smoothEnds(Rle(0L, 10000), k=5))
      user  system elapsed
     0.028   0.000   0.026

but is linear w.r.t to the value of its 2nd ag:

   > system.time(smoothEnds(Rle(0L, 10000), k=51))
      user  system elapsed
     0.280   0.000   0.282
   > system.time(smoothEnds(Rle(0L, 10000), k=511))
      user  system elapsed
     2.836   0.000   2.841

In other words, runmed() will be pretty fast on a full genome
coverage, but you will pay an additional fixed price for each
chromosome in the genome, and this fixed priced only depends
on the length of the median window.

Note that the "smoothEnds" method for Rle objects is just
re-using the exact same code as stats::smoothEnds(), which is not
very efficient on Rle's. So we'll look into ways to improve this.

Cheers,
H.

On 02/15/2013 12:27 PM, Janet Young wrote:
> Hi there,
>
> I've been using runmean on some coverage objects - nice.   Runmed also seems useful, but also seems very slow (oddly so) - I'm wondering whether there's some easy improvement could be made there?  Same issue with devel version and an older version.  All should be explained in the code below (I hope).
>
> thanks very much,
>
> Janet
>
> -------------------------------------------------------------------
>
> Dr. Janet Young
>
> Malik lab
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Avenue N., C3-168,
> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>
> tel: (206) 667 1471 fax: (206) 667 6524
> email: jayoung  ...at...  fhcrc.org
>
> -------------------------------------------------------------------
>
>
>
> library(GenomicRanges)
>
> ### a small GRanges object (example from ?GRanges)
> seqinfo <- Seqinfo(paste0("chr", 1:3), c(1000, 2000, 1500), NA, "mock1")
> gr2 <-GRanges(seqnames =
>            Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
>            ranges = IRanges(1:10, width = 10:1, names = head(letters,10)),
>            strand = Rle(strand(c("-", "+", "*", "+", "-")),c(1, 2, 2, 3, 2)),
>            score = 1:10,
>            GC = seq(1, 0, length=10),
>            seqinfo=seqinfo)
> gr2
>
> cov <- coverage(gr2)
>
> ### runmed is slow! (for you Hutchies: this is on a rhino machine)
> ### I'm really trying to run this on some much bigger objects (whole genome coverage), where the slowness is more of an issue.
> system.time(runmed(cov, 51))
> #   user  system elapsed
> #  1.120   0.016   1.518
>
> ### runmean is fast
> system.time(runmean(cov, 51))
> #    user  system elapsed
> #   0.008   0.000   0.005
>
> sessionInfo()
>
>
> R Under development (unstable) (2012-10-03 r60868)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] GenomicRanges_1.11.29 IRanges_1.17.32       BiocGenerics_0.5.6
>
> loaded via a namespace (and not attached):
> [1] stats4_2.16.0
>
>
> ## see a similar issue on my Mac, using older R
>
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] GenomicRanges_1.10.1 IRanges_1.16.2       BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] parallel_2.15.1 stats4_2.15.1
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319