[R] Help optimizing EMD::extrema()

Mon Feb 14 05:42:55 CET 2011

There are lots of modifications to make it faster.
E.g., replace the call to 'sequence' by a call to
'Sequence' and define Sequence by

Sequence <- function(nvec) {
   seq_len(sum(nvec)) - rep(cumsum(c(0L,nvec[-length(nvec)])), nvec)
}

That seemed to make the time for extrema(sample(-3:3,size=n,replace=TRUE))
linear in n (for n up to 4 million or so), while my original version
was looking quadratic out there.

It also seems like computing the distance to the end of
the current flat run could be simplified.

At this point there may well be other things in
the EMD code that take more time so speeding
extrema() further may not be worthwhile.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

> -----Original Message-----
> From: mike.lwrnc at gmail.com [mailto:mike.lwrnc at gmail.com] On 
> Behalf Of Mike Lawrence
> Sent: Sunday, February 13, 2011 7:08 PM
> To: William Dunlap
> Cc: r-help at lists.R-project.org; donghoh.kim at gmail.com
> Subject: Re: [R] Help optimizing EMD::extrema()
> 
> Wow, great work! This makes a huge difference (i.e. 5 mins to process
> an electrode instead of 5 hours).
> 
> I believe that the ndata and ndatam1 are there to let EMD::emd() and
> EMD::extractimf() (the functions that call EMD::extrema()) implement
> various boundary corrections. I personally don't intend on examining
> the start or end in great detail, so I think the uncorrected version
> is fine for my application.
> 
> Thanks again!
> 
> Mike
> 
> 
> On Sun, Feb 13, 2011 at 6:57 PM, William Dunlap 
> <wdunlap at tibco.com> wrote:
> > I did not try to emulate the ndata nad ndatam1 arguments
> > to extrema(), as I didn't see what they were for.  The
> > help file simply says they are the length of the first
> > argument and that minus 1 and that is what their default
> > values are.  If they do not have their default values
> > then extrema() frequently dies.  You could add them them
> > to the argument list and not use them, or check that
> > they are what they default to, as in
> >   function(x, ndata, ndatam1) {
> >      stopifnot(length(x)==ndata, ndata-1==ndatam1)
> >      ... rest of code ...
> > If the check fails then someone needs to say or figure out
> > what they are for.
> >
> > Bill Dunlap
> > Spotfire, TIBCO Software
> > wdunlap tibco.com
> >
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org
> >> [mailto:r-help-bounces at r-project.org] On Behalf Of William Dunlap
> >> Sent: Sunday, February 13, 2011 2:08 PM
> >> To: Mike Lawrence; r-help at lists.R-project.org
> >> Subject: Re: [R] Help optimizing EMD::extrema()
> >>
> >> > -----Original Message-----
> >> > From: r-help-bounces at r-project.org
> >> > [mailto:r-help-bounces at r-project.org] On Behalf Of Mike Lawrence
> >> > Sent: Friday, February 11, 2011 9:28 AM
> >> > To: r-help at lists.R-project.org
> >> > Subject: [R] Help optimizing EMD::extrema()
> >> >
> >> > Hi folks,
> >> >
> >> > I'm attempting to use the EMD package to analyze some 
> neuroimaging
> >> > data (timeseries with 64 channels sampled across 1 million
> >> time points
> >> > within each of 20 people). I found that processing a single
> >> channel of
> >> > data using EMD::emd() took about 8 hours. Exploration 
> using Rprof()
> >> > suggested that most of the compute time was spent in 
> EMD::extrema().
> >> > Looking at the code for EMD:extrema(), I managed to find 
> one obvious
> >> > speedup (switching from employing rbind() to c()) and I 
> suspect that
> >> > there may be a way to further speed things up by 
> pre-allocating all
> >> > the objects that are currently being created with c(), but
> >> I'm having
> >> > trouble understanding the code sufficiently to know
> >> when/where to try
> >> > this and what sizes to set as the default pre-allocation
> >> length. Below
> >> > I include code that demonstrates the speedup I achieved by
> >> eliminating
> >> > calls to rbind(), and also demonstrates that only a few 
> calls to c()
> >> > seem to be responsible for most of the compute time. The files
> >> > "extrema_c.R" and "extrema_c2.R" are available at:
> >> > https://gist.github.com/822691
> >>
> >> Try the following new.extrema().  It is based on
> >> looking at runs in the data in a vectorized way.
> >> On my old laptop the running times for length(x)=2^(2:118)
> >> with EMD::extrema and new.extrema are
> >>        old.time new.time
> >> 4          0.00     0.00
> >> 8          0.00     0.00
> >> 16         0.00     0.00
> >> 32         0.00     0.00
> >> 64         0.00     0.00
> >> 128        0.00     0.00
> >> 256        0.00     0.00
> >> 512        0.02     0.00
> >> 1024       0.03     0.00
> >> 2048       0.06     0.01
> >> 4096       0.14     0.00
> >> 8192       0.37     0.02
> >> 16384      1.08     0.03
> >> 32768      3.64     0.06
> >> 65536     13.35     0.12
> >> 131072    48.42     0.25
> >> 262144   206.17     0.59
> >>
> >> isEndOfStrictlyIncreasingRun <- function(x) {
> >>     c(FALSE, diff(diff(x) > 0) < 0, FALSE)
> >> }
> >>
> >> isStartOfStrictlyIncreasingRun <- function(x) {
> >>     c(FALSE, diff(diff(x) <= 0) < 0, FALSE)
> >> }
> >>
> >> isEndOfStrictlyDecreasingRun <- function(x) {
> >>     isEndOfStrictlyIncreasingRun(-x)
> >> }
> >>
> >> isStartOfStrictlyDecreasingRun <- function(x) {
> >>     isStartOfStrictlyIncreasingRun(-x)
> >> }
> >>
> >> isStartOfZeroRun <- function(x) {
> >>     x==0 & c(TRUE, x[-length(x)]!=0)
> >> }
> >>
> >> nToEndOfCurrentFlatRun <- function(x) {
> >>     # for each element of x, how far to end of current
> >>     # run of equal values.
> >>     rev( sequence(rle(rev(x))$lengths) - 1L)
> >> }
> >>
> >> indexOfEndOfCurrentFlatRun <- function(x) {
> >>     # should be a more direct way of doing this, but this is pretty
> >> quick
> >>     seq_len(length(x)) + nToEndOfCurrentFlatRun(x)
> >> }
> >>
> >> new.extrema <- function(x) {
> >>     f <- indexOfEndOfCurrentFlatRun(x)
> >>     isMaxStart <- isEndOfStrictlyIncreasingRun(x) &
> >> isStartOfStrictlyDecreasingRun(x)[f]
> >>     maxindex <- cbind(which(isMaxStart), f[isMaxStart],
> >> deparse.level=0)
> >>
> >>     isMinStart <- isEndOfStrictlyDecreasingRun(x) &
> >> isStartOfStrictlyIncreasingRun(x)[f]
> >>     minindex <- cbind(which(isMinStart), f[isMinStart],
> >> deparse.level=0)
> >>
> >>
> >>     # zero-crossings are bit weird: Report either the
> >> before-zero/after-zero
> >>     # pair or the start and stop of a a run of zero's 
> (even if the run
> >> is
> >>     # not part of an actual zero-crossing).  Do them 
> separately then
> >> sort.
> >>     # Also, if the entire sequence never actually crosses 0, do
> >>     # not report the zero-touchings.
> >>     # Also, if length(x)==2, the original doesn't report a
> >> zero-crossing
> >> or
> >>     # a zero run.  We do not copy that.
> >>     if (max(x) > 0 && min(x) < 0) {
> >>         indexOfZeroCrossingStart <- which(c(abs(diff(sign(x)))==2,
> >> FALSE))
> >>         indexOfZeroCrossingEnd <- indexOfZeroCrossingStart + 1L
> >>         indexOfZeroRunStart <- which(isStartOfZeroRun(x))
> >>         indexOfZeroRunEnd <- f[indexOfZeroRunStart]
> >>         crossStart <- c(indexOfZeroCrossingStart, 
> indexOfZeroRunStart)
> >>         cross <- cbind(crossStart, c(indexOfZeroCrossingEnd,
> >> indexOfZeroRunEnd), deparse.level=0)[order(crossStart),, 
> drop=FALSE]
> >>     } else {
> >>         cross <- cbind(integer(), integer())
> >>     }
> >>     # extrema likes to return NULL instead of a zero-row matrix,
> >>     # so we follow it.  Zero-row matrices are easier to deal with.
> >>     list(
> >>          minindex=if (nrow(minindex)) minindex else NULL,
> >>          maxindex=if (nrow(maxindex)) maxindex else NULL,
> >>          nextreme=nrow(minindex) + nrow(maxindex),
> >>          cross=if(nrow(cross)) cross else NULL,
> >>          ncross=nrow(cross) # extrema() returns numeric, 
> not integer,
> >> here
> >>     )
> >> }
> >>
> >> Bill Dunlap
> >> Spotfire, TIBCO Software
> >> wdunlap tibco.com
> >>
> >> >
> >> > Any suggestions/help would be greatly appreciated.
> >> >
> >> >
> >> > #load the EMD library for the default version of extrema
> >> > library(EMD)
> >> >
> >> > #some data to process
> >> > values = rnorm(1e4)
> >> >
> >> > #profile the default version of extrema
> >> > Rprof(tmp <- tempfile())
> >> > temp = extrema(values)
> >> > Rprof()
> >> > summaryRprof(tmp)
> >> > #1.2s total with most time spend doing rbind
> >> > unlink(tmp)
> >> >
> >> > #load a rbind-free version of extrema
> >> > source('extrema_c.R')
> >> > Rprof(tmp <- tempfile())
> >> > temp = extrema_c(values)
> >> > Rprof()
> >> > summaryRprof(tmp) #much faster! .5s total
> >> > unlink(tmp)
> >> >
> >> > #still, it encounters slowdowns with lots of data
> >> > values = rnorm(1e5)
> >> > Rprof(tmp <- tempfile())
> >> > temp = extrema_c(values)
> >> > Rprof()
> >> > summaryRprof(tmp)
> >> > #44s total, hard to see what's taking up so much time
> >> > unlink(tmp)
> >> >
> >> > #load an rbind-free version of extrema that labels each 
> call to c()
> >> > source('extrema_c2.R')
> >> > Rprof(tmp <- tempfile())
> >> > temp = extrema_c2(values)
> >> > Rprof()
> >> > summaryRprof(tmp)
> >> > #same time as above, but now we see that it spends more time in
> >> > certain calls to c() than others
> >> > unlink(tmp)
> >> >
> >> > ______________________________________________
> >> > R-help at r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> > PLEASE do read the posting guide
> >> > http://www.R-project.org/posting-guide.html
> >> > and provide commented, minimal, self-contained, 
> reproducible code.
> >> >
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >
>