[R] finding peaks in a simple dataset with R

Thu Nov 24 16:06:52 CET 2005

>>>>> "Gabor" == Gabor Grothendieck <ggrothendieck at gmail.com>
>>>>>     on Wed, 23 Nov 2005 16:51:17 -0500 writes:

    Gabor> One idea might be, rather than have a peaks function, enhance
    Gabor> rle so that it optionally produces a third component with the
    Gabor> peak information, perhaps 1, 0, -1 for peak, neither and trough.
    Gabor> This would avoid any problems with ties since the output of rle is
    Gabor> based on runs.

that's an interesting idea.   Contributions?

{see also my comments further below}

    Gabor> On 11/23/05, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
    >> On 11/23/05, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
    >> > >>>>> "Marc" == Marc Kirchner <marc.kirchner at iwr.uni-heidelberg.de>
    >> > >>>>>     on Wed, 23 Nov 2005 14:33:28 +0000 writes:
    >> >
    >> >    >>
    >> >    >> I wonder if we shouldn't polish that a bit and add to R's
    >> >    >> standard 'utils' package.
    >> >    >>
    >> >
    >> >    Marc> Hm, I figured out there are (at least) two versions out there, one being
    >> >    Marc> the "original" idea and a modification.
    >> >
    >> >    Marc> === Petr Pikal in 2001 (based on Brian Ripley's idea)==
    >> >    Marc> peaks <- function(series, span=3) {
    >> >    Marc> z <- embed(series, span)
    >> >    Marc> result <- max.col(z) == 1 + span %/% 2
    >> >    Marc> result
    >> >    Marc> }
    >> >
    >> >    Marc> versus
    >> >
    >> >    Marc> === Petr Pikal in 2004 ==
    >> >    Marc> peaks2<-function(series,span=3) {
    >> >    Marc> z <- embed(series, span)
    >> >    Marc> s <- span%/%2
    >> >    Marc> v<- max.col(z) == 1 + s
    >> >    Marc> result <- c(rep(FALSE,s),v)
    >> >    Marc> result <- result[1:(length(result)-s)]
    >> >    Marc> result
    >> >    Marc> }
    >> >
    >> > Thank you, Marc,
    >> >
    >> >    Marc> Comparison shows
    >> >    >> peaks(c(1,4,1,1,6,1,5,1,1),3)
    >> >    Marc> [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
    >> >    Marc> which is a logical vector for elements 2:N-1 and
    >> >
    >> >    >> peaks2(c(1,4,1,1,6,1,5,1,1),3)
    >> >    Marc> [1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
    >> >    Marc> which is a logical vector for elements 1:N-2.
    >> >
    >> >    Marc> As I would expect to "lose" (span-1)/2 elements on each side
    >> >    Marc> of the vector, to me the 2001 version feels more natural.
    >> >
    >> > I think for the function to be more useful it the result should
    >> > have the original vector length and hence I'd propose to also
    >> > pad with FALSE at the upper end.
    >> >
    >> >    Marc> Also, both "suffer" from being non-deterministic in the
    >> >    Marc> multiple-maxima-case (the two 4s here)
    >> >
    >> > yes, of course, because of max.col().
    >> > Note that Venables & Ripley would consider this to be rather a feature.
    >> >
    >> >    Marc> This could (should?) be fixed by modifying the call to max.col()
    >> >    Marc> result <- max.col(z, "first") == 1 + span %/% 2;
    >> >
    >> > Actually I think it should become an option, but I'd use "first"
    >> > as default.
    >> >
    >> >    Marc> Just my two cents,
    >> >
    >> > Thank you.
    >> >
    >> > Here is my current proposal which also demonstrates why it's
    >> > useful to pad with FALSE :
    >> >
    >> > peaks <-function(series, span = 3, ties.method = "first") {
    >> >    if((span <- as.integer(span)) %% 2 != 1) stop("'span' must be odd")
    >> >    z <- embed(series, span)
    >> >    s <- span %/% 2
    >> >    v <- max.col(z, ties.method = ties.method) == s + 1:1
    >> >    pad <- rep(FALSE, s)
    >> >    c(pad, v, pad)
    >> > }
    >> >
    >> > y <- c(1,4,1,1,6,1,5,1,1) ; (ii <- which(peaks(y))); y[ii]
    >> > ##- [1] 2 5 7
    >> > ##- [1] 4 6 5
    >> >
    >> > set.seed(7)
    >> > y <- rpois(100, lambda = 7)
    >> > py <- peaks(y)
    >> > plot(y, type="o", cex = 1/4, main = "y and peaks(y,3)")
    >> > points(seq(y)[py], y[py], col = 2, cex = 1.5)
    >> >
    >> > p7 <- peaks(y,7)
    >> > points(seq(y)[p7], y[p7], col = 3, cex = 2)
    >> > mtext("peaks(y,7)", col=3)
    >> 
    >> I think ties are still problems with this approach
    >> as:
    >> 
    >> set.seed(1)

[that's not needed for the above peaks()!]

    >> peaks( c(1,2,2,2,3), 3 )
    >> 
    >> gives a peak in the 2,2,2 stretch.

Yes, thank you for the example, I've noticed similar behavior in
the plots I gave above.

    >> Also NA would seem to be a better pad than false or maybe
    >> it should be specifiable including whether there is
    >> padding at all.

NA's have the big drawback of producing a logical vector that
can NOT be used for subsetting -- and subsetting was exactly a main
reason for the padding...

I agree however that an option to "not pad" is sensible.

    >> The zoo rapply and rollmax functions which can also be used
    >> to specify a similar naive peak function have na.pad=
    >> and align= arguments.  Also any peaks function should
    >> be generic so that various time series classes can implement
    >> their own methods and in the case of irregularly spaced series
    >> note that there are two possibilities for span, the time distance
    >> and the number of points, and they are not the same.
    >> 
    >> It might also be nice to be able to get back peaks, troughs or
    >> both via  1,0,-1 in the output.
    >> 

    Gabor> ______________________________________________
    Gabor> R-help at stat.math.ethz.ch mailing list
    Gabor> https://stat.ethz.ch/mailman/listinfo/r-help
    Gabor> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html