[R] matching a sequence in a vector?

Petr Savicky savicky at cs.cas.cz
Thu Feb 16 09:01:49 CET 2012


On Wed, Feb 15, 2012 at 08:12:32PM -0500, Gabor Grothendieck wrote:
> On Tue, Feb 14, 2012 at 11:17 PM, Redding, Matthew
> <Matthew.Redding at deedi.qld.gov.au> wrote:
> > I've been trawling through the documentation and listserv archives on this topic -- but
> > as yet have not found a solution.  I'm sure this is pretty simple with R, but I cannot work out how without
> > resorting to ugly nested loops.
> >
> > As far as I can tell, grep, match, and %in% are not the correct tools.
> >
> > Question:
> > given these vectors --
> > patrn <- c(1,2,3,4)
> > exmpl <- c(3,3,4,2,3,1,2,3,4,8,8,23,1,2,3,4,4,34,4,3,2,1,1,2,3,4)
> >
> > how do I get the desired answer by finding the occurence of the pattern and returning the starting indices:
> > 6, 13, 23
> >
> 
> Here is a one-liner:
> 
> library(zoo)
> which(rollapply(exmpl, 4, identical, patrn, fill = FALSE, align = "left"))

Hi.

There were several solutions in this thread. Their speed differs
quite significantly. Here is a comparison.

  patrn <- 1:4
  exmpl <- sample(1:4, 10000, replace=TRUE)
  
  occur1 <- function(patrn, exmpl)
  {
    m <- length(patrn)
    n <- length(exmpl)
    candidate <- seq.int(length=n-m+1)
    for (i in seq.int(length=m)) {
        candidate <- candidate[patrn[i] == exmpl[candidate + i - 1]]
    }
    candidate
  }
  
  occur2 <- function(patrn, exmpl)
  {
    patrn.rev <- rev(patrn)
    w <- embed(exmpl,length(patrn))
    which(apply(w,1,function(r) all(r == patrn.rev)))
  }
  
  occur3 <- function(patrn, exmpl)
  {
    patrn.rev <- rev(patrn)
    w <- embed(exmpl,length(patrn))
    which(rowSums(w == rep(rev(patrn), each=nrow(w))) == ncol(w))
  }
  
  occur4 <- function(patrn, exmpl)
  {
    # requires patrn without duplicates
    n = length(patrn)
    r = rle(diff(match(exmpl, patrn)) == 1L)
    cumsum(r$length)[r$values & r$length == (n - 1L)] - (n - 2L)
  }
  
  occur5 <- function(patrn, exmpl)
  {
    which( sapply( 1:(length(exmpl)-length(patrn)+1), function(i) isTRUE( all.equal( patrn, exmpl[i + 0:(length(patrn)-1) ] ) ) ) )
  }
  
  occur6 <- function(patrn, exmpl)
  {
    indx <- embed(rev(seq_along(exmpl)), length(patrn))
    matches <- apply(indx, 1, function(.indx){
        all(exmpl[.indx] == patrn)
    })
    rev(indx[matches, 1L])
  }
  
  occur7 <- function(patrn, exmpl)
  {
    which(rollapply(exmpl, length(patrn), identical, patrn, fill = FALSE, align = "left"))
  }
  
  library(zoo)
  
  t1 <- system.time( out1 <- occur1(patrn, exmpl) )
  t2 <- system.time( out2 <- occur2(patrn, exmpl) )
  t3 <- system.time( out3 <- occur3(patrn, exmpl) )
  t4 <- system.time( out4 <- occur4(patrn, exmpl) )
  t5 <- system.time( out5 <- occur5(patrn, exmpl) )
  t6 <- system.time( out6 <- occur6(patrn, exmpl) )
  t7 <- system.time( out7 <- occur7(patrn, exmpl) )
  
  print(identical(out1, out2))
  print(identical(out1, out3))
  print(identical(out1, out4))
  print(identical(out1, out5))
  print(identical(out1, out6))
  print(identical(out1, out7))
  print(rbind(t1, t2, t3, t4, t5, t6, t7))

The output was

  [1] TRUE
  [1] TRUE
  [1] TRUE
  [1] TRUE
  [1] TRUE
  [1] TRUE
     user.self sys.self elapsed user.child sys.child
  t1     0.001        0   0.001          0         0
  t2     0.062        0   0.061          0         0
  t3     0.002        0   0.002          0         0
  t4     0.001        0   0.001          0         0
  t5     1.749        0   1.749          0         0
  t6     0.068        0   0.068          0         0
  t7     0.172        0   0.172          0         0

Petr Savicky.



More information about the R-help mailing list