[R] How to find frequent sequences.

Fri Jul 13 10:36:55 CEST 2012

On Thu, Jul 12, 2012 at 03:51:54PM -0500, Vineet Shukla wrote:
> I have independent event sequences for example as follows :
> 
> Independent event sequence 1: A, B, C, D
> Independent event sequence 2: A, C, B
> Independent event sequence 3: D, A, B, X, Y, Z
> Independent event sequence 4: C, A, A, B
> Independent event sequence 5: B, A, D
> 
> I want to be able to find the most common sequence patterns, such as
> 
> {A, B} => 3
> from lines 1,3,5.
> 
> Please note that A,C,B must not be considered because C comes in between,
> and line 5 also must not be considered because the order of A,B is reversed.

Hi.

If I understand correctly, the first sequence contains the patterns

  AB, BC, CD.

Using this interpretation, AB occurs at lines 1,3,4 and not 1,3,5.
Is this correct?
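
A quick way to check this interpretation is sketched below. The helper
name hasAdjacent is made up for illustration, and it only handles
patterns of length 2 (one event immediately followed by another).

```r
# the five sequences from the question
lst <- list(
  c("A", "B", "C", "D"),
  c("A", "C", "B"),
  c("D", "A", "B", "X", "Y", "Z"),
  c("C", "A", "A", "B"),
  c("B", "A", "D"))

# hypothetical helper: TRUE if 'first' is immediately followed by 'second' in x
hasAdjacent <- function(x, first, second)
{
    n <- length(x)
    if (n < 2) return(FALSE)
    # compare each element with its right-hand neighbour
    any(x[-n] == first & x[-1] == second)
}

# indices of the sequences that contain A immediately followed by B
ab <- which(sapply(lst, hasAdjacent, first="A", second="B"))
ab
```

With the data above, this should report sequences 1, 3 and 4.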

If a sequence contains several occurrences of a pattern, as in

   A, B, A, B

which contains AB twice, is the pattern then counted only once?
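
To make the question concrete, here is a small sketch that lists the
adjacent pairs of that sequence with and without duplicates. It assumes
that counting "only once" means dropping repeated pairs within a single
sequence.

```r
x <- c("A", "B", "A", "B")
n <- length(x)

# every element pasted to its right-hand neighbour: AB, BA, AB
pairs <- paste(x[-n], x[-1], sep="")
pairs

# each distinct pair kept once per sequence: AB, BA
upairs <- unique(pairs)
upairs
```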

If this is correct, then try the following:

  # your input list
  lst <- list(
  c("A", "B", "C", "D"),
  c("A", "C", "B"),
  c("D", "A", "B", "X", "Y", "Z"),
  c("C", "A", "A", "B"),
  c("B", "A", "D"))

  # extract unique patterns from a single sequence as rows of a matrix 
  # lpattern is the length of the patterns
  singleSeq <- function(x, lpattern)
  {
      unique(embed(rev(x), lpattern))
  }

  lst1 <- lapply(lst, singleSeq, lpattern=2)
  # combine the matrices to a single matrix
  mat <- do.call(rbind, lst1)
  # convert the patterns to strings
  pat <- do.call(paste, c(data.frame(mat), sep=""))
  out <- table(pat)
  out

  pat
  AA AB AC AD BA BC BX CA CB CD DA XY YZ 
   1  3  1  1  1  1  1  1  1  1  1  1  1 

  names(out)[which.max(out)]

  [1] "AB"
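
A possible variation, only as a sketch under some assumptions: which.max()
reports just the first pattern with the maximal count, embed() stops with
an error if a sequence is shorter than the pattern length, and sep="" can
be ambiguous when event names have more than one character. The helper
name patternsOf and the "," separator below are my own choices, not
something required by the question.

```r
lst <- list(
  c("A", "B", "C", "D"),
  c("A", "C", "B"),
  c("D", "A", "B", "X", "Y", "Z"),
  c("C", "A", "A", "B"),
  c("B", "A", "D"))

# hypothetical helper: distinct adjacent patterns of one sequence as strings;
# a sequence shorter than lpattern yields no patterns instead of an error
patternsOf <- function(x, lpattern)
{
    if (length(x) < lpattern) return(character(0))
    m <- unique(embed(rev(x), lpattern))
    # paste each row with "," so that multi-character events stay distinct
    apply(m, 1, paste, collapse=",")
}

pats <- lapply(lst, patternsOf, lpattern=2)
counts <- table(unlist(pats))

# all patterns that reach the maximal count, not only the first one
best <- names(counts)[as.vector(counts) == max(counts)]
best

# for each of the best patterns, the indices of the sequences containing it
hits <- lapply(setNames(best, best), function(p)
    which(vapply(pats, function(s) p %in% s, logical(1))))
hits
```

For the data above, the only pattern with the maximal count should be
"A,B", found in sequences 1, 3 and 4.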

Hope this helps.

Petr Savicky.