[R] Nested functions

Thomas Lumley tlumley at u.washington.edu
Tue Jul 18 16:55:31 CEST 2006


On Mon, 17 Jul 2006, John Wiedenhoeft wrote:

> Hi there,
>
> I'm having myself a hard time writing an algorithm for finding patterns
> within a given melody. In a vector I'd like to find ALL sequences that
> occur at least twice, without having to check all possible patterns via
> pattern matching.
>

Another approach, which works for not-too-long vectors like you have is:

   n <- length(v)
   matches <- outer(v, v, "==") & outer(1:n,1:n,">=")

Now matches has TRUE where v[i]==v[j]. For a longer match you would also 
need v[i+1]==v[j+1] and so on, making a diagonal line through the matrix. 
Diagonal lines are hard, so let's turn them into horizontal lines

   matches <- matrix(cbind(matches, FALSE), ncol=n)

now row i+1 column j of matches is TRUE for a single entry match starting 
at position j at a separation of i.  If there is a match of length 2, then 
column j+1 will also be TRUE, and so on.

Now rle() applied to a row will return the lengths of consecutive 
sequences of TRUE and FALSE. The lengths of consecutive sequences of TRUE 
are the lengths of the matches. To get rid of trivial matches of length 
less than 2 do
   match2 <-  t(apply(matches,1,function(row){
                      r<-rle(row)
                      r$values[r$lengths<2]<-FALSE
 	             inverse.rle(r)
                   }))



And finally, to extract the matches
   results <- apply(match2, 1, function(row){
                            r<-rle(row)
                            n<-length(r$lengths)
                            ends<-cumsum(r$lengths)
                            starts<-cumsum(c(1,r$lengths))[1:n]
                            list(starts[r$values],ends[r$values])
                     })
for starts and ends of matches or
   results <- apply(match2, 1, function(row){
                          r<-rle(row)
                          n<-length(r$lengths)
                          ends<-cumsum(r$lengths)[r$values]
                          starts<-cumsum(c(1,r$lengths))[1:n][r$values]
                          mapply(function(stt,end) v[stt:end],starts,ends,
                                  SIMPLIFY=FALSE)
                     })
to get a list of the actual matching sequences.


 	-thomas



More information about the R-help mailing list