# [R] AGREP

Henrik Bengtsson hb at maths.lth.se
Thu Feb 12 17:11:44 CET 2004

```> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Gabor
> Grothendieck
> Sent: den 12 februari 2004 16:07
> To: rodrigo.abt at sii.cl; r-help at stat.math.ethz.ch
> Subject: Re: [R] AGREP
>
> One could shorten it slightly with these minor improvements.
> Unfortunately, the key performance problem, the double loop
> at the end which implements the dynamic programming calculation,
> is still there.
>
> levenshtein<-function(s1,s2) {
>      # Make sure args are strings
>      a <- as.character(s1); an <- nchar(s1)+1
>      b <- as.character(s2); bn <- nchar(s2)+1
>
>      # If one arg is an empty string, returns the length of the
other
>      if (nchar(a)==0) return(nchar(b))
>      if (nchar(b)==0) return(nchar(a))
>
>      # Initialize matrix for calculations
>      m <- matrix(0, nrow=an, ncol=bn)
>      m[1,] <- 1:bn
>      m[,1] <- 1:an
>
>      # Cost calculation - line beginning (substr... is 0-1 cost f'n
>      for (i in 2:an)
>           for (j in 2:bn)
> 		  m[i,j] <- min( m[i-1,j]+1, m[i,j-1]+1, m[i-1,j-1]+
> 		       (substr(a,i-1,i-1)!=substr(b,j-1,j-1)) )
>
>      # Returns the distance
>      m[an,bn]-1
> }
>

But a very expensive part of the code though is the substr() calls.
Instead of doing this nchar(a)*nchar(b) times it's enough to do it
nchar(a)+nchar(b). Even better is to use strsplit() first as below:

levenshteinFast <- function(s1,s2) {
# Make sure args are strings
a <- as.character(s1)
b <- as.character(s2)

# Split strings into vectors
a <- strsplit(a, split="")[[1]]
b <- strsplit(b, split="")[[1]]

# If one arg is an empty string, returns the length of the other
an <- length(a)
bn <- length(b)
if (an==0) return(bn)
if (bn==0) return(an)

# Initialize matrix for calculations
m <- matrix(0, nrow=an+1, ncol=bn+1)
m[1,] <- 1:(bn+1)
m[,1] <- 1:(an+1)

# Cost calculation - line beginning (substr... is 0-1 cost f'n
for (i in 2:(an+1)) {
for (j in 2:(bn+1)) {
m[i,j] <- min(m[i-1,j  ] + 1,
m[i  ,j-1] + 1,
m[i-1,j-1] + !identical(a[i-1],b[j-1]))
}
}

# Returns the distance
m[an+1,bn+1]-1;
} # levenshteinFast()

# Example
N <- 500
s1 <- sample(letters, size=N, replace=TRUE)
s1 <- paste(s1, collapse="")
s2 <- sample(letters, size=N, replace=TRUE)
s2 <- paste(s2, collapse="")

t1 <- system.time(dist1 <- levenshtein(s1,s2))
print(c(t1,dist1))
# [1]  46.83   0.23  54.24     NA     NA 443.00

t2 <- system.time(dist2 <- levenshteinFast(s1,s2))
print(c(t2,dist2))
# [1]  18.82   0.07  20.90     NA     NA 443.00

/Henrik

> ---
> Date:   Thu, 12 Feb 2004 11:01:19 -0300
> From:   Rodrigo Abt <rodrigo.abt at sii.cl>
> To:   'Lista de Correo de R' <r-help at stat.math.ethz.ch>
> Subject:   Re: [R] AGREP
>
>
> "Marcos Sanches" <marcos.sanches at ipsos-opinion.com.br> writes:
>
> >Hi listers
> >
> >If you don't know what is the Edit Distance beetwen two
> strings, I will
> >try to explain, in fact it is very simple to understund but not to
> >calculate througth a program. It is simplilly the minimum number of

> >operations you must perform to transform string A on string B, by
> >operations I mean delete letters, insert letters or
> substitute letter.
> >
> >If you need to do few operations, it means string A is
> almost the same
> >as string B. Otherwise they are more differente as the number of
> >operations increase.
> >
> >If you have a idea of how to make a function to calculate this
> >distance, it would be very important for me.
> >
> >Thanks very much,
> >
> >Marcos
>
> I guess you're looking for Levenshtein distance, so try this:
>
> levenshtein<-function(s1,s2) {
>      # Make sure args are strings
>      a<-as.character(s1);an=nchar(s1)+1
>      b<-as.character(s2);bn=nchar(s2)+1
>
>      # Initialize matrix for calculations
>      m<-matrix(nrow=an,ncol=bn)
>
>      # If one arg is an empty string, returns the length of the
other
>      if (nchar(a)==0)
>           return(nchar(b))
>      if (nchar(b)==0)
>           return(nchar(a))
>
>      # Matrix initialization
>      for (i in 1:an) {
>           for (j in 1:bn) {
>                m[i,j]<-0
>                m[1,j]<-j
>           }
>           m[i,1]<-i
>      }
>
>      # Cost calculation
>      for (i in 2:an) {
>           for (j in 2:bn) {
>                if (substr(a,i-1,i-1)==substr(b,j-1,j-1))
>                     cost<-0
>                else
>                     cost<-1
>           m[i,j]=min(m[i-1,j]+1,m[i,j-1]+1,m[i-1,j-1]+cost)
>           }
>      }
>      # Returns the distance
>      m[an,bn]-1
> }
>
> Examples:
>
> > levenshtein("Great","Grreat")          <-- One addition
> [1] 1
> > levenshtein("mahrcoz","Marcos") <-- One substitution,one deletion
> and one substitution
> [1] 3
>
> Note that this function IS case sensitive. If you want to
> apply this on vectors of strings you'll have to write the
> corresponding wrapper function.
>
> Hope that helps,
>
> Rodrigo Abt B,
> Analyst,
> Dept. Economic Studies,
> SII, Chile.
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailma> n/listinfo/r-help
> do read the posting guide!
> http://www.R-project.org/posting-guide.html

```