# [R] AGREP

Gabor Grothendieck ggrothendieck at myway.com
Thu Feb 12 16:06:46 CET 2004

```

One could shorten it slightly with these minor improvements.  Unfortunately, the key performance problem, the double loop
at the end which implements the dynamic programming calculation,
is still there.

levenshtein<-function(s1,s2) {
# Make sure args are strings
a <- as.character(s1); an <- nchar(s1)+1
b <- as.character(s2); bn <- nchar(s2)+1

# If one arg is an empty string, returns the length of the other
if (nchar(a)==0) return(nchar(b))
if (nchar(b)==0) return(nchar(a))

# Initialize matrix for calculations
m <- matrix(0, nrow=an, ncol=bn)
m[1,] <- 1:bn
m[,1] <- 1:an

# Cost calculation - line beginning (substr... is 0-1 cost f'n
for (i in 2:an)
for (j in 2:bn)
m[i,j] <- min( m[i-1,j]+1, m[i,j-1]+1, m[i-1,j-1]+
(substr(a,i-1,i-1)!=substr(b,j-1,j-1)) )

# Returns the distance
m[an,bn]-1
}

---
Date:   Thu, 12 Feb 2004 11:01:19 -0300
From:   Rodrigo Abt <rodrigo.abt at sii.cl>
To:   'Lista de Correo de R' <r-help at stat.math.ethz.ch>
Subject:   Re: [R] AGREP

"Marcos Sanches" <marcos.sanches at ipsos-opinion.com.br> writes:

>Hi listers
>
>If you don't know what is the Edit Distance beetwen two strings, I will
>try to explain, in fact it is very simple to understund but not to
>calculate througth a program. It is simplilly the minimum number of
>operations you must perform to transform string A on string B, by
>operations I mean delete letters, insert letters or substitute letter.
>
>If you need to do few operations, it means string A is almost the same
>as string B. Otherwise they are more differente as the number of
>operations increase.
>
>If you have a idea of how to make a function to calculate this distance,
>it would be very important for me.
>
>Thanks very much,
>
>Marcos

I guess you're looking for Levenshtein distance, so try this:

levenshtein<-function(s1,s2) {
# Make sure args are strings
a<-as.character(s1);an=nchar(s1)+1
b<-as.character(s2);bn=nchar(s2)+1

# Initialize matrix for calculations
m<-matrix(nrow=an,ncol=bn)

# If one arg is an empty string, returns the length of the other
if (nchar(a)==0)
return(nchar(b))
if (nchar(b)==0)
return(nchar(a))

# Matrix initialization
for (i in 1:an) {
for (j in 1:bn) {
m[i,j]<-0
m[1,j]<-j
}
m[i,1]<-i
}

# Cost calculation
for (i in 2:an) {
for (j in 2:bn) {
if (substr(a,i-1,i-1)==substr(b,j-1,j-1))
cost<-0
else
cost<-1
m[i,j]=min(m[i-1,j]+1,m[i,j-1]+1,m[i-1,j-1]+cost)
}
}
# Returns the distance
m[an,bn]-1
}

Examples:

[1] 1
> levenshtein("mahrcoz","Marcos") <-- One substitution,one deletion
and one substitution
[1] 3

Note that this function IS case sensitive. If you want to apply this on
vectors of strings you'll have to write the
corresponding wrapper function.

Hope that helps,

Rodrigo Abt B,
Analyst,
Dept. Economic Studies,
SII, Chile.

```