[R] Tuning string matching

Stephen Upton supton at referentia.com
Thu Jan 6 13:01:35 CET 2005


Adrian,

As an exercise, I took the pseudocode on the wiki pages for the Levenshtein
distance and programmed it in R. The code is below. I tested it for just 2
strings, so I'm not claiming that it *really* works, but it seems to. As you
can see, I didn't add any error checking, and there is likely some cool R
shortcuts that could be added. 

As to your problem, I'd also suggest that you might want to apply the below
function to possible combinations of words rather than attempting to apply
the function to a complete name; that should alleviate the "first.name
last.name", "last.name first.name" problem. 

> levenshtein.distance("Harrington","Harington")
[1] 1

> levenshtein.distance("Harrington Harry","Harry Harington")
[1] 11

HTH
Steve


levenshtein.distance <- function(string.1, string.2, subst.cost=1) {
    c1 <- strsplit(string.1,split="")[[1]]
    c2 <- strsplit(string.2,split="")[[1]]
    n <- length(c1)
    m <- length(c2)

    d <- array(0,dim=c(n+1,m+1))
 
    d[,1] <- 1:(n+1)
    d[1,] <- 1:(m+1)
    d[1,1] <- 0		


    for (i in 2:(n+1)) {

        for (j in 2:(m+1)) {

            if (c1[i-1] == c2[j-1]) cost <- 0 else cost <- subst.cost

            d[i,j] <- min(d[i-1,j] + 1,    # insertion
                          d[i,j-1] + 1,    # deletion
                          d[i-1,j-1] + cost) # substitution
	}
    }

    d[n+1,m+1]

}

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-
> bounces at stat.math.ethz.ch] On Behalf Of adi at roda.ro
> Sent: Thursday, January 06, 2005 4:32 AM
> To: Jonathan Baron
> Cc: McGehee, Robert; bogdan romocea; r-help at stat.math.ethz.ch
> Subject: Re: [R] Tuning string matching
> 
> 
> Thank you all for your replies. Indeed I used:
> http://finzi.psych.upenn.edu/nmz.html
> as a search site. I used "string match", instead of "fuzzy string match".
> 
> Fuzzy matching seems to me a rather complicated matter, whereas my initial
> idea
> about solving this problem was a bit simpler:
> - check all characters in both strings (a 2 dimensional matrix of
> characters)
> - if 90% (or any other percent) of the characters in both strings are
> similar
> (in terms of distances between each character from the first string to all
> characters from the second string), then the two strings will be declared
> as a
> match
> 
> I just found out that this algorithm is called the "Levenshtein distance",
> and I
> know there is a PHP function called "levenshtein" (I thought it already
> might
> have been implemented in R).
> 
> For anyone that have a clue on how to read this stuff:
> http://ro.php.net/levenshtein
> 
> I tried to use agrep:
> > agrep("Harry Harrington", "Harry Harrington")
> [1] 1
> > agrep("Harry Harrington", "Harrington Harry")
> numeric(0)
> 
> So it seems not to be what I'm looking for (I'll try harder with edit
> distance,
> though)
> 
> Best regards,
> Adrian
> 
> 
> Quoting Jonathan Baron <baron at psych.upenn.edu>:
> 
> > Sorry for joining late, but I wanted to see if my search page
> > could help.  (I don't know which search archive you looked at.)
> > I entered
> > fuzzy string match*
> > and got a few things that look relevant, including the agrep
> > function.
> >
> > As for the second part of the question, that seems to be a coding
> > problem that is dependent on the current form of your data.
> > Write me off the list and I'll send you an R script I use for
> > similar things (making PayPal payments).
> >
> > Jon
> >
> >  Dear list,
> >
> >  I spent about two hours searching on the message archive, with no
> >  avail.
> >  I have a list of people that have to pass an on-line test, but only a
> >  fraction
> >  of them do it. Moreover, as they input their names, the resulting
> >  string do not
> >  always match the names I have in my database.
> >
> >  I would like to do two things:
> >
> >  1. Match any strings that are 90% the same
> >  Example:
> >  name1 <- "Harry Harrington"
> >  name2 <- "Harry Harington"
> >  I need a function that would declare those strings as a match (ideally
> >  having an
> >  argument that would allow introducing 80% instead of 90%)
> >
> > --
> > Jonathan Baron, Professor of Psychology, University of Pennsylvania
> > Home page: http://www.sas.upenn.edu/~baron
> > R search page: http://finzi.psych.upenn.edu/
> >
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-
> guide.html




More information about the R-help mailing list