[R] Tuning string matching

adi@roda.ro adi at roda.ro
Thu Jan 6 10:32:07 CET 2005


Thank you all for your replies. Indeed, I used
http://finzi.psych.upenn.edu/nmz.html
as the search site, but I searched for "string match" instead of
"fuzzy string match".

Fuzzy matching seems to me a rather complicated matter, whereas my initial idea
for solving this problem was a bit simpler:
- compare all characters in both strings (a two-dimensional matrix of characters)
- if 90% (or any other percentage) of the characters in both strings are similar
(in terms of distances between each character from the first string and all
characters from the second string), then the two strings are declared a
match

I just found out that this algorithm is called the "Levenshtein distance", and I
know there is a PHP function called "levenshtein" (I thought it already might
have been implemented in R).
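
A minimal sketch of the "90% the same" idea in R, assuming a current version
of R in which the utils package provides adist() for generalized Levenshtein
distances. The helper names name_similarity and is_match are hypothetical, and
dividing the distance by the length of the longer string is only one possible
way to turn it into a percentage:

```r
# Hypothetical helper, not a base R function: similarity in [0, 1]
# derived from the Levenshtein distance returned by utils::adist().
name_similarity <- function(a, b) {
  d <- drop(utils::adist(a, b))        # edit distance between a and b
  longest <- max(nchar(a), nchar(b))   # assumption: normalize by longer string
  if (longest == 0) return(1)          # two empty strings are identical
  1 - d / longest
}

# TRUE when the similarity reaches the threshold (0.9 means "90% the same").
is_match <- function(a, b, threshold = 0.9) {
  name_similarity(a, b) >= threshold
}

name1 <- "Harry Harrington"
name2 <- "Harry Harington"
name_similarity(name1, name2)   # 1 - 1/16 = 0.9375
is_match(name1, name2)          # TRUE
is_match(name1, name2, 0.95)    # FALSE
```

The threshold argument plays the role of the 80% versus 90% setting: lowering
it accepts more distant pairs.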

For anyone who has a clue how to read this stuff:
http://ro.php.net/levenshtein

I tried to use agrep:
> agrep("Harry Harrington", "Harry Harrington")
[1] 1
> agrep("Harry Harrington", "Harrington Harry")
numeric(0)

So it seems not to be what I'm looking for (I'll try harder with edit distance,
though)
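
The second call returns nothing because agrep only tolerates a small number of
edits (max.distance defaults to 10% of the pattern length), and swapping two
words costs far more than that. One possible workaround, sketched here under
the assumption that word order carries no meaning in these names, is to sort
the words before comparing. The helper normalize_name is hypothetical:

```r
# Hypothetical helper: lower-case, split on whitespace, sort the words and
# paste them back, so that word order no longer matters.
normalize_name <- function(x) {
  words <- strsplit(tolower(trimws(x)), "[[:space:]]+")[[1]]
  paste(sort(words), collapse = " ")
}

a <- normalize_name("Harry Harrington")
b <- normalize_name("Harrington Harry")
identical(a, b)   # TRUE

# With word order normalized, agrep only has to absorb the remaining typo.
typo <- normalize_name("Harington Harry")
agrep(typo, a, max.distance = 0.1)
```

Sorting words would also merge names that differ only in word order, so it is
a trade-off rather than a general fix.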

Best regards,
Adrian


Quoting Jonathan Baron <baron at psych.upenn.edu>:

> Sorry for joining late, but I wanted to see if my search page
> could help.  (I don't know which search archive you looked at.)
> I entered
> fuzzy string match*
> and got a few things that look relevant, including the agrep
> function.
>
> As for the second part of the question, that seems to be a coding
> problem that is dependent on the current form of your data.
> Write me off the list and I'll send you an R script I use for
> similar things (making PayPal payments).
>
> Jon
>
>  Dear list,
>
>  I spent about two hours searching the message archive, to no avail.
>  I have a list of people who have to pass an on-line test, but only a
>  fraction of them do it. Moreover, as they input their names, the
>  resulting strings do not always match the names I have in my database.
>
>  I would like to do two things:
>
>  1. Match any strings that are 90% the same
>  Example:
>  name1 <- "Harry Harrington"
>  name2 <- "Harry Harington"
>  I need a function that would declare those strings as a match (ideally
>  having an argument that would allow introducing 80% instead of 90%)
>
> --
> Jonathan Baron, Professor of Psychology, University of Pennsylvania
> Home page: http://www.sas.upenn.edu/~baron
> R search page: http://finzi.psych.upenn.edu/
>



