[R] Tuning string matching

McGehee, Robert Robert.McGehee at geodecapital.com
Wed Jan 5 20:36:12 CET 2005


It sounds like what you want is a rudimentary spell-checker whose "word"
is the input name, and whose "dictionary" is an array of your database
names. Spell checking rules are designed to find missing repeats,
transposed letters, extra letters... precisely the reasons you're not
matching your names to your database.

Anyway, as I don't believe R has something like this, what I would do is
simply adapt one of the dozens of Perl or C spell checkers (such as
Aspell / Ispell) to fit your needs, then invoke it as a script from R
using the "system" call, passing in the student name and your database
of names. And as R can use Perl-like regular expressions (?regexpr), you
could (if you really wanted to!) rewrite this in R after the fact,
although this would likely be a waste of time, since expression matching
is what Perl is so good at.
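
One caveat to the above: base R does ship an approximate matcher,
agrep(), so it may be worth trying before reaching for an external spell
checker. A minimal sketch, assuming the database names sit in a character
vector (the vector "db" below is made up for illustration):

```r
# Hypothetical database of names, for illustration only.
db <- c("Byron C. Andrew", "Friedman Bob", "Harry Harrington")

# agrep() does approximate (edit-distance based) matching of a pattern
# against each element of a character vector. max.distance = 0.1 allows
# edits amounting to roughly 10% of the pattern length.
hits <- agrep("Harry Harington", db, max.distance = 0.1, value = TRUE)
print(hits)
```

Note that agrep() looks for the pattern approximately *within* each
string, so it does not handle swapped word order ("Harrington Harry" vs.
"Harry Harrington") on its own.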

You'll also need to think about what this percentage argument is. It's
not obvious to me what percentage of closeness "Robert" and "Robret" are
vs. "Robert" and "RobQQto".
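
One possible definition, offered only as a sketch: take the Levenshtein
edit distance and scale it by the length of the longer string. This
assumes a version of R that provides adist() (in the utils package); the
function names name_similarity() and is_match() are made up here.

```r
# Similarity in [0, 1]: 1 minus the edit distance divided by the length
# of the longer string. This is one convention among several.
name_similarity <- function(a, b) {
  d <- drop(adist(a, b))            # generalized Levenshtein distance
  1 - d / max(nchar(a), nchar(b))
}

# Declare a match when the similarity reaches the chosen threshold.
is_match <- function(a, b, threshold = 0.9) {
  name_similarity(a, b) >= threshold
}

print(name_similarity("Robert", "Robret"))
print(name_similarity("Robert", "RobQQto"))
print(is_match("Harry Harrington", "Harry Harington"))
```

Under this definition a transposition counts as two edits, so "Robret"
is penalised more than intuition might suggest; a distance that treats
transpositions as a single edit would rank it closer.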

ex: http://tomacorp.com/perl/lingua/style.html
http://aspell.sourceforge.net/

Robert

-----Original Message-----
From: adi at roda.ro [mailto:adi at roda.ro] 
Sent: Wednesday, January 05, 2005 12:36 PM
To: r-help at stat.math.ethz.ch
Subject: [R] Tuning string matching


Dear list,

I spent about two hours searching the message archive, to no avail.
I have a list of people who have to pass an on-line test, but only a
fraction of them do it. Moreover, as they input their names themselves,
the resulting strings do not always match the names I have in my
database.

I would like to do two things:

1. Match any strings that are 90% the same
Example:
name1 <- "Harry Harrington"
name2 <- "Harry Harington"
I need a function that would declare those strings as a match (ideally
having an argument that would allow introducing 80% instead of 90%)

2. Arrange a final table that would take me from:

Table1 (the complete list of people from my database)
No Name
1  Byron C. Andrew
2  Friedman Bob
3  Harrington Harry

Table2 (the people having been tested)
No Name               Score
1  Harry Harington    13
2  Byron Andrew       28

to:

No Name1              Name2              Score
1  Byron C. Andrew    Byron Andrew       28
2  Friedman Bob
3  Harrington Harry   Harry Harington    13
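
A rough sketch of how such a table could be built in R, under several
assumptions: adist() (utils package) is available, similarity is defined
as 1 minus the edit distance over the longer string's length, and names
are compared after lower-casing, dropping punctuation and sorting the
words. The helper normalise_name() and the 0.8 threshold are
illustrative choices, not an established method.

```r
# Hypothetical helper: lower-case, strip punctuation and sort the words,
# so that word order ("Harrington Harry") does not affect the comparison.
normalise_name <- function(x) {
  x <- tolower(gsub("[[:punct:]]", "", x))
  vapply(strsplit(x, "[[:space:]]+"),
         function(w) paste(sort(w), collapse = " "),
         character(1))
}

table1 <- data.frame(No = 1:3,
                     Name = c("Byron C. Andrew", "Friedman Bob",
                              "Harrington Harry"),
                     stringsAsFactors = FALSE)
table2 <- data.frame(No = 1:2,
                     Name = c("Harry Harington", "Byron Andrew"),
                     Score = c(13, 28),
                     stringsAsFactors = FALSE)

k1 <- normalise_name(table1$Name)
k2 <- normalise_name(table2$Name)

# Pairwise similarity: rows are table1 names, columns are table2 names.
sim <- 1 - adist(k1, k2) / outer(nchar(k1), nchar(k2), pmax)

# For each database name pick the closest tested name, and keep it only
# if it reaches the threshold; otherwise leave the row unmatched (NA).
threshold <- 0.8
best <- apply(sim, 1, which.max)
best[sim[cbind(seq_along(best), best)] < threshold] <- NA

result <- data.frame(No = table1$No,
                     Name1 = table1$Name,
                     Name2 = table2$Name[best],
                     Score = table2$Score[best],
                     stringsAsFactors = FALSE)
print(result)
```

Unmatched rows show NA rather than blanks. The sketch does not stop two
database names from claiming the same tested name, so borderline matches
would still need checking by eye.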

Thank you in advance, any help is highly appreciated.
Adrian

