[R] Is there a package that can do Fuzzy name matching to standardize names in a single column

Ashim Kapoor <ashimkapoor using gmail.com>
Thu Jun 16 04:06:45 CEST 2022


Dear Gregg,

This is what I meant:

> df1
             Names
1        John Good
2      Joe Jackson
3    Bob A. Barker
4     John B. Good
5   Joe J. Jackson
6 Bob Allen Barker
7        John Good
8 Joe Jack Johnson
9       Bob Barker

> stringdist_left_join(df1,df1,by="Names",max_dist = 3)
            Names.x          Names.y
1         John Good        John Good
2         John Good     John B. Good
3         John Good        John Good
4       Joe Jackson      Joe Jackson
5       Joe Jackson   Joe J. Jackson
6     Bob A. Barker    Bob A. Barker
7     Bob A. Barker       Bob Barker
8      John B. Good        John Good
9      John B. Good     John B. Good
10     John B. Good        John Good
11   Joe J. Jackson      Joe Jackson
12   Joe J. Jackson   Joe J. Jackson
13 Bob Allen Barker Bob Allen Barker
14        John Good        John Good
15        John Good     John B. Good
16        John Good        John Good
17 Joe Jack Johnson Joe Jack Johnson
18       Bob Barker    Bob A. Barker
19       Bob Barker       Bob Barker
>


You can join a table to itself while tinkering with the max_dist argument.
Notice the clusters that have formed. These still have to be cleaned up.

This is similar to the answer by Jan van der Laan.
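
One possible way to clean up those clusters is sketched below. It is only an illustration under a few assumptions: it uses base R's adist() (Levenshtein distance) instead of the string distance that fuzzyjoin uses, so results may differ on other data; it treats every chain of spellings within max_dist of each other as one person (single linkage); and it picks the most frequent spelling in each cluster as the standard one. The column names Cluster and Standardized and the helper most_common() are made up for this example.

```r
# Example names taken from df1 above.
df1 <- data.frame(
  Names = c("John Good", "Joe Jackson", "Bob A. Barker",
            "John B. Good", "Joe J. Jackson", "Bob Allen Barker",
            "John Good", "Joe Jack Johnson", "Bob Barker"),
  stringsAsFactors = FALSE
)

max_dist <- 3  # same threshold as in the self-join above

# Pairwise edit distances between the distinct spellings (base R, utils).
spellings <- unique(df1$Names)
d <- adist(spellings)

# Single-linkage clustering cut at max_dist: two spellings share a cluster
# if they are connected by a chain of matches with distance <= max_dist.
tree <- hclust(as.dist(d), method = "single")
cluster <- unname(cutree(tree, h = max_dist))

# Map each row back to the cluster of its spelling.
df1$Cluster <- cluster[match(df1$Names, spellings)]

# Hypothetical helper: the most frequent spelling within a cluster.
# When counts are tied, the first entry in table() order is used.
most_common <- function(x) names(which.max(table(x)))

df1$Standardized <- ave(df1$Names, df1$Cluster, FUN = most_common)

print(df1)
```

With max_dist = 3, "Bob Allen Barker" and "Joe Jack Johnson" stay in clusters of their own, just as in the join output, so a larger threshold or manual review would still be needed for those. Tied clusters (for example "Joe Jackson" and "Joe J. Jackson", one occurrence each) are resolved arbitrarily by this sketch.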

Best Regards,
Ashim

On Wed, Jun 15, 2022 at 9:13 PM Gregg Powell <g.a.powell using protonmail.com> wrote:
>
>
> Hello Ashim and kind regards for you taking the time to answer back.
>
>
> > library(fuzzyjoin)
> > ?stringdist_left_join
>
> -this will join two tables, but what I am trying to do is just standardize the similarly spelled duplicate names in just the first column of a single table.
>
> I don't think fuzzyjoin will help me in that regard.
>
> Thanks.
> Gregg
> Arizona, USA
>
> ------- Original Message -------
> On Wednesday, June 15th, 2022 at 8:04 AM, Ashim Kapoor <ashimkapoor using gmail.com> wrote:
>
> > Dear Gregg,
> >
> > Check this out:
> >
> > library(fuzzyjoin)
> > ?stringdist_left_join
> >
> > Best Regards,
> > Ashim
> >
> > On Wed, Jun 15, 2022 at 8:28 PM Gregg Powell via R-help
> > <r-help using r-project.org> wrote:
> > >
> > > Have data sets where there are names, in the first column, client names in the second, and Client start date in the third.
> > >
> > > There are thousands of these records with thousands of names/clients/client start dates. The name is entered each time the person begins with a new client such that each person has many entries in the name column. Often the names were not entered in a consistent way. With and without middle initial, middle name, or various abbreviations such as ",RN" at the end of the name.
> > >
> > > Is there a package that can do fuzzy name matching so that the names in name column get replaced with a "standardized" format - where some type of machine learning can pick the most common spelling of each repeat name and replace the different variations with the common spelling?
> > >
> > > I included an example below. First table includes the names with the various spellings. Second table depicts what I hope to achieve.
> > >
> > > Again - this is on a large scale - there are something like 10,000 records with names that need to be standardized.
> > >
> > > Name              Client    Client Start Date
> > > John Good         Client 1  1/1/2020
> > > Joe Jackson       Client 2  6/1/2020
> > > Bob A. Barker     Client 3  8/1/2020
> > > John B. Good      Client 4  10/1/2020
> > > Joe J. Jackson    Client 5  12/1/2020
> > > Bob Allen Barker  Client 6  1/1/2021
> > > John Good         Client 7  5/1/2021
> > > Joe Jack Jackson  Client 8  8/1/2021
> > > Bob Barker        Client 9  12/1/2021
> > >
> > > Name              Client    Client Start Date
> > > John Good         Client 1  1/1/2020
> > > Joe J. Jackson    Client 2  6/1/2020
> > > Bob A. Barker     Client 3  8/1/2020
> > > John Good         Client 4  10/1/2020
> > > Joe J. Jackson    Client 5  12/1/2020
> > > Bob A. Barker     Client 6  1/1/2021
> > > John Good         Client 7  5/1/2021
> > > Joe J. Jackson    Client 8  8/1/2021
> > > Bob A. Barker     Client 9  12/1/2021
> > >
> > > THANKS!
> > >
> > > Gregg Powell
> > > Arizona, USA
> > >
> > > ______________________________________________
> > > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.


