[R] Is there a package that can do Fuzzy name matching to standardize names in a single column

Jan van der Laan
Wed Jun 15 21:31:08 CEST 2022


The reclin2 package (by me) also has functionality for finding
duplicate records, so it should be able to find names that are likely
to be the same. Here is a vignette with an example in which different
variants of town names are clustered:
https://cran.r-project.org/web/packages/reclin2/vignettes/deduplication.html
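
To give an idea of what this kind of clustering does, here is a minimal
sketch in base R. It does not use reclin2 or its API, and it is not how
reclin2 works internally; it uses adist (edit distance) and
single-linkage hclust as a stand-in. The cutoff of 4.5 edits and the
helper most_common are assumptions chosen for this small example and
would need tuning on real data.

```r
# Names taken from the example table in the original question.
name <- c("John Good", "Joe Jackson", "Bob A. Barker",
          "John B. Good", "Joe J. Jackson", "Bob Allen Barker",
          "John Good", "Joe Jack Jackson", "Bob Barker")

# Pairwise Levenshtein (edit) distances between all names.
d <- adist(name)

# Single-linkage clustering: names end up in the same group when they
# are connected by a chain of pairs that are each close to one another.
tree <- hclust(as.dist(d), method = "single")

# Assumed cutoff: pairs that are more than 4.5 edits apart are not linked.
group <- cutree(tree, h = 4.5)

# Hypothetical helper: the most frequent spelling within a group.
# When several spellings are equally frequent, which.max takes the first
# one in the sort order of table(), so ties are resolved arbitrarily.
most_common <- function(x) names(which.max(table(x)))

# Look up the chosen spelling of each record's group.
canonical <- tapply(name, group, most_common)
standard <- unname(canonical[as.character(group)])

print(data.frame(name, group, standard))
```

With so few records most groups have no clear majority spelling, so in
practice you would probably want an explicit rule for ties.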

In some of the cases where I used this, the quality of the matches
improved immensely after a good deal of manual preprocessing of the
names. Most of it was done with regular expressions, for example using
gsub: removing accents and hyphens, and replacing common variants of a
name with the most common one.
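
As a rough illustration of that kind of preprocessing, something along
these lines could be a starting point. The function name clean_names
and the specific rules (dropping a trailing "RN", mapping ROBERT to
BOB) are made up for this example; the real rules depend entirely on
your data.

```r
# Hypothetical helper: normalise a character vector of names before
# matching. The rules are illustrative only.
clean_names <- function(x) {
  x <- toupper(x)
  # Drop a trailing credential such as ",RN" or " RN". A comma or a
  # space is required so that names ending in "rn" are left alone.
  x <- gsub("(,\\s*|\\s+)RN$", "", x, perl = TRUE)
  # Replace hyphens by spaces.
  x <- gsub("-", " ", x, fixed = TRUE)
  # Replace a variant of a name by the more common form.
  x <- gsub("\\bROBERT\\b", "BOB", x, perl = TRUE)
  # Remove the periods after initials.
  x <- gsub(".", "", x, fixed = TRUE)
  # Accents are not handled here. iconv(x, to = "ASCII//TRANSLIT") is
  # one option, but its result depends on the platform.
  # Collapse repeated whitespace and trim the ends.
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

raw <- c("John B. Good", "Bob Barker,RN", "Joe  J. Jackson",
         "Robert Barker", "Anne-Marie Smith")
cleaned <- clean_names(raw)
print(cleaned)
```

Running the fuzzy matching on the cleaned names rather than the raw
ones means that the string distance only has to deal with the
differences that are left.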

HTH,

Jan




On 15-06-2022 16:57, Gregg Powell via R-help wrote:
> I have data sets with names in the first column, client names in the second, and client start dates in the third.
>
> There are thousands of these records with thousands of names/clients/client start dates. The name is entered each time the person begins with a new client, so each person has many entries in the name column. Often the names were not entered in a consistent way: with and without a middle initial or middle name, or with various abbreviations such as ",RN" at the end of the name.
>
> Is there a package that can do fuzzy name matching so that the names in name column get replaced with a "standardized" format - where some type of machine learning can pick the most common spelling of each repeat name and replace the different variations with the common spelling?
>
> I included an example below. First table includes the names with the various spellings. Second table depicts what I hope to achieve.
>
> Again - this is on a large scale - there are something like 10,000 records with names that need to be standardized.
>
>
> Name              Client    Client Start Date
> John Good         Client 1  1/1/2020
> Joe Jackson       Client 2  6/1/2020
> Bob A. Barker     Client 3  8/1/2020
> John B. Good      Client 4  10/1/2020
> Joe J. Jackson    Client 5  12/1/2020
> Bob Allen Barker  Client 6  1/1/2021
> John Good         Client 7  5/1/2021
> Joe Jack Jackson  Client 8  8/1/2021
> Bob Barker        Client 9  12/1/2021
>
> Name              Client    Client Start Date
> John Good         Client 1  1/1/2020
> Joe J. Jackson    Client 2  6/1/2020
> Bob A. Barker     Client 3  8/1/2020
> John Good         Client 4  10/1/2020
> Joe J. Jackson    Client 5  12/1/2020
> Bob A. Barker     Client 6  1/1/2021
> John Good         Client 7  5/1/2021
> Joe J. Jackson    Client 8  8/1/2021
> Bob A. Barker     Client 9  12/1/2021
>
>
>
> THANKS!
>
> Gregg Powell
>
> Arizona, USA
>


