[R] matching similar character strings

David Carlson dcarlson at tamu.edu
Fri Jun 21 16:25:32 CEST 2013


I think you have to assume there could be other coding errors as
well, such as misspellings or abbreviating Street as St. You
will probably need to use sub() to correct A2, and possibly A1, before
trying to merge. To figure out where the problems are, you might try
something like this. The last command lists the paste(A1, A2)
entries that do not match anything in paste(B1, B2).

> set.seed(42)
> a <- paste(sample(c(letters, LETTERS[1:5]), 150, replace=TRUE),
+   sample(c("St", "Rd", "Ave"), 150, replace=TRUE))
> b <- paste(sample(letters, 1000, replace=TRUE), 
+   sample(c("St", "Rd", "Ave"), 1000, replace=TRUE))
> (ua <- sort(unique(a)))
 [1] "a Ave" "a Rd"  "A Rd"  "a St"  "A St"  "b Rd"  "B Rd"  "b St"  "c Ave"
[10] "C Ave" "c Rd"  "C Rd"  "C St"  "D Ave" "D Rd"  "d St"  "D St"  "e Ave"
[19] "E Ave" "e Rd"  "E Rd"  "e St"  "E St"  "f Ave" "f Rd"  "g Ave" "g Rd"
[28] "g St"  "h Ave" "h Rd"  "h St"  "i Ave" "i Rd"  "i St"  "j Rd"  "j St"
[37] "k Ave" "k Rd"  "k St"  "l St"  "m Ave" "m Rd"  "m St"  "n Ave" "n St"
[46] "o Ave" "o Rd"  "o St"  "p St"  "q Ave" "q Rd"  "q St"  "r Ave" "r Rd"
[55] "r St"  "s Ave" "s Rd"  "s St"  "t Ave" "t Rd"  "t St"  "u Ave" "u Rd"
[64] "u St"  "v Ave" "v Rd"  "v St"  "w Ave" "w Rd"  "w St"  "x Rd"  "x St"
[73] "y Ave" "y Rd"  "z Ave" "z Rd"  "z St"
> (ub <- sort(unique(b)))
 [1] "a Ave" "a Rd"  "a St"  "b Ave" "b Rd"  "b St"  "c Ave" "c Rd"  "c St"
[10] "d Ave" "d Rd"  "d St"  "e Ave" "e Rd"  "e St"  "f Ave" "f Rd"  "f St"
[19] "g Ave" "g Rd"  "g St"  "h Ave" "h Rd"  "h St"  "i Ave" "i Rd"  "i St"
[28] "j Ave" "j Rd"  "j St"  "k Ave" "k Rd"  "k St"  "l Ave" "l Rd"  "l St"
[37] "m Ave" "m Rd"  "m St"  "n Ave" "n Rd"  "n St"  "o Ave" "o Rd"  "o St"
[46] "p Ave" "p Rd"  "p St"  "q Ave" "q Rd"  "q St"  "r Ave" "r Rd"  "r St"
[55] "s Ave" "s Rd"  "s St"  "t Ave" "t Rd"  "t St"  "u Ave" "u Rd"  "u St"
[64] "v Ave" "v Rd"  "v St"  "w Ave" "w Rd"  "w St"  "x Ave" "x Rd"  "x St"
[73] "y Ave" "y Rd"  "y St"  "z Ave" "z Rd"  "z St"
> ua[!(ua %in% ub)]
 [1] "A Rd"  "A St"  "B Rd"  "C Ave" "C Rd"  "C St"  "D Ave" "D Rd"  "D St"
[10] "E Ave" "E Rd"  "E St"
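
Once the unmatched entries are listed, the sub() cleanup mentioned
above could look roughly like the sketch below. The vectors and the
normalize() helper are made up for illustration, and the three
suffix patterns are only an assumption about which variants occur in
the real data.

```r
# Toy stand-ins for paste(A2, A1) and paste(B2, B1); the values are invented.
a <- c("Kennedy Street", "kennedy st.", "Lincoln Rd", "Grant Avnue")
b <- c("kennedy st", "lincoln rd", "grant ave")

# Hypothetical helper: trim, lower-case, then map suffix variants to one form.
normalize <- function(x) {
  x <- tolower(trimws(x))
  x <- sub("\\s+(street|st\\.?)$", " st", x, perl = TRUE)
  x <- sub("\\s+(road|rd\\.?)$", " rd", x, perl = TRUE)
  x <- sub("\\s+(avenue|ave\\.?)$", " ave", x, perl = TRUE)
  x
}

na <- normalize(a)
nb <- normalize(b)

# Entries of a that still have no counterpart in b after cleaning.
unmatched <- na[!(na %in% nb)]
print(unmatched)
```

Whatever is still listed after this step (here the misspelled
"avnue") has to be fixed by hand or with further patterns.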

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of A M Lavezzi
Sent: Friday, June 21, 2013 4:56 AM
To: r-help
Subject: [R] matching similar character strings

Hello everybody

I have this problem: I need to match an address database F1 with the
information contained in a toponymic database F2.

F1 has three columns and 800 rows, with the columns being:

A1. Street/Road/Avenue
A2. Name
A3. Number

Consider, for instance, Avenue J. Kennedy, 3011. In F1 this is:

A1. Avenue
A2. J. Kennedy
A3. 3011

The F2 file instead has 20000 rows and five columns:

B1. Street/Road/Avenue
B2. Name
B3. Starting Street Number
B4. Ending Street Number
B5. Census section

So my problem is assigning the B5 census section to every
observation of F1 when A1=B1, A2=B2, and A3 lies between B3 and B4.
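
A minimal sketch of this rule on toy data, assuming F1 and F2 are
data frames with the column names above, that the names already
agree exactly, and that the B3 to B4 ranges are inclusive and do not
overlap. All values and the helper column id are invented for
illustration.

```r
# Toy data; column names follow the description, values are invented.
F1 <- data.frame(A1 = c("Avenue", "Street", "Road"),
                 A2 = c("Kennedy John", "Lincoln Abraham", "Grant Ulysses"),
                 A3 = c(3011, 12, 500),
                 stringsAsFactors = FALSE)
F2 <- data.frame(B1 = c("Avenue", "Avenue", "Street", "Road"),
                 B2 = c("Kennedy John", "Kennedy John",
                        "Lincoln Abraham", "Grant Ulysses"),
                 B3 = c(1, 3001, 1, 1),
                 B4 = c(3000, 6000, 100, 99),
                 B5 = c("S01", "S02", "S03", "S04"),
                 stringsAsFactors = FALSE)

# Row id so that every F1 observation can be recovered after the join.
F1$id <- seq_len(nrow(F1))

# Pair each F1 row with every F2 segment of the same street type and name.
m <- merge(F1, F2, by.x = c("A1", "A2"), by.y = c("B1", "B2"))

# Keep only the segment whose number range contains A3.
hit <- m[m$A3 >= m$B3 & m$A3 <= m$B4, c("id", "B5")]

# Attach the census section; rows without a matching range get NA.
res <- merge(F1, hit, by = "id", all.x = TRUE)
res <- res[order(res$id), ]
print(res)
```

The rows left with NA in B5 are the ones that need the name cleaning
described next.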

The problem is that while the information in A2 is irregularly
recorded, B2 has a fixed format: family name (space) given name.

For example, B2 might contain:

Kennedy John

while A2 could contain any of:

John Kennedy
JF Kennedy
J. Kennedy

and so on.
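
One way to line these variants up is to reduce both columns to a
common key. The sketch below builds "family name plus first initial"
keys. The helpers key_a() and key_b() are hypothetical, and they
assume that every name has at least two words, that the family name
is the last word in A2 and the first word in B2, and that family
names are single words.

```r
A2 <- c("John Kennedy", "JF Kennedy", "J. Kennedy")
B2 <- "Kennedy John"

# Hypothetical key for A2: last word as family name, first letter as initial.
key_a <- function(x) {
  w <- strsplit(trimws(x), "\\s+")
  vapply(w, function(p) tolower(paste(p[length(p)], substr(p[1], 1, 1))),
         character(1))
}

# Hypothetical key for B2: first word as family name, initial of second word.
key_b <- function(x) {
  w <- strsplit(trimws(x), "\\s+")
  vapply(w, function(p) tolower(paste(p[1], substr(p[2], 1, 1))),
         character(1))
}

print(key_a(A2))
print(key_b(B2))

# TRUE where an A2 key has a counterpart among the B2 keys.
print(key_a(A2) %in% key_b(B2))
```

Such a key is deliberately coarse: two different people sharing a
family name and an initial would collide, so the street type and
number range still have to do the rest of the work.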

Thanks,

Mario

-- 
Andrea Mario Lavezzi
Dipartimento di Scienze Giuridiche, della Società e dello Sport
Sezione Diritto e Società
Università di Palermo
Piazza Bologni 8
90134 Palermo, Italy
tel. ++39 091 23892208
fax ++39 091 6111268
skype: lavezzimario
email: mario.lavezzi (at) unipa.it
web: http://www.unipa.it/~mario.lavezzi
