[R] partial match for two datasets

David Winsemius dwinsemius at comcast.net
Wed Dec 9 04:23:05 CET 2009


On Dec 8, 2009, at 8:46 PM, Lynn Wang wrote:

>
>
> Hi all,
>
> I have two sets:
>
> dig<-c("DAVID ADAMS","PIERS AKERMAN","SHERYLE BAGWELL",
>        "JULIAN BAJKOWSKI","CANDIDA BAKER")
>
> import<-c("by DAVID ADAMS","piersAKERMAN","SHERYLE BagWEL",
>           "JULIAN BAJKOWSKI with ","Cand BAKER","smith green")
>
>
> I want to get the following result from "import" after comparing the  
> two sets
>
> result<-c("by DAVID ADAMS","piersAKERMAN","JULIAN BAJKOWSKI with ")

 > sapply(dig, function(x) grep(x, import) ) >0
     DAVID ADAMS    PIERS AKERMAN  SHERYLE BAGWELL JULIAN BAJKOWSKI    CANDIDA BAKER
            TRUE               NA               NA             TRUE               NA

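Not exactly. grep() is case-sensitive and needs the whole name to
appear verbatim, so a more flexible partial-match function is needed.
Unfortunately the Levenshtein function in MiscPsycho (stringMatch) is
not vectorized, so it has to be applied to each pair in turn: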


 > library(MiscPsycho)
 > dig <- c("DAVID ADAMS", "PIERS AKERMAN", "SHERYLE BAGWELL",
 +          "JULIAN BAJKOWSKI", "CANDIDA BAKER")
 > import <- c("by DAVID ADAMS", "piersAKERMAN", "SHERYLE BagWEL",
 +             "JULIAN BAJKOWSKI with ", "Cand BAKER", "smith green")
 > # every (dig, import) combination, one row per pair
 > word.pairs <- expand.grid(dig, import)
 > wordpairs <- lapply(word.pairs, as.character)
 > wp2 <- data.frame(dig = wordpairs[[1]], import = wordpairs[[2]],
 +                   stringsAsFactors = FALSE)
 > # stringMatch() takes one pair at a time, so apply it row by row
 > wp2$distnc <- apply(wp2, 1, function(x) stringMatch(x[1], x[2]))
 > wp2[wp2$distnc > .7, ]
                 dig                 import    distnc
1       DAVID ADAMS         by DAVID ADAMS 0.7142857
7     PIERS AKERMAN           piersAKERMAN 0.9230769
13  SHERYLE BAGWELL         SHERYLE BagWEL 0.9333333
19 JULIAN BAJKOWSKI JULIAN BAJKOWSKI with  0.7272727
25    CANDIDA BAKER             Cand BAKER 0.7692308


(I think you missed a couple of obvious matches that ought to be in  
the list)
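
If MiscPsycho is not at hand, the same pairwise comparison can be
sketched with adist() from the utils package in current R releases,
which computes the whole edit-distance matrix in one call. This is only
a rough sketch: the scaling by the longer string and the 0.7 cutoff are
assumptions of mine and need not agree with what stringMatch() returns.

```r
dig <- c("DAVID ADAMS", "PIERS AKERMAN", "SHERYLE BAGWELL",
         "JULIAN BAJKOWSKI", "CANDIDA BAKER")
import <- c("by DAVID ADAMS", "piersAKERMAN", "SHERYLE BagWEL",
            "JULIAN BAJKOWSKI with ", "Cand BAKER", "smith green")

# Edit-distance matrix: rows follow dig, columns follow import.
d <- adist(dig, import, ignore.case = TRUE)

# Scale each distance by the longer string of the pair
# (an assumed normalisation, not necessarily stringMatch()'s).
longest <- outer(nchar(dig), nchar(import), pmax)
sim <- 1 - d / longest

# For each import entry, the closest dig name and its similarity.
best <- apply(sim, 2, which.max)
closest <- data.frame(import = import,
                      dig    = dig[best],
                      sim    = sim[cbind(best, seq_along(import))],
                      stringsAsFactors = FALSE)

# Keep the import entries above the (arbitrary) cutoff.
matched <- closest[closest$sim > 0.7, ]
matched
```

Taking the best row per import entry avoids building the long
expand.grid() table, at the cost of reporting only one candidate per
entry.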

-- 
David

>
>
> I created a "partialmatch" function as follows, but cannot get the
> right result.
>
> partialmatch<- function(x, y)
>   as.vector(y[regexpr(as.character(x), as.character(y),
>                       ignore.case = TRUE)>0])
>
> result<-partialmatch(dig,import)
>
>
> [1] "by DAVID ADAMS"
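
Only one element comes back because regexpr() uses just the first
element of a vector pattern, so only "DAVID ADAMS" is ever searched
for. A possible sketch that loops over all the names and ignores case
and spacing follows; the names squash() and partialmatch2() are made up
for illustration, and dropping all whitespace before comparing is my
assumption about what should count as a match.

```r
dig <- c("DAVID ADAMS", "PIERS AKERMAN", "SHERYLE BAGWELL",
         "JULIAN BAJKOWSKI", "CANDIDA BAKER")
import <- c("by DAVID ADAMS", "piersAKERMAN", "SHERYLE BagWEL",
            "JULIAN BAJKOWSKI with ", "Cand BAKER", "smith green")

# Hypothetical helper: upper-case and drop all whitespace.
squash <- function(s) toupper(gsub("[[:space:]]+", "", s))

partialmatch2 <- function(x, y) {
  nx <- squash(x)
  ny <- squash(y)
  # One column per name in x, one row per entry in y;
  # fixed = TRUE treats each name as a literal string, not a regex.
  hits <- vapply(nx, function(p) grepl(p, ny, fixed = TRUE),
                 logical(length(ny)))
  hits <- matrix(hits, nrow = length(ny))
  # Keep the entries of y that contain at least one name from x.
  y[rowSums(hits) > 0]
}

result <- partialmatch2(dig, import)
result
```

On the data above this keeps the three entries listed in the requested
result. It still requires the whole name to be present, so "SHERYLE
BagWEL" and "Cand BAKER" are dropped.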
>
>
>
> Thanks,
>
> lynn
>
>
>       
>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



