[R] deduplication

Epi-schnier christian.schnier at nhs.net
Thu Jun 3 15:14:32 CEST 2010


I am trying to de-duplicate a large (long) database (approx. 1 million records) of
diagnostic tests. Individuals in the database can have up to 25
observations, but most will have only one. The identifiers available for
de-duplication (names, sex, lab number...) are patchy. As a first step, I am
using Andreas Borg's excellent record linkage package, which leaves me with a list of 'pairs'
looking very much like this:
where a pair means that the records belong to the same individual (e.g.,
record 4 and record 8; 17 and 18...). My problem now is to get a list with
all records that belong to the same person (in the example, observations
1, 3, 4, 5, 6, 7, 8, 12, 17 and 18 are all from the same person). The difficulty
is to find the indirect links: 1 and 8 are linked only through the pairs 1-4
and 4-8, and 1 and 17 only through 18. I can do it in my head, but I am missing
the code that would work its way through this many records.

Any clever ideas?
(using R 2.10.1 on Windows XP)
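
To make the task concrete: if each record is treated as a node and each pair as an edge, the groups being asked for are the connected components of that graph. Below is a minimal base-R sketch of one way to compute them with a simple union-find. The function name link_groups, its arguments, and the example pairs are all made up for illustration; they are not the data from this post and not part of the record linkage package.

```r
# Group records into individuals, given two parallel vectors of linked records.
# 'link_groups', 'id1' and 'id2' are hypothetical names used only in this sketch.
link_groups <- function(id1, id2) {
  ids <- sort(unique(c(id1, id2)))
  a <- match(id1, ids)          # recode record numbers as positions 1..n
  b <- match(id2, ids)
  parent <- seq_along(ids)      # every record starts as its own group

  # Follow parent pointers up to the representative of the group.
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  # Each pair merges the groups of its two records.
  for (k in seq_along(a)) {
    ra <- find_root(a[k])
    rb <- find_root(b[k])
    if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
  }

  roots <- vapply(seq_along(ids), find_root, numeric(1))
  data.frame(record = ids, person = match(roots, unique(roots)))
}

# Made-up pairs for illustration only (not the pairs from the post).
pairs <- data.frame(id1 = c(1, 4, 17, 1, 2), id2 = c(4, 8, 18, 18, 9))
groups <- link_groups(pairs$id1, pairs$id2)
split(groups$record, groups$person)   # one vector of record numbers per person
```

Records that never appear in any pair are not in the output and can be treated as singletons. With around a million records, a plain R loop like this may be slow, so a graph package's connected-components routine is probably the more practical choice; the sketch is only meant to show the logic.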


