allane at cybaea.com
Thu Jun 3 18:33:01 CEST 2010
Maybe something like the following will get you started:
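library(igraph)  # graph.data.frame() comes from the igraph package
g <- graph.data.frame(id, directed=FALSE)  # 'id' is assumed to be your two-column data frame of pairs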
There is perhaps a more efficient way, but I hope this helps a little.
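To spell the idea out a bit: if each pair is treated as an edge in an undirected graph, then all records belonging to one person form a connected component, so the indirect links (1 to 8 via 4, 1 to 17 via 18) fall out automatically. Below is a minimal sketch of that, not a tested solution for your data. The data frame 'pairs' and its column names are made up for illustration (your real pair list will look different), and I use the newer igraph names graph_from_data_frame() and components(), which I believe correspond to graph.data.frame() and clusters() in older releases.

```r
library(igraph)

# Hypothetical pair list: each row says two records belong to the same person.
# These values are invented for the example, not taken from the original data.
pairs <- data.frame(from = c(1, 4, 1, 17, 2),
                    to   = c(4, 8, 18, 18, 9))

# Build an undirected graph with one vertex per record and one edge per pair.
g <- graph_from_data_frame(pairs, directed = FALSE)

# Connected components: records reachable from each other through any chain of pairs.
comp <- components(g)

# One list element per person, holding the record ids (as character strings).
groups <- split(V(g)$name, comp$membership)
print(groups)
```

Records that never appear in any pair are not in the graph at all, so they would need to be added back as single-record individuals afterwards.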
On 03/06/10 14:14, Epi-schnier wrote:
> I am trying to de-duplicate a large (long) database (approx. 1 million records) of
> diagnostic tests. Individuals in the database can have up to 25
> observations, but most will have only one. IDs for de-duplication (names,
> sex, lab number...) are patchy. In a first step, I am using Andreas Borg's
> excellent record linkage package, which leaves me with a list of 'pairs'
> looking very much like this:
> where a pair means that the records belong to the same individual (e.g.,
> record 4 and record 8; 17 and 18...). My problem now is to get a list with
> all records that belong to the same person (in the example, observations
> 1,3,4,5,6,7,8,12, 17 and 18 are all from the same person). The problem is to
> find the link between 1 and 8 (only through 1 and 4 and 4 and 8) and the
> link between 1 and 17 (through 18). I can do it in my head, but I am missing
> the code that would work its way through too many records.
> Any clever ideas?
> (using R 2.10.1 on Windows XP)