[R] Find "undirected" duplicates in a tibble

Kimmo Elo kimmo.elo at utu.fi
Fri Aug 20 10:59:34 CEST 2021


I am working with a large network dataset consisting of source-target
pairs stored in a tibble. Now I need to transform the directed dataset
into an undirected one. This means I need to keep only one
instance of each pair with the same "nodes". In other words, if my data
has one row with A (source) and B (target) and one with B (source) and
A (target), only the pair A-B should be kept.

Here is an example of how I have solved this problem so far:

--- snip ---

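# Load the packages that provide tibble() and the dplyr verbs
library(tibble)
library(dplyr)

# Create some data
x <- tibble(Source = rep(1:3, 4),
            Target = c(rep(1, 3), rep(2, 3), rep(3, 3), rep(4, 3)))
x	# print original data

# Remove "undirected" duplicates
x <- x %>%
  # Build an order-independent key, e.g. "1-2" for both 1->2 and 2->1
  mutate(pair = mapply(function(x, y) paste0(sort(c(x, y)), collapse = "-"),
                       Source, Target)) %>%
  distinct(pair, .keep_all = TRUE) %>%
  # Split the key back into its two nodes (these come back as character)
  mutate(Source = sapply(pair, function(x) unlist(strsplit(x, split = "-"))[1]),
         Target = sapply(pair, function(x) unlist(strsplit(x, split = "-"))[2])) %>%
  select(-pair)

x	# print cleaned data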

--- snip ---

The good thing about my own solution is that it also allows the
creation of weighted pairs. One just needs to replace the distinct()
call with 'count(pair)'.
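
To make the weighted variant concrete, here is a minimal sketch using
the same toy data as above. The object name 'weighted' and the column
name 'weight' are only illustrative choices, and the 'name' argument of
count() assumes a reasonably recent dplyr version.

```r
library(tibble)
library(dplyr)

# Same toy data as in the example above
x <- tibble(Source = rep(1:3, 4),
            Target = c(rep(1, 3), rep(2, 3), rep(3, 3), rep(4, 3)))

weighted <- x %>%
  # Order-independent key: 1->2 and 2->1 both become "1-2"
  mutate(pair = mapply(function(s, t) paste0(sort(c(s, t)), collapse = "-"),
                       Source, Target)) %>%
  # Count how often each undirected pair occurs instead of dropping duplicates
  count(pair, name = "weight") %>%
  # Recover the two nodes from the key (as character strings)
  mutate(Source = sapply(strsplit(pair, "-"), `[`, 1),
         Target = sapply(strsplit(pair, "-"), `[`, 2)) %>%
  select(Source, Target, weight)

weighted	# one row per undirected pair, plus its count
```

Each undirected pair then appears once, with 'weight' giving the number
of directed rows that were collapsed into it.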

I have done a lot of searching but have not found any function
providing this functionality. Does someone know of an alternative,
maybe a more efficient function/solution?


Kimmo Elo

Dr. Kimmo Elo
Senior researcher in European Studies
University of Turku
Centre for Parliamentary Studies
E-mail: kimmo.elo using utu.fi
