[R] Find "undirected" duplicates in a tibble

Greg Minshall m|n@h@|| @end|ng |rom um|ch@edu
Fri Aug 20 21:39:08 CEST 2021


Bert,

> The efficiency gains are due to vectorization and the use of more
> efficient primitives. None of this may matter of course, but it seemed
> worth mentioning.

thanks very much!  the varieties of code, and disparities of
performance, are truly wonderful.

Rui's point that what works better for small n is not necessarily what
will work better for large n is important to keep in [my] mind.

as a "so-far" summary, here are some timings.  the relevant code is below.
----
my apply
   user  system elapsed 
  8.397   0.124   8.531 
Bert's !duplicated
   user  system elapsed 
  2.367   0.000   2.370 
Bert's x[,2]>x[,1]
   user  system elapsed 
  1.052   0.000   1.054 
my a.d.f(unique(cbind(do.call)))
   user  system elapsed 
  3.909   0.000   3.914 
Eric Berger's unique(...pmin...pmax)
   user  system elapsed 
  0.848   0.000   0.850 
Eric Berger's transmuting tibble...
   user  system elapsed 
  0.986   0.000   0.988 
Kimmo Elo's [OP] mutating paste
   user  system elapsed 
 52.079   0.000  52.136 
Rui Barradas' sort-based
   user  system elapsed 
 42.327   0.080  42.450 
----

cheers, Greg

----
n <- 1000
x <- expand.grid(Source = 1:n, Target = 1:n)

cat("my apply\n")
system.time({
 y <- apply(x, 1, function(y) return (c(A=min(y), B=max(y))))
 unique(t(y))})
#   user  system elapsed
#  5.075   0.034   5.109

cat("Bert's !duplicated\n")
system.time({
 x[!duplicated(cbind(do.call(pmin, x), do.call(pmax, x))), ]
 })
#   user  system elapsed
#  1.340   0.013   1.353

# Still more efficient and still returning a data frame is:
cat("Bert's x[,2]>x[,1]\n")
system.time({
 w <- x[, 2] > x[,1]
 x[w, ] <- x[w, 2:1]
 unique(x)})
#   user  system elapsed
#  0.693   0.011   0.703

cat("my a.d.f(unique(cbind(do.call)))\n")
system.time({
  as.data.frame(unique(cbind(A=do.call(pmin,x), B=do.call(pmax,x))))
})

cat("Eric Berger's unique(...pmin...pmax)\n")
system.time({
  unique(data.frame(V1=pmin(x$Source,x$Target), V2=pmax(x$Source,x$Target)))
})

cat("Eric Berger's transmuting tibble...\n")
require(dplyr)
xt<-tibble(x)
system.time({
  xt %>% transmute( a=pmin(Source,Target), b=pmax(Source,Target)) %>%
    unique() %>% rename(Source=a, Target=b)
})

cat("Kimmo Elo's [OP] mutating paste\n")
system.time({
  xt %>%
    mutate(pair=mapply(function(x,y)
      paste0(sort(c(x,y)),collapse="-"), Source, Target)) %>%
    distinct(pair,
             .keep_all = T) %>%
    mutate(Source=sapply(pair, function(x)
      unlist(strsplit(x, split="-"))[1]), Target=sapply(pair, function(x)
        unlist(strsplit(x, split="-"))[2])) %>%
    select(-pair)
})

cat("Rui Barradas' sort-based\n")
system.time({
  apply(x, 1, sort) |> t() |> unique()
})



More information about the R-help mailing list