[R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

Magnus Thor Torfason zulutime.net at gmail.com
Fri Nov 8 10:34:23 CET 2013


Thanks to Thomas, Martin, Jim and William,

Your input was very informative, and thanks for the reference to Sedgewick.

In the end, it does seem to me that all of these algorithms require fast 
lookup of node data by ID, and that given such fast lookup, the overall 
algorithm can run in O(n) or O(n*log(n)) time (depending on whether each 
lookup is constant or logarithmic). I believe my original algorithm 
achieves that.

We come back to the fact that I assumed that R environments, implemented 
as hash tables, would give me that fast lookup. But on my systems, their 
performance (for both insert and lookup) degrades rapidly once the table 
reaches several million entries, certainly much worse than either O(1) 
or O(log(n)) per operation. I do not believe disk access is the cause: 
for example, when I test this on my desktop computer with a pure hash 
insert loop (sketched below), I observe 100% processor use but no disk 
access as the size of the hash table approaches millions of entries.
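
A minimal sketch of this kind of insert loop, in case others want to 
reproduce the behavior (the key format and table size here are 
illustrative, not exactly what I ran):

    n <- 5e6                               # several million entries
    e <- new.env(hash = TRUE, size = n)    # hashed environment
    keys <- as.character(seq_len(n))       # string keys, like my node IDs
    system.time(
        for (i in seq_len(n))
            assign(keys[i], i, envir = e)  # pure insert, no disk access
    )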

I have tested this on two systems, but have not dug into the 
implementation of hashed environments to examine this in detail. If 
others have the same (or different) experiences with using hashed 
environments with millions of entries, it would be very useful to know.

Barring a solution to the hashed environment slowdown, it seems the way 
to speed this algorithm up (within the confines of R) would be to move 
away from hash tables and toward numerically indexed arrays, along the 
lines sketched below.
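
For concreteness, a rough sketch of what I have in mind, assuming the 
pairs sit in two string vectors a and b: map the strings to the integers 
1..n with match(), then keep the equivalence structure in plain integer 
vectors, here as a weighted union-find (the structure Thomas describes 
below). All names and the example data are illustrative.

    a <- c("A", "A", "D")            # example pairs: A~B, A~C, D~E
    b <- c("B", "C", "E")

    ids <- unique(c(a, b))           # one integer index per string
    ai  <- match(a, ids)
    bi  <- match(b, ids)
    n   <- length(ids)

    parent <- seq_len(n)             # each node starts as its own root
    sz     <- rep(1L, n)             # tree sizes, for weighted union

    find_root <- function(i) {
        while (parent[i] != i) {
            parent[i] <<- parent[parent[i]]   # path halving
            i <- parent[i]
        }
        i
    }

    link <- function(i, j) {         # union by size
        ri <- find_root(i); rj <- find_root(j)
        if (ri == rj) return(invisible(NULL))
        if (sz[ri] < sz[rj]) { t <- ri; ri <- rj; rj <- t }
        parent[rj] <<- ri
        sz[ri]     <<- sz[ri] + sz[rj]
        invisible(NULL)
    }

    for (k in seq_along(ai))
        link(ai[k], bi[k])

    ## two strings are equivalent iff their roots agree
    find_root(match("B", ids)) == find_root(match("C", ids))   # TRUE
    find_root(match("A", ids)) == find_root(match("E", ids))   # FALSE

After the initial match(), no hashing is involved, so the per-operation 
cost should stay effectively constant regardless of n.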

Thanks again for all of the help,
Magnus

On 11/4/2013 8:20 PM, Thomas Lumley wrote:
> On Sat, Nov 2, 2013 at 11:12 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
>     On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote:
>
>         Sure,
>
>         I was attempting to be concise and boiling it down to what I
>         saw as the root issue, but you are right, I could have taken
>         it a step further. So here goes.
>
>         I have a set of around 20M string pairs. A given string (say,
>         A) can either be equivalent to another string (B) or not. If A
>         and B occur together in the same pair, they are equivalent.
>         But equivalence is transitive, so if A and B occur together in
>         one pair, and A and C occur together in another pair, then A
>         and C are also equivalent. I need a way to quickly determine
>         if any two strings from my data set are equivalent or not.
>
>
>     Do you mean that if A,B occur together and B,C occur together,
>     then A, B, and C are all equivalent?
>
>     Here's a function that returns a unique identifier (not well
>     tested!), allowing for transitive relations but not circularity.
>
>     uid <- function(x, y)
>     {
>         i <- seq_along(x)              # global index (unused here)
>         ## unique pair identifiers; separator avoids collisions
>         xy <- paste(x, y, sep = "\r")
>         idx <- match(xy, xy)
>
>         repeat {
>             ## transitive look-up
>             y_idx <- match(y[idx], x)  # look up 'y' in 'x'
>             keep <- !is.na(y_idx)
>             if (!any(keep))            # no transitive relations, done!
>                 break
>             x[idx[keep]] <- x[y_idx[keep]]
>             y[idx[keep]] <- y[y_idx[keep]]
>
>             ## create new index of values
>             xy <- paste(x, y, sep = "\r")
>             idx <- match(xy, xy)
>         }
>         idx
>     }
>
>     Values with the same index are equivalent. Some tests:
>
>          > x <- c(1, 2, 3, 4)
>          > y <- c(2, 3, 5, 6)
>          > uid(x, y)
>          [1] 1 1 1 4
>          > i <- sample(x); uid(x[i], y[i])
>          [1] 1 1 3 1
>          > uid(as.character(x), as.character(y))  ## character() ok
>          [1] 1 1 1 4
>          > uid(1:10, 1 + 1:10)
>           [1] 1 1 1 1 1 1 1 1 1 1
>          > uid(integer(), integer())
>          integer(0)
>          > x <- c(1, 2, 3)
>          > y <- c(2, 3, 1)
>          > uid(x, y)                              ## circular!
>            C-c C-c
>
>     I think this will scale well enough, but the worst-case number of
>     iterations can be made to be log(longest chain), and copying can
>     be reduced by using an index i and subsetting the original vector
>     on each iteration. I think you could test for circularity by
>     checking that the updated x are not a permutation of the kept x,
>     all(x[y_idx[keep]] %in% x[keep])
>
>     Martin
>
>
>
> This problem (union-find) is discussed in Chapter 1 of Sedgewick's
> "Algorithms".  There's an algorithm given that takes linear time to
> build the structure, worst-case logarithmic time to query, and
> effectively constant average time to query (inverse-Ackermann amortized
> complexity).
>
>     -thomas
>
> --
> Thomas Lumley
> Professor of Biostatistics
> University of Auckland


