[R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

Magnus Thor Torfason zulutime.net at gmail.com
Fri Nov 1 16:22:37 CET 2013


Sure,

I was attempting to be concise by boiling it down to what I saw as the 
root issue, but you are right, I could have taken it a step further. So 
here goes.

I have a set of around 20M string pairs. A given string (say, A) can 
either be equivalent to another string (B) or not. If A and B occur 
together in the same pair, they are equivalent. But equivalence is 
transitive, so if A and B occur together in one pair, and B and C occur 
together in another pair, then A and C are also equivalent. I need a way 
to quickly determine whether any two strings from my data set are 
equivalent or not.

The way I do this currently is to designate the smallest 
(alphabetically) string in each known equivalence set as the "main" 
entry. For each pair, I therefore insert two entries into the hash 
table, both pointing at the mail value. So assuming the input data:

A,B
B,C
D,E

I would then have:

A->A
B->A
C->B
D->D
E->D

Except that I also follow each chain until I reach the end (key==value), 
and insert pointers to the "main" value for every value I find along the 
way. After doing that, I end up with:

A->A
B->A
C->A
D->D
E->D

And I can then check equivalence very quickly, either by comparing the 
looked-up "main" values of two strings, or simply by replacing each 
string with its "main" value once and using simple comparison from then 
on. The code for generating the final hash table is as follows:

h : Empty hash table created with hash.new()
d : Input data, one string pair per row (columns a and b)
hash.deep.get : Function that follows the chain of values starting from 
a given key until it reaches an entry whose value equals its key (until 
hash.get(X)==X), and returns all the values visited along the way in a 
vector
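
The helper functions themselves are not shown above; just to make this 
self-contained, here is a minimal sketch of how such wrappers around an 
environment might look (a sketch of the idea, not my exact 
implementation):

hash.new <- function() new.env(hash = TRUE, parent = emptyenv())

hash.put <- function(h, key, value) assign(key, value, envir = h)

# Return the stored value, or the key itself if it has never been seen
hash.get <- function(h, key)
{
    if (exists(key, envir = h, inherits = FALSE))
        get(key, envir = h, inherits = FALSE)
    else
        key
}

# Follow the chain key -> value -> ... until hash.get(X) == X,
# returning every value visited (including the starting key)
hash.deep.get <- function(h, key)
{
    chain <- key
    repeat
    {
        nxt <- hash.get(h, key)
        if (identical(nxt, key)) break
        chain <- c(chain, nxt)
        key   <- nxt
    }
    chain
}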


h = hash.new()
for ( i in 1:nrow(d) )
{
    # Collect everything already known to be equivalent to either string
    deep.a      = hash.deep.get(h, d$a[i])
    deep.b      = hash.deep.get(h, d$b[i])
    equivalents = sort(unique(c(deep.a, deep.b)))
    # The alphabetically smallest member becomes the "main" entry
    equiv.id    = min(equivalents)
    # Point every member of the merged set at the "main" entry
    for ( equivalent in equivalents )
    {
        hash.put(h, equivalent, equiv.id)
    }
}
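
With the table built, checking whether two strings are equivalent is 
then just two lookups, e.g. for the example data above (using hash.get 
as sketched earlier):

hash.get(h, "B") == hash.get(h, "C")   # TRUE:  both map to "A"
hash.get(h, "C") == hash.get(h, "E")   # FALSE: "C" maps to "A", "E" to "D"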


I would so much appreciate it if there were a simpler and faster way to 
do this. Keeping my fingers crossed that one of the R-help geniuses who 
sees this is sufficiently interested to crack the problem.

Best,
Magnus

On 11/1/2013 1:49 PM, jim holtman wrote:
> It would be nice if you followed the posting guidelines and at least
> showed the script that was creating your entries now so that we
> understand the problem you are trying to solve.  A bit more
> explanation of why you want this would be useful.  This gets to the
> second part of my tag line:  Tell me what you want to do, not how you
> want to do it.  There may be other solutions to your problem.
>
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>
>
> On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason
> <zulutime.net at gmail.com> wrote:
>> Pretty much what the subject says:
>>
>> I used an env as the basis for a Hashtable in R, based on information that
>> this is in fact the way environments are implemented under the hood.
>>
>> I've been experimenting with doubling the number of entries, and so far it
>> has seemed to be scaling more or less linearly, as expected.
>>
>> But as I went from 17 million entries to 34 million entries, the completion
>> time has gone from 18 hours, to 5 days and counting.
>>
>>
>> The keys and values are in all cases strings of equal length.
>>
>> One might suspect that the slow-down might have to do with the memory being
>> swapped to disk, but from what I know about my computing environment, that
>> should not be the case.
>>
>> So my first question:
>> Is anyone familiar with anything in the implementation of environments that
>> would limit their use or slow them down (faster than O(n log(n))) as the
>> number of entries is increased?
>>
>> And my second question:
>> I realize that this is not strictly what R environments were designed for,
>> but this is what my algorithm requires: I must go through these millions of
>> entries, storing them in the hash table and sometimes retrieving them along
>> the way, in a more or less random manner, which is contingent on the data I
>> am encountering, and on the contents of the hash table at each moment.
>>
>> Does anyone have a good recommendation for alternatives to implement huge,
>> fast, table-like structures in R?
>>
>> Best,
>> Magnus
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>


