[R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

William Dunlap wdunlap at tibco.com
Fri Nov 1 17:52:53 CET 2013


Have you looked into the 'igraph' package?
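For example, a minimal sketch along these lines (assuming the pairs sit
in a two-column data frame 'd', as in your code below, and that
'graph_from_data_frame' and 'components' from a current igraph are
available):

library(igraph)
# build an undirected graph whose edges are the string pairs
g <- graph_from_data_frame(d, directed = FALSE)
# label every vertex (string) with an integer component id;
# two strings are equivalent iff they carry the same id
memb <- components(g)$membership
memb["A"] == memb["B"]   # using the A/B strings from the example below

Connected components are exactly the transitive equivalence sets you
describe, and igraph computes them in roughly linear time.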

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of Magnus Thor Torfason
> Sent: Friday, November 01, 2013 8:23 AM
> To: r-help at r-project.org
> Subject: Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+
> days
> 
> Sure,
> 
> I was attempting to be concise and boiling it down to what I saw as the
> root issue, but you are right, I could have taken it a step further. So
> here goes.
> 
> I have a set of around 20M string pairs. A given string (say, A) can
> either be equivalent to another string (B) or not. If A and B occur
> together in the same pair, they are equivalent. But equivalence is
> transitive, so if A and B occur together in one pair, and B and C occur
> together in another pair, then A and C are also equivalent. I need a way
> to quickly determine whether any two strings from my data set are
> equivalent.
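> 
> For what it's worth, I gather this is the classic union-find
> (disjoint-set) problem. A minimal base-R sketch of that formulation,
> assuming the strings have first been mapped to integer ids 1..n (the
> names 'parent', 'find' and 'unite' are made up, not from any package):
> 
> n <- 5L                       # number of distinct strings (illustrative)
> parent <- seq_len(n)          # every id starts out as its own root
> find <- function(x) {         # walk to the root, compressing the path
>      while (parent[x] != x) {
>          parent[x] <<- parent[parent[x]]   # path halving
>          x <- parent[x]
>      }
>      x
> }
> unite <- function(a, b) {     # merge the sets containing a and b
>      ra <- find(a); rb <- find(b)
>      if (ra != rb) parent[max(ra, rb)] <<- min(ra, rb)
> }
> 
> After calling unite() on each pair, find(i) == find(j) answers the
> equivalence question; the path halving keeps the chains short.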
> 
> The way I do this currently is to designate the smallest
> (alphabetically) string in each known equivalence set as the "main"
> entry. For each pair, I therefore insert two entries into the hash
> table, both pointing at the main value. So assuming the input data:
> 
> A,B
> B,C
> D,E
> 
> I would then have:
> 
> A->A
> B->A
> C->B
> D->D
> E->D
> 
> Except that I also follow each chain until I reach the end (key==value),
> and insert pointers to the "main" value for every value I find along the
> way. After doing that, I end up with:
> 
> A->A
> B->A
> C->A
> D->D
> E->D
> 
> And I can then check equivalence very quickly, either by comparing the
> hash-table values of two strings, or simply by replacing each string
> with its "main" value up front and using plain comparison from then on.
> The code for generating the final hash table is as follows:
> 
> h : Empty hash table created with hash.new()
> d : Input data
> hash.deep.get : Function that follows the chain of values from a key
> until it reaches one whose value is itself (until hash.get(X)==X), then
> returns all the values found along the way in a vector
> 
> 
> h = hash.new()
> for ( i in 1:nrow(d) )
> {
>      # resolve both strings of the pair to their full chains
>      deep.a      = hash.deep.get(h, d$a[i])
>      deep.b      = hash.deep.get(h, d$b[i])
>      # their union is a single equivalence set; its alphabetically
>      # smallest member becomes the "main" entry for the whole set
>      equivalents = sort(unique(c(deep.a,deep.b)))
>      equiv.id    = min(equivalents)
>      # point every member of the set directly at the main entry
>      for ( equivalent in equivalents )
>      {
>          hash.put(h, equivalent, equiv.id)
>      }
> }
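> 
> For completeness, a sketch of how these helpers can be written over a
> plain environment (the bodies below are illustrative rather than my
> exact code):
> 
> hash.new <- function() new.env(hash = TRUE)
> hash.put <- function(h, key, value) assign(key, value, envir = h)
> hash.get <- function(h, key) {
>      if (exists(key, envir = h, inherits = FALSE))
>          get(key, envir = h, inherits = FALSE)
>      else
>          NULL
> }
> hash.deep.get <- function(h, key) {
>      # follow key -> value links until a self-loop (hash.get(X) == X),
>      # collecting every string seen along the chain
>      chain <- key
>      val   <- hash.get(h, key)
>      while (!is.null(val) && val != key) {
>          key   <- val
>          chain <- c(chain, key)
>          val   <- hash.get(h, key)
>      }
>      chain
> }
> 
> With the table built, checking two strings is then just, e.g.,
> hash.get(h, "B") == hash.get(h, "C").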
> 
> 
> I would so much appreciate it if there were a simpler and faster way to
> do this. Keeping my fingers crossed that one of the R-help geniuses who
> sees this is sufficiently interested to crack the problem.
> 
> Best,
> Magnus
> 
> On 11/1/2013 1:49 PM, jim holtman wrote:
> > It would be nice if you followed the posting guidelines and at least
> > showed the script that was creating your entries now so that we
> > understand the problem you are trying to solve.  A bit more
> > explanation of why you want this would be useful.  This gets to the
> > second part of my tag line:  Tell me what you want to do, not how you
> > want to do it.  There may be other solutions to your problem.
> >
> > Jim Holtman
> > Data Munger Guru
> >
> > What is the problem that you are trying to solve?
> > Tell me what you want to do, not how you want to do it.
> >
> >
> > On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason
> > <zulutime.net at gmail.com> wrote:
> >> Pretty much what the subject says:
> >>
> >> I used an env as the basis for a hash table in R, based on information
> >> that this is in fact how environments are implemented under the hood.
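> >>
> >> Concretely, the idiom is along these lines (the keys and values here
> >> are placeholders; 'size' is the initial table size documented in
> >> ?new.env):
> >>
> >> h <- new.env(hash = TRUE, size = 1000L)
> >> assign("key1", "value1", envir = h)             # insert
> >> get("key1", envir = h)                          # retrieve
> >> exists("key1", envir = h, inherits = FALSE)     # membership test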
> >>
> >> I've been experimenting with doubling the number of entries, and so far it
> >> has seemed to be scaling more or less linearly, as expected.
> >>
> >> But as I went from 17 million entries to 34 million entries, the
> >> completion time has gone from 18 hours to 5 days and counting.
> >>
> >>
> >> The keys and values are in all cases strings of equal length.
> >>
> >> One might suspect that the slow-down has to do with memory being swapped
> >> to disk, but from what I know about my computing environment, that should
> >> not be the case.
> >>
> >> So my first question:
> >> Is anyone familiar with anything in the implementation of environments
> >> that would limit their use or slow them down faster than O(n log(n)) as
> >> the number of entries is increased?
> >>
> >> And my second question:
> >> I realize that this is not strictly what R environments were designed
> >> for, but it is what my algorithm requires: I must go through these
> >> millions of entries, storing them in the hash table and sometimes
> >> retrieving them along the way, in a more or less random order that
> >> depends on the data I am encountering and on the contents of the hash
> >> table at each moment.
> >>
> >> Does anyone have a good recommendation for alternative ways to implement
> >> huge, fast, table-like structures in R?
> >>
> >> Best,
> >> Magnus
> >>
> >
> 


