Adeel - SafeGreenCapital adeel.amin at gmail.com
Thu May 23 15:05:37 CEST 2013

Hi Rainer:

Thanks for the reply.  Posting the full dataset isn't practical.  There are
8M rows between the two data frames, and the first discrepancy in the data
doesn't appear until at least the 40,000th row of each.  The examples I
posted are a pretty good abstraction of the root of the issue.

The problem isn't the data.  The problem is running out of memory (OOM)
when doing operations like merge, rbind, etc.  The solution that Blaser
suggested in his post works great, but on the full data the systems quickly
run out of memory.  What does work without OOM issues is for/while loops,
but they take an inordinate amount of time to compute and tie up a machine
for hours at a time.  Essentially I break the data apart, add rows and
rebind.  It's a brute-force approach, and run times are in excess of 48
hours for one full iteration across 25 data frames.  Terrible.
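
For reference, here is a minimal base R sketch of the break-apart, add-rows,
rebind pattern described above.  The data, the grouping column `g`, and the
"summary row" step are all hypothetical stand-ins, not the real workload.
The only point it illustrates is collecting the pieces in a pre-sized list
and binding once at the end, which may help if the loop currently rebinds
on every iteration.

```r
# Hypothetical toy input: a data frame split into chunks by a grouping column
df <- data.frame(g = rep(c("a", "b", "c"), each = 2), v = 1:6,
                 stringsAsFactors = FALSE)
chunks <- split(df, df$g)

# Process each chunk independently and keep results in a pre-sized list
pieces <- vector("list", length(chunks))
for (i in seq_along(chunks)) {
  chunk <- chunks[[i]]
  # Illustrative "add rows" step (an assumption): one summary row per chunk
  extra <- data.frame(g = chunk$g[1], v = sum(chunk$v),
                      stringsAsFactors = FALSE)
  pieces[[i]] <- rbind(chunk, extra)
}

# Bind once at the end instead of growing the result inside the loop
result <- do.call(rbind, pieces)
```

Whether this changes the 48-hour figure depends on where the time actually
goes, which I can't tell from a toy example.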

I am about to go down the road of using the data.table package, as it's far
more memory efficient, but the documentation is cryptic.  Your idea of
creating a superset has some merit, and it's what I was experimenting with
prior to my original post.
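
A rough sketch of how I read the superset idea in data.table terms: build
the union of keys from both tables, then join each table against it so both
end up with the same rows, with NA where a table had no match.  The tables
`a` and `b`, the key column `id`, and the value columns are made up for
illustration; the real frames are keyed differently.

```r
library(data.table)

# Hypothetical toy data: two tables sharing a key column `id`
a <- data.table(id = c(1L, 2L, 4L), x = c(10, 20, 40))
b <- data.table(id = c(1L, 3L, 4L), y = c(1.5, 3.5, 4.5))

# Superset of all keys appearing in either table
keys <- data.table(id = sort(unique(c(a$id, b$id))))

# Key both tables so the joins below match on `id`
setkey(a, id)
setkey(b, id)

# X[Y] join: one row per key in `keys`; keys missing from the table get NA
a_full <- a[keys]
b_full <- b[keys]
```

After this, `a_full` and `b_full` have identical `id` columns, so they can
be compared or combined row by row.  I haven't verified how this behaves
memory-wise at 8M rows.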

