[Rd] Efficient Merging of two huge sorted data frames?---Use merge()?

Charles Cheung boom2k1 at hotmail.com
Tue May 9 02:20:23 CEST 2006


Hello all,

A problem I encountered today is the time it takes to merge two huge data 
frames.

I wish to merge them by (X,Y), with both frames sorted on those columns.

Data frame One consists of the variables
X, Y, sequence, position
and has ~700 000 records.

Data frame Two consists of
X, Y, intensities
and has ~900 000 records.


Every (X,Y) pair in data frame One is also present in data frame Two;
however, the reverse is not true.
Furthermore, it is (X,Y, position) that makes a record unique in data frame One.
(That means there can be multiple records with the same (X,Y) values!)

Taken together, this makes it hard to combine the two data frames by simply 
calling
data.frame(dataFrameOne, dataFrameTwo), because the rows won't correspond 
even when both frames are sorted by X and Y.


Intuitively, the merge should take very little time (O(n) complexity) once 
the data records are sorted.
However, it takes very long (the process had not finished after 20 minutes; it 
should only take <1 min) to merge the data frames by X and Y using

merge(dataFrameOne, dataFrameTwo, by=c("X","Y"))

which leads me to suspect that this process is not optimized for already 
sorted input.

Assuming the two frames have been sorted, I would expect to be able to do the 
following:


X Y seq pos
1 1 AA  32
1 2 AG  44
1 3 GC  65


X Y intensities
1 1 0.4
1 3 0.552

The cursors start at (1,1) and (1,1) --> merge that pair. The cursors then 
move to (1,2) and (1,3) --> no match. The first cursor moves on, giving 
(1,3) and (1,3) --> merge that pair.

Is the merge function doing that already?
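
To make the cursor idea concrete, here is a minimal sketch in R using the toy data from this message. The helper name sorted_match is hypothetical, and the sketch assumes that both frames are sorted by (X,Y) and that each (X,Y) pair occurs at most once in the second frame. It only illustrates the single-pass logic; an R-level loop like this is not expected to be fast on ~700 000 rows.

```r
# Toy data from the example; both frames are already sorted by (X, Y).
one <- data.frame(X = c(1, 1, 1), Y = c(1, 2, 3),
                  seq = c("AA", "AG", "GC"), pos = c(32, 44, 65),
                  stringsAsFactors = FALSE)
two <- data.frame(X = c(1, 1), Y = c(1, 3), intensities = c(0.4, 0.552))

# Hypothetical helper: for each row of `a`, return the index of the matching
# row of `b` (or NA), walking one cursor through each frame.
# Assumes both frames are sorted by (X, Y) and (X, Y) is unique in `b`.
sorted_match <- function(a, b) {
  ax <- a$X; ay <- a$Y
  bx <- b$X; by <- b$Y
  nb <- length(bx)
  idx <- rep(NA_integer_, length(ax))
  j <- 1L
  for (i in seq_along(ax)) {
    # Advance the cursor in b while its key is smaller than the key in a.
    while (j <= nb &&
           (bx[j] < ax[i] || (bx[j] == ax[i] && by[j] < ay[i]))) {
      j <- j + 1L
    }
    if (j > nb) break                      # b is exhausted; the rest stay NA
    if (bx[j] == ax[i] && by[j] == ay[i]) idx[i] <- j
    # j is not advanced on a match, so repeated (X, Y) rows in a reuse it.
  }
  idx
}

idx  <- sorted_match(one, two)
keep <- !is.na(idx)
merged <- cbind(one[keep, ], intensities = two$intensities[idx[keep]])
print(merged)
```

Because the cursor in the second frame is not advanced on a match, several records in the first frame that share the same (X,Y) all pick up the same intensity.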


Is there an efficient way to merge the data frames? (What do you suggest I 
should do?)
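
One possibility, offered only as a sketch: build a single key per row and look it up with match(), which is vectorized and hash-based, so it does not depend on the frames being sorted. This assumes that each (X,Y) pair occurs at most once in the second frame and that pasting X and Y with a separator yields an unambiguous key (for example, integer-valued coordinates). I have not timed it against merge() on data of the size described here.

```r
# Toy data from the example in this message.
one <- data.frame(X = c(1, 1, 1), Y = c(1, 2, 3),
                  seq = c("AA", "AG", "GC"), pos = c(32, 44, 65),
                  stringsAsFactors = FALSE)
two <- data.frame(X = c(1, 1), Y = c(1, 3), intensities = c(0.4, 0.552))

# One character key per row; assumes "X_Y" identifies a pair unambiguously.
key_one <- paste(one$X, one$Y, sep = "_")
key_two <- paste(two$X, two$Y, sep = "_")

# For each row of `one`, the row of `two` with the same key (NA if none).
idx  <- match(key_one, key_two)
keep <- !is.na(idx)

merged <- cbind(one[keep, ], intensities = two$intensities[idx[keep]])
print(merged)
```

Rows of the first frame that share an (X,Y) pair simply receive the same index, so the many-to-one mapping is handled without extra work.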


(to produce)
X Y seq pos intensities
1 1 AA  32  0.4
1 3 GC  65  0.552
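
For reference, this is the merge() call applied to the toy data, as a small self-contained sketch. With its default arguments merge() keeps only the (X,Y) pairs present in both frames, which is the result wanted here.

```r
# Toy data from the example in this message.
one <- data.frame(X = c(1, 1, 1), Y = c(1, 2, 3),
                  seq = c("AA", "AG", "GC"), pos = c(32, 44, 65),
                  stringsAsFactors = FALSE)
two <- data.frame(X = c(1, 1), Y = c(1, 3), intensities = c(0.4, 0.552))

# Default merge keeps only keys found in both frames.
merged <- merge(one, two, by = c("X", "Y"))
print(merged)
```

Passing all.x = TRUE instead would keep the unmatched (1,2) record, with NA as its intensity.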

Thank you in advance!


Charles Cheung


