[Rd] Efficient Merging of two huge sorted data frames?---Use merge()?

Prof Brian Ripley ripley at stats.ox.ac.uk
Tue May 9 09:50:36 CEST 2006


merge() is not optimized for large data frames.  To do things on this 
scale you really want to be using a DBMS not R.  See the `R Data 
Import/Export Manual'.

Sorting is not really relevant, especially as merge is not assuming that 
the match is unique.  Hashing could be used, but is not.

As R is open source, you have the source code and it would be kinder to 
read it yourself rather than expect this list to read it for you.  A 
useful contribution to the R project would be to contribute a more 
efficient version, and we look forwards to seeing your contribution.

On Mon, 8 May 2006, Charles Cheung wrote:

> Hello all,
>
> A problem I encounter today is the speed which takes to sort two huge data
> frames...
>
> I wish to sort by (X,Y)
>
> Dataframe One consists of variables:
> X, Y, sequence, position
> having ~700 000 records
>
> another dataframe consists of
> X,Y, intensities
> having ~900 000 records
>
>
> Every (X,Y) pair in dataframe One is included in dataframe Two,
> however,  the reverse is not true.
> Furthermore,  (X,Y, position) in data frame One makes the record unique.
> (That means there can be multiple records with the same (X,Y) records!)
>
> Added together, it makes it hard to just combine the two data frames
> together by simply going
> data.frame(dataFrameOne, dataFrameTwo) because the mapping won't correspond
> even in sorted records by X and Y.
>
>
> Intuitive, it should only require very little time <O(n) complexity> after
> the data records are sorted.
> However, it takes so long (I haven't finished the process in 20 minutes.. it
> should only take <1 min) to merge the list by X and Y using
>
> merge(dataFrameOne, dataFrameTwo, by=c("X","Y") , which leads me to suspect
> this process is not optimized for already sorted list.
>
> * assuming the two frames have been sorted, I would be able to do the
> following:
>
>
> X Y seq Pos
> 1 1   AA  32
> 1 2   AG  44
> 1 3   GC  65
>
>
> X Y intensities
> 1 1  0.4
> 1 3  0.552
>
>>> Cursor at beginning (1,1) (1,1) -->merge the (1,1) pair.. then cursor
>>> moves to (1,2) (1,3)  --> can't find..     cursor moves to (1,3) (1,3) ..
>>> merge that pair
>
> Is the merge function doing that already?
>
>
> Is there an efficient way to merge the data frames? (What do you suggest I
> should do?)
>
>
> (to produce)
> X Y seq pos intensities
> 1 1 AA   32     0.4
> 1 3 GC  65     0.552
>
> Thank you in advance!
>
>
> Charles Cheung
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



More information about the R-devel mailing list