[R] merge( , by='row.names') slowness

rex.dwyer at syngenta.com
Thu Mar 3 00:12:36 CET 2011



-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of dms
Sent: Wednesday, March 02, 2011 3:16 PM
To: r-help at r-project.org
Subject: [R] merge( , by='row.names') slowness

I noticed that joining two data.frames in R with the "merge"
function is substantially slower with by='row.names' than it is
when joining on a common index column.

With data frames of ~10,000 rows, the by='row.names' merge takes as
long as 10 minutes, versus about 1 second when using an index column.
Beyond the 10^6 range, it's unusably slow.


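n <- 5
# Two data frames whose ids overlap only partially; row names mirror the id column.
a <- data.frame(id=as.character(1:10^n), x=rnorm(10^n))
rownames(a) <- a$id
b <- data.frame(id=as.character(1:10^n + 10^(n-1)), y=rnorm(10^n))
rownames(b) <- b$id

date()
fast <- merge(a, b, all=TRUE)                  # joins on the common column "id"
date()
slow <- merge(a, b, all=TRUE, by='row.names')  # joins on the row names
date()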


Has anybody else noticed this?
_________________________________________________

Hi DMS,
Well, first off, the two calls don't give the same answer; in fact, the results don't even have the same dimensions (3 columns versus 5 in the output below).
Even so, from looking at merge.data.frame, it's not immediately obvious what would make a difference of this magnitude.
The answer might be buried in the internal merge.

Here for n=3:
> system.time(print(dim(merge(a,b,all=T))))
[1] 1100    3
   user  system elapsed
   0.01    0.00    0.01
> system.time(print(dim(merge(a,b,all=T,by=1))))
[1] 1100    3
   user  system elapsed
   0.01    0.00    0.02
> system.time(print(dim(merge(a,b,all=T,by=0))))
[1] 1100    5
   user  system elapsed
   3.26    0.00    3.17
> system.time(print(dim(merge(a,b,all=T,by="row.names"))))
[1] 1100    5
   user  system elapsed
   3.17    0.00    3.17
>
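
One possible workaround, offered as a sketch rather than a diagnosis: copy the row names into an ordinary column and merge on that column, so that merge() is given a plain column key instead of by="row.names". The helper name merge_on_rownames and its key argument are hypothetical (not part of base R), and I have not verified that this avoids the slowdown on every R version, so it is worth timing both calls with system.time() on your own data.

```r
# Small inputs built the same way as in the original post (n = 3).
set.seed(1)
n <- 3
a <- data.frame(id = as.character(1:10^n), x = rnorm(10^n))
b <- data.frame(id = as.character(1:10^n + 10^(n - 1)), y = rnorm(10^n))
rownames(a) <- a$id
rownames(b) <- b$id

# Hypothetical helper: put the row names into a regular column named by
# `key`, then let merge() join on that column. Extra arguments such as
# all = TRUE are passed through to merge().
merge_on_rownames <- function(x, y, key = "Row.names", ...) {
  x[[key]] <- rownames(x)
  y[[key]] <- rownames(y)
  merge(x, y, by = key, ...)
}

via_column   <- merge_on_rownames(a, b, all = TRUE)
via_rownames <- merge(a, b, all = TRUE, by = "row.names")

# Both results should cover the same keys and have the same shape.
dim(via_column)
dim(via_rownames)

# Compare the cost of the two routes on your own R version.
system.time(merge_on_rownames(a, b, all = TRUE))
system.time(merge(a, b, all = TRUE, by = "row.names"))
```

Because both a and b also carry an id column, either route keeps it twice (as id.x and id.y); drop it from one side first if you only want a single copy.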
