[Rd] merge performance degradation in 2.9.1

Adrian Dragulescu adrian_d at eskimo.com
Thu Jul 9 19:05:43 CEST 2009


I have noticed a significant performance degradation using merge() in 
R 2.9.1 relative to R 2.8.1.  Here is what I observed:

   # Build a large data frame (12 groups x N rows) and a 12-row lookup table
   N <- 100000
   X <- data.frame(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N))
   X$mon <- as.character(X$mon)
   Y <- data.frame(mon=month.abb, letter=letters[1:12])
   Y$mon <- as.character(Y$mon)

   # Same lookup table with a second key column
   Z <- cbind(Y, group=1:12)

   # Merge on a single character key
   system.time(Out <- merge(X, Y, by="mon", all=TRUE))
   # R 2.8.1 is 17% faster than R 2.9.1 for N=100000

   # Merge on two keys
   system.time(Out <- merge(X, Z, by=c("mon", "group"), all=TRUE))
   # R 2.8.1 is 16% faster than R 2.9.1 for N=100000
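
A single system.time() call can be noisy, so the comparison is easier to 
trust if each merge is timed a few times in each R version. The following 
is only a minimal sketch of that idea: time_merge and reps are hypothetical 
names that are not part of the test above, and the data are rebuilt so the 
block runs on its own.

```r
# Sketch: median elapsed time over a few repetitions of one merge call.
# time_merge() and its 'reps' argument are hypothetical, for illustration only.
time_merge <- function(x, y, by, reps = 3) {
  elapsed <- numeric(reps)
  for (i in seq_len(reps)) {
    elapsed[i] <- system.time(merge(x, y, by = by, all = TRUE))[["elapsed"]]
  }
  median(elapsed)
}

# Same data as in the test above
N <- 100000
X <- data.frame(group = rep(12:1, each = N),
                mon = rep(rev(month.abb), each = N),
                stringsAsFactors = FALSE)
Y <- data.frame(mon = month.abb, letter = letters[1:12],
                stringsAsFactors = FALSE)
Z <- cbind(Y, group = 1:12)

t_one_key  <- time_merge(X, Y, by = "mon")
t_two_keys <- time_merge(X, Z, by = c("mon", "group"))

# Print the R version next to the timings so runs can be compared by hand
cat(R.version.string, "\n")
cat("one key: ", t_one_key,  "s\n")
cat("two keys:", t_two_keys, "s\n")
```

Running the same script under both versions gives two medians per merge 
that can then be compared directly.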

Here is the head of summaryRprof() for 2.8.1:
$by.self
                    self.time self.pct total.time total.pct
sort.list               4.60     56.5       4.60      56.5
make.unique             1.68     20.6       2.18      26.8
as.character            0.50      6.1       0.50       6.1
duplicated.default      0.50      6.1       0.50       6.1
merge.data.frame        0.20      2.5       8.02      98.5
[.data.frame            0.16      2.0       7.10      87.2

and for 2.9.1:
$by.self
                    self.time self.pct total.time total.pct
sort.list               4.66     39.2       4.66      39.2
nchar                   3.28     27.6       3.28      27.6
make.unique             1.42     12.0       1.92      16.2
as.character            0.50      4.2       0.50       4.2
data.frame              0.46      3.9       4.12      34.7
[.data.frame            0.44      3.7       7.28      61.3

As you can see, the 2.9.1 profile has an nchar entry that accounts for 
3.28 s (27.6% of self time) and does not appear in the head of the 2.8.1 
profile.
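
Profiles like the two above can be collected with Rprof() and 
summaryRprof(). The following is a minimal sketch, not necessarily the 
exact commands used to produce the tables; prof_file is a hypothetical 
name, and the data are rebuilt so the block runs on its own.

```r
# Sketch: profile the single-key merge and list the top self-time entries.
N <- 100000
X <- data.frame(group = rep(12:1, each = N),
                mon = rep(rev(month.abb), each = N),
                stringsAsFactors = FALSE)
Y <- data.frame(mon = month.abb, letter = letters[1:12],
                stringsAsFactors = FALSE)

prof_file <- tempfile("merge_prof")   # hypothetical output location
Rprof(prof_file)                      # start the sampling profiler
Out <- merge(X, Y, by = "mon", all = TRUE)
Rprof(NULL)                           # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)                    # functions ranked by self time
```

Comparing head(prof$by.self) between the two versions shows which 
functions account for the extra time.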

Is there a way to avoid the degradation in performance in 2.9.1?

Thank you,
Adrian

As an aside, I became interested in testing merge in 2.9.1 after reading 
the r-devel message from 30-May-2009, "Degraded performance with rank()" 
by Tim Bergsma, in which he mentions doing merges, but I only decided to 
test it today.


