[Rd] [.data.frame speedup
maechler at stat.math.ethz.ch
Thu Jul 3 22:08:34 CEST 2008
>>>>> "TH" == Tim Hesterberg <timhesterberg at gmail.com>
>>>>> on Tue, 1 Jul 2008 15:23:53 -0700 writes:
TH> There is a bug in the standard version of [.data.frame;
TH> it mixes up handling duplicates and NAs when subscripting rows.
TH> x <- data.frame(x=1:3, y=2:4, row.names=c("a","b","NA"))
TH> y <- x[c(2:3, NA),]
TH> It creates a data frame with duplicate rows, but won't print.
and that's a bug, indeed
("introduced" in R version 2.5.0, when the [.data.frame code was heavily
optimized for speed, with quite some care), and I have committed
a fix (and a regression test) to both R-devel and R-patched.
Thanks a lot for the bug report, Tim!
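For readers who want to check their own R build, the following is a minimal sketch of the kind of check involved. It is not the regression test that was committed (that test is not shown in this message); it only assumes that a fixed version should return unique row names and a printable result for Tim's example.

```r
## Tim's example: one row is literally named "NA", and the
## subscript also contains a genuine NA.
x <- data.frame(x = 1:3, y = 2:4, row.names = c("a", "b", "NA"))
y <- x[c(2:3, NA), ]

## Assumed expectations for a fixed R (illustrative only):
## three rows, no duplicated row names, last row filled with NA.
stopifnot(nrow(y) == 3L)
stopifnot(!anyDuplicated(row.names(y)))
stopifnot(is.na(y$x[3]))

## In an affected version this is the step that failed.
print(y)
```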
Now about your newly proposed code:
I'm sorry to say that it looks so different from the current source
that I don't think we would easily accept it as a substitute.
Could you try to provide a minimal patch against the source code,
and also a self-contained example that exhibits the speed gain
you are aiming for?
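As a rough illustration of what such a self-contained example could look like, here is a sketch. The data frame size, the column names, and the number of repetitions are arbitrary assumptions and not taken from Tim's code. It times row subscripting with a strictly increasing index, which is one of the cases the proposal targets.

```r
## Hypothetical benchmark setup; sizes are arbitrary.
set.seed(1)
n <- 1e5
big <- data.frame(id = seq_len(n), value = rnorm(n))

## A strictly increasing row index (sampled without replacement,
## then sorted), so no duplicate row names can arise.
i <- sort(sample.int(n, n / 2))

## Time repeated row subscripting with the [.data.frame in use.
elapsed <- system.time(
  for (k in 1:20) sub <- big[i, ]
)[["elapsed"]]
cat("elapsed seconds for 20 subscripting calls:", elapsed, "\n")
```

Running the same script once with the standard method and once with the patched one would give the comparison asked for above.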
Martin Maechler, ETH Zurich
TH> On Tue, Jul 1, 2008 at 11:20 AM, Tim Hesterberg <timhesterberg at gmail.com> wrote:
>> Below is a version of [.data.frame that is faster
>> for subscripting rows of large data frames; it avoids calling
>> if there is no need to check for duplicate row names, when:
>> i is logical
>> attr(x, "dup.row.names") is not NULL (S+ compatibility)
>> i is numeric and negative
>> i is strictly increasing
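The four conditions listed above can be sketched as a small predicate. This is not Tim's code: the helper name needs_dup_check is hypothetical, and the sketch follows the list literally, so it does not deal with NA values in a logical subscript.

```r
## Hypothetical helper: TRUE if duplicate row names are possible
## and a check is therefore needed, FALSE for the four cases
## listed above.
needs_dup_check <- function(x, i) {
  if (is.logical(i)) return(FALSE)                        # i is logical
  if (!is.null(attr(x, "dup.row.names"))) return(FALSE)   # S+ compatibility
  if (is.numeric(i) && !anyNA(i)) {
    if (all(i < 0)) return(FALSE)                         # negative: rows dropped
    if (all(diff(i) > 0)) return(FALSE)                   # strictly increasing
  }
  TRUE
}

df <- data.frame(x = 1:3, y = 2:4, row.names = c("a", "b", "c"))
needs_dup_check(df, c(TRUE, FALSE, TRUE))  # FALSE
needs_dup_check(df, c(1, 3))               # FALSE
needs_dup_check(df, c(2, 2))               # TRUE
```

A subscript with an NA, as in the bug report above, falls through to TRUE here, so the check is not skipped in that case.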
TH> R-devel at r-project.org mailing list