[Rd] [.data.frame speedup

Martin Maechler maechler at stat.math.ethz.ch
Thu Jul 3 22:08:34 CEST 2008

>>>>> "TH" == Tim Hesterberg <timhesterberg at gmail.com>
>>>>>     on Tue, 1 Jul 2008 15:23:53 -0700 writes:

    TH> There is a bug in the standard version of [.data.frame;
    TH> it mixes up handling duplicates and NAs when subscripting rows.

    TH> x <- data.frame(x=1:3, y=2:4, row.names=c("a","b","NA"))
    TH> y <- x[c(2:3, NA),]
    TH> y

    TH> It creates a data frame with duplicate rows, but won't print.

and that's a bug, indeed
("introduced" in R version 2.5.0, when the [.data.frame code was heavily
optimized for speed, with considerable care), and I have committed
a fix (and a regression test) to both R-devel and R-patched.
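
For readers following along, here is a minimal sketch of how the two
cases collide. It is not the fix that was committed, only an illustration
under the assumption that a missing row name ends up as the string "NA";
the variable names rows and fixed are made up for this example.

```r
# Tim's example: one row is literally named "NA"
x <- data.frame(x = 1:3, y = 2:4, row.names = c("a", "b", "NA"))

# Row names selected by the index c(2:3, NA): a real NA sits next to "NA"
rows <- row.names(x)[c(2:3, NA)]

# Assumption for illustration: the missing name is turned into the string "NA"
rows[is.na(rows)] <- "NA"

# The result now contains a duplicate, although the index had none
has_dup <- anyDuplicated(rows) > 0

# make.unique() is one way to restore unique names
fixed <- make.unique(rows)
print(fixed)
```

The point is that the NA check and the duplicate check cannot be treated
independently: replacing NAs can itself create duplicates.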

Thanks a lot for the bug report, Tim!

Now about your newly proposed code:
I'm sorry to say that it looks so different from the current
[.data.frame source code that I don't think we would readily
accept it as a substitute.

Could you try to provide a minimal patch against the source code
and also a self-contained example that exhibits the speed gain
you are aiming for?
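
A self-contained timing example could look roughly like the sketch below.
The data frame big, its size, and the repetition count are arbitrary
choices made for illustration, not figures from Tim's message; the index
is strictly increasing, which is one of the cases the proposal targets.

```r
set.seed(1)
n <- 1e5

# Hypothetical test data: a large data frame with default row names
big <- data.frame(id = seq_len(n), value = rnorm(n))

# A strictly increasing row index, so it cannot contain duplicates
i <- sort(sample.int(n, n / 2))

# Time repeated row subscripting with the [.data.frame currently in use
timing <- system.time(for (k in 1:20) sub <- big[i, ])
print(timing)
```

Running the same script before and after applying a patch would give a
direct comparison of the two versions.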

Best regards,
Martin Maechler, ETH Zurich


    TH> On Tue, Jul 1, 2008 at 11:20 AM, Tim Hesterberg <timhesterberg at gmail.com>
    TH> wrote:

    >> Below is a version of [.data.frame that is faster
    >> for subscripting rows of large data frames; it avoids calling
    >> duplicated(rows)
    >> if there is no need to check for duplicate row names, when:
    >>   - i is logical
    >>   - attr(x, "dup.row.names") is not NULL (S+ compatibility)
    >>   - i is numeric and negative
    >>   - i is strictly increasing
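
The four conditions listed above can be sketched as a small predicate.
This is not Tim's code (which is not included in this message); the
function name can_skip_dup_check is hypothetical, and the early return on
NA is an added assumption motivated by the NA bug discussed in this
thread.

```r
# Hypothetical helper: TRUE when the duplicate-row-name check can be skipped
can_skip_dup_check <- function(x, i) {
  # Added assumption: any NA in the index forces the full check
  if (anyNA(i)) return(FALSE)
  # A logical index selects each row at most once
  if (is.logical(i)) return(TRUE)
  # S+ compatibility: duplicate row names are explicitly allowed
  if (!is.null(attr(x, "dup.row.names"))) return(TRUE)
  if (is.numeric(i)) {
    # Negative indices only drop rows
    if (all(i < 0)) return(TRUE)
    # A strictly increasing index cannot repeat a row
    if (!is.unsorted(i, strictly = TRUE)) return(TRUE)
  }
  FALSE
}

df <- data.frame(x = 1:3, y = 2:4, row.names = c("a", "b", "c"))
print(can_skip_dup_check(df, c(1, 3)))  # strictly increasing index
print(can_skip_dup_check(df, c(2, 2)))  # repeated row
```

The predicate is deliberately conservative: an index such as c(3, 1) has
no duplicates but still falls through to the full check.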

