[Rd] extracting rows from a data frame by looping over the row names: performance issues

hadley wickham h.wickham at gmail.com
Sat Mar 3 20:13:07 CET 2007


On 3/3/07, hpages at fhcrc.org <hpages at fhcrc.org> wrote:
> Quoting hpages at fhcrc.org:
> > In [.data.frame if you replace this:
> >
> >     ...
> >     if (is.character(i)) {
> >         rows <- attr(xx, "row.names")
> >         i <- pmatch(i, rows, duplicates.ok = TRUE)
> >     }
> >     ...
> >
> > by this
> >
> >     ...
> >     if (is.character(i)) {
> >         rows <- attr(xx, "row.names")
> >         if (typeof(rows) == "integer")
> >             i <- as.integer(i)
> >         else
> >             i <- pmatch(i, rows, duplicates.ok = TRUE)
> >     }
> >     ...
> >
> > then you get a huge boost:
> >
> >   - with current [.data.frame
> >     > system.time(for (i in 1:100) dat["1", ])
> >        user  system elapsed
> >      34.994   1.084  37.915
> >
> >   - with "patched" [.data.frame
> >     > system.time(for (i in 1:100) dat["1", ])
> >        user  system elapsed
> >       0.264   0.068   0.364
> >
>
> mmmh, replacing
>     i <- pmatch(i, rows, duplicates.ok = TRUE)
> by just
>     i <- as.integer(i)
> was a bit naive: it gives wrong results whenever 'rows' is not the
> sequence 1:nrow(dat).
>
> So I need to be more careful: first call 'match' to find the exact
> matches, then call 'pmatch' _only_ on those indices that don't have
> an exact match. Something like this:
>
>     if (is.character(i)) {
>         rows <- attr(xx, "row.names")
>         if (typeof(rows) == "integer") {
>             i2 <- match(as.integer(i), rows)
>             if (any(is.na(i2)))
>                 i2[is.na(i2)] <- pmatch(i[is.na(i2)], rows, duplicates.ok = TRUE)
>             i <- i2
>         } else {
>             i <- pmatch(i, rows, duplicates.ok = TRUE)
>         }
>     }
>
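[For experimenting outside of [.data.frame, the quoted fragment can be wrapped in a standalone helper; the function name resolve_rows is mine, purely illustrative:

```r
## Map character indices 'i' to positions in 'rows': try exact integer
## matches first, fall back to pmatch() only where that fails.
resolve_rows <- function(i, rows) {
    if (typeof(rows) == "integer") {
        i2 <- match(as.integer(i), rows)
        if (any(is.na(i2)))
            i2[is.na(i2)] <- pmatch(i[is.na(i2)], rows, duplicates.ok = TRUE)
        i2
    } else {
        pmatch(i, rows, duplicates.ok = TRUE)
    }
}

rows <- c(11L, 25L, 1L, 3L)
resolve_rows(c("1", "3"), rows)  # exact matches -> positions 3 and 4
resolve_rows("2", rows)          # no exact match; pmatch() falls back to "25" -> 2
```
]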
> Correctness:
>
>   > dat2 <- data.frame(aa=c('a', 'b', 'c', 'd'), bb=1:4, row.names=c(11,25,1,3))
>   > dat2
>      aa bb
>   11  a  1
>   25  b  2
>   1   c  3
>   3   d  4
>
>   > dat2["1",]
>     aa bb
>   1  c  3
>
>   > dat2["3",]
>     aa bb
>   3  d  4
>
>   > dat2["2",]
>      aa bb
>   25  b  2
>
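[The dat2["2",] result above is pmatch()'s partial matching at work; the patch preserves that behaviour for indices with no exact match. A small illustration, independent of the patch:

```r
rows <- c(11L, 25L, 1L, 3L)
match("2", rows)                         # no exact match: NA
pmatch("2", rows, duplicates.ok = TRUE)  # "2" is a unique prefix of "25": 2
pmatch("1", rows, duplicates.ok = TRUE)  # exact match to "1" beats prefix "11": 3
```
]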
> Performance:
>
>   > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
>   > dat <- as.data.frame(mat)
>   > system.time(for (i in 1:100) dat["1", ])
>      user  system elapsed
>     2.036   0.880   2.917
>
> Still about 17 times faster than the non-patched [.data.frame.
>
> Maybe 'pmatch(x, table, ...)' itself could be improved to be more
> efficient when 'x' is a character vector and 'table' an integer
> vector, so that the above trick is no longer needed.
>
> My point is that something can probably be done to improve the
> performance of 'dat[i, ]' when the row names are stored as an integer
> vector and 'i' is a character vector. I'm assuming that, in the
> typical use case, there is an exact match for 'i' in the row names,
> so converting all the row names to a character vector just to find
> that match is (most of the time) a waste of time.

But why bother?  If you know the index of the row, why not index with
a numeric vector rather than a string?  The behaviour in that case
seems obvious and fast.
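
[A quick way to see the difference Hadley is pointing at; sizes are scaled down from the original benchmark, and exact timings will vary, but the numeric index skips the row-name lookup entirely:

```r
mat <- matrix(rep(paste(letters, collapse = ""), 5 * 100000), ncol = 5)
dat <- as.data.frame(mat)

system.time(for (k in 1:10) dat["1", ])  # character index: matched against row names
system.time(for (k in 1:10) dat[1, ])    # numeric index: direct positional lookup
identical(dat["1", ], dat[1, ])          # same row either way here, row names being 1:nrow
```
]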

Hadley



More information about the R-devel mailing list