[Rd] extracting rows from a data frame by looping over the row names: performance issues

Sun Mar 4 01:03:07 CET 2007

Quoting hadley wickham <h.wickham at gmail.com>:

> On 3/3/07, hpages at fhcrc.org <hpages at fhcrc.org> wrote:
> > Quoting hpages at fhcrc.org:
> > > In [.data.frame if you replace this:
> > >
> > >     ...
> > >     if (is.character(i)) {
> > >         rows <- attr(xx, "row.names")
> > >         i <- pmatch(i, rows, duplicates.ok = TRUE)
> > >     }
> > >     ...
> > >
> > > by this
> > >
> > >     ...
> > >     if (is.character(i)) {
> > >         rows <- attr(xx, "row.names")
> > >         if (typeof(rows) == "integer")
> > >             i <- as.integer(i)
> > >         else
> > >             i <- pmatch(i, rows, duplicates.ok = TRUE)
> > >     }
> > >     ...
> > >
> > > then you get a huge boost:
> > >
> > >   - with current [.data.frame
> > >     > system.time(for (i in 1:100) dat["1", ])
> > >        user  system elapsed
> > >      34.994   1.084  37.915
> > >
> > >   - with "patched" [.data.frame
> > >     > system.time(for (i in 1:100) dat["1", ])
> > >        user  system elapsed
> > >       0.264   0.068   0.364
> > >
> >
> > mmmh, replacing
> >     i <- pmatch(i, rows, duplicates.ok = TRUE)
> > by just
> >     i <- as.integer(i)
> > was a bit naive. It will be wrong if rows is not a "seq_len" sequence.
> >
> > So I need to be more careful by first calling 'match' to find the exact
> > matches and then calling 'pmatch' _only_ on those indices that don't have
> > an exact match. For example, something like this:
> >
> >     if (is.character(i)) {
> >         rows <- attr(xx, "row.names")
> >         if (typeof(rows) == "integer") {
> >             i2 <- match(as.integer(i), rows)
> >             if (any(is.na(i2)))
> >                 i2[is.na(i2)] <- pmatch(i[is.na(i2)], rows,
> >                                         duplicates.ok = TRUE)
> >             i <- i2
> >         } else {
> >             i <- pmatch(i, rows, duplicates.ok = TRUE)
> >         }
> >     }
> >
> > Correctness:
> >
> >   > dat2 <- data.frame(aa=c('a', 'b', 'c', 'd'), bb=1:4,
> >                        row.names=c(11,25,1,3))
> >   > dat2
> >      aa bb
> >   11  a  1
> >   25  b  2
> >   1   c  3
> >   3   d  4
> >
> >   > dat2["1",]
> >     aa bb
> >   1  c  3
> >
> >   > dat2["3",]
> >     aa bb
> >   3  d  4
> >
> >   > dat2["2",]
> >      aa bb
> >   25  b  2
> >
> > Performance:
> >
> >   > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
> >   > dat <- as.data.frame(mat)
> >   > system.time(for (i in 1:100) dat["1", ])
> >      user  system elapsed
> >     2.036   0.880   2.917
> >
> > Still 17 times faster than with non-patched [.data.frame.
> >
> > Maybe 'pmatch(x, table, ...)' itself could be improved to be
> > more efficient when 'x' is a character vector and 'table' an
> > integer vector so the above trick is not needed anymore.
> >
> > My point is that something can probably be done to improve the
> > performance of 'dat[i, ]' when the row names are integer and 'i'
> > a character vector. I'm assuming that, in the typical use-case,
> > there is an exact match for 'i' in the row names so converting
> > those row names to a character vector in order to find this match
> > is (most of the time) a waste of time.
> 
> But why bother?  If you know the index of the row, why not index with
> a numeric vector rather than a string?  The behaviour in that case
> seems obvious and fast.

Because if I want to access a given row by its key (row name) then I _must_
use a string:

  > dat=data.frame(aa=letters[1:6], bb=1:6,
                   row.names=as.integer(c(51, 52, 11, 25, 1, 3)))

  > dat
     aa bb
  51  a  1
  52  b  2
  11  c  3
  25  d  4
  1   e  5
  3   f  6

If my key is "1":

  > dat["1", ]
    aa bb
  1  e  5

OK

I can't use a numeric index:

  > dat[1, ]
     aa bb
  51  a  1

Not what I want!

With a big data frame (e.g. 10**6 rows), every time I do 'dat["1", ]'
I'm charged the price of coercing the 10**6-element integer vector of
row names to a character vector. A very high (and unreasonable) price
that could easily be avoided.
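
A rough way to see where the time goes, assuming the row names are stored
as an integer vector: 'pmatch' converts its whole table to character before
matching, while 'match' on an integer key does not. The vector 'rows' below
is a made-up stand-in for the row names of a big data frame, and the timings
will differ from machine to machine.

```r
# 'rows' is a hypothetical stand-in for attr(dat, "row.names") on a big data frame.
set.seed(1)
n <- 1000000L
rows <- sample.int(n)    # integer row names in arbitrary order

# What [.data.frame does for a character index: 'rows' is coerced to character.
t_pmatch <- system.time(i_pmatch <- pmatch("1", rows, duplicates.ok = TRUE))

# Exact lookup on the integer values, with no coercion of 'rows'.
t_match <- system.time(i_match <- match(1L, rows))

# Both calls find the same position; only the cost differs.
stopifnot(identical(i_pmatch, i_match))
print(t_pmatch)
print(t_match)
```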

You could argue that I can still work around this by extracting
'attr(dat, "row.names")' myself, checking its mode, and then, if its
mode is integer, using 'match' to find the position (i2) of my key in
the row.names, and finally calling 'dat[i2, ]'. Is it unreasonable to
expect [.data.frame to do that for me?
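
For the record, that workaround could look roughly like the sketch below.
'row_by_key' is a name I made up for illustration, not something in base R,
and it assumes the keys are plain integer strings such as "1" or "25".

```r
# Hypothetical helper (not part of base R): fetch rows by row-name key
# without coercing integer row names to character.
row_by_key <- function(df, key) {
    rows <- attr(df, "row.names")
    if (is.integer(rows)) {
        # Convert the (short) key instead of the (long) row names.
        k <- suppressWarnings(as.integer(key))
        # Reject keys that do not round-trip, e.g. "01", "1.5" or "abc".
        k[is.na(k) | as.character(k) != key] <- NA_integer_
        i2 <- match(k, rows)
    } else {
        i2 <- match(key, rows)
    }
    df[i2, , drop = FALSE]
}

dat <- data.frame(aa = letters[1:6], bb = 1:6,
                  row.names = as.integer(c(51, 52, 11, 25, 1, 3)))
res <- row_by_key(dat, "1")
print(res)
```

Note that this does exact matching only: a key with no exact match gives a
row of NAs instead of falling back to partial matching the way
'dat[key, ]' does.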

Cheers,
H.

> 
> Hadley
>