[Rd] extracting rows from a data frame by looping over the row names: performance issues

hpages at fhcrc.org
Sat Mar 3 09:22:23 CET 2007


Hi Seth,

Quoting Seth Falcon <sfalcon at fhcrc.org>:

> Herve Pages <hpages at fhcrc.org> writes:
> > So apparently here extracting with dat[i, ] is 300 times faster than
> > extracting with dat[key, ] !
> >
> >> system.time(for (i in 1:100) dat["1", ])
> >    user  system elapsed
> >  12.680   0.396  13.075
> >
> >> system.time(for (i in 1:100) dat[1, ])
> >    user  system elapsed
> >   0.060   0.076   0.137
> >
> > Good to know!
> 
> I think what you are seeing here has to do with the space efficient
> storage of row.names of a data.frame.  The example data you are
> working with has no specified row names and so they get stored in a
> compact fashion:
> 
>     mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
>     dat <- as.data.frame(mat)
>     
>     > typeof(attr(dat, "row.names"))
>     [1] "integer"
> 
> In the call to [.data.frame when i is character, the appropriate index
> is found using pmatch and this requires that the row names be
> converted to character.  So in a loop, you get to convert the integer
> vector to character vector at each iteration.

Maybe this could be avoided. Why do you need to call pmatch when
the row names are integer?
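
For reference, here is a minimal sketch of what pmatch does with an
integer table, outside of [.data.frame. The vectors rows and odd_rows
are made up for illustration; only the pmatch call itself comes from the
code under discussion.

```r
# pmatch() matches character strings, so an integer table is coerced
# to character before any matching happens.
rows <- 1:5                      # stand-in for automatic integer row names
a <- pmatch("2", rows, duplicates.ok = TRUE)
b <- pmatch("2", as.character(rows), duplicates.ok = TRUE)
stopifnot(identical(a, b), identical(a, 2L))

# pmatch() also accepts a unique partial match when there is no exact one.
odd_rows <- c(10L, 25L)          # hypothetical non-sequential integer names
p <- pmatch("1", odd_rows, duplicates.ok = TRUE)   # "1" is a prefix of "10" only
stopifnot(identical(p, 1L))
```

The coercion of the whole table is the part that gets repeated at every
iteration of the loop when the row names are stored as integers.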

In [.data.frame if you replace this:

    ...
    if (is.character(i)) {
        rows <- attr(xx, "row.names")
        i <- pmatch(i, rows, duplicates.ok = TRUE)
    }
    ...

with this:

    ...
    if (is.character(i)) {
        rows <- attr(xx, "row.names")
        if (typeof(rows) == "integer")
            i <- as.integer(i)
        else
            i <- pmatch(i, rows, duplicates.ok = TRUE)
    }
    ...

then you get a huge boost:

  - with current [.data.frame
    > system.time(for (i in 1:100) dat["1", ])
       user  system elapsed
     34.994   1.084  37.915

  - with "patched" [.data.frame
    > system.time(for (i in 1:100) dat["1", ])
       user  system elapsed
      0.264   0.068   0.364

but maybe I'm missing something...
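
One way to check is to compare the two branches directly. The sketch
below is only an illustration: lookup_current and lookup_patched are
hypothetical helpers that mirror the two code fragments above, not the
real [.data.frame, and sub_rows assumes integer row names that are not
simply 1..n.

```r
# Hypothetical helpers mirroring the two branches shown above.
lookup_current <- function(key, rows) pmatch(key, rows, duplicates.ok = TRUE)
lookup_patched <- function(key, rows) {
    if (typeof(rows) == "integer")
        as.integer(key)
    else
        pmatch(key, rows, duplicates.ok = TRUE)
}

# Automatic row names 1..n: both branches return the same index.
auto_rows <- seq_len(300000L)
stopifnot(identical(lookup_current("1", auto_rows),
                    lookup_patched("1", auto_rows)))

# Assumption: integer row names that are not 1..n.
sub_rows <- c(10L, 20L, 30L)
cur <- lookup_current("20", sub_rows)   # position of the name "20"
pat <- lookup_patched("20", sub_rows)   # the number 20 itself
stopifnot(identical(cur, 2L), identical(pat, 20L))
```

So the shortcut agrees with pmatch when the integer row names are
exactly 1..n, but it would presumably need an extra check for any other
integer row names.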

Cheers,
H.

> 
> If you assign character row names, things will be a bit faster:
> 
>     # before
>     system.time(for (i in 1:25) dat["2", ])
>        user  system elapsed 
>       9.337   0.404  10.731 
>     
>     # this looks funny, but has the desired result
>     rownames(dat) <- rownames(dat)
>     typeof(attr(dat, "row.names"))
>     
>     # after
>     system.time(for (i in 1:25) dat["2", ])
>        user  system elapsed 
>       0.343   0.226   0.608 
> 
> And you probably would have seen this if you had looked at the
> profiling data:
> 
>     Rprof()
>     for (i in 1:25) dat["2", ]
>     Rprof(NULL)
>     summaryRprof()
> 
> 
> + seth
> 