[R] strata -- really slow performance

hadley wickham h.wickham at gmail.com
Mon Jul 13 06:56:59 CEST 2009


> In this simple example, it took less than half a second to generate the
> result. That is on a 2.93 Ghz MacBook Pro.
>
>
> So, for your data, the code would look something like this:
>
>
> system.time(DF.new <- do.call(rbind,
>                              lapply(split(patch_summary, patch_summary$UniqueID),
>                                     function(x) x[sample(nrow(x), 1), ])))

For large data, you can make it even faster with ddply from the plyr package:

sample_rows <- function(df, n) {
  df[sample(nrow(df), n), ]
}

library(plyr)
system.time(DF.new <- ddply(DF, "ID", sample_rows, n = 1))

ddply uses some tricks to avoid copying DF, which really makes a
difference for large data. Unfortunately, those tricks also increase
the overhead, so it is currently slower for small data.
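
If you want to check the trade-off on your own machine, here is a rough
sketch that times both approaches on made-up data. The toy contents of
DF and the helper sample_rows_safe are assumptions for illustration
only; the helper differs from sample_rows above in that it caps n at
the group size, so it does not fail when a group has fewer than n rows.
Which approach wins will depend on the number of rows and groups.

```r
# Toy data: the names DF and ID follow the ddply call above;
# the contents are invented purely for illustration.
set.seed(1)
DF <- data.frame(ID = sample(1000, 20000, replace = TRUE),
                 value = rnorm(20000))

# Hypothetical variant of sample_rows that never requests more rows
# than the group contains.
sample_rows_safe <- function(df, n) {
  df[sample(nrow(df), min(n, nrow(df))), , drop = FALSE]
}

# Base R split/lapply/rbind approach, as in the quoted message.
base_time <- system.time(
  DF.base <- do.call(rbind,
                     lapply(split(DF, DF$ID), sample_rows_safe, n = 1))
)

# plyr approach, run only if the package is installed.
if (requireNamespace("plyr", quietly = TRUE)) {
  plyr_time <- system.time(
    DF.plyr <- plyr::ddply(DF, "ID", sample_rows_safe, n = 1)
  )
  # Show user, system and elapsed times side by side.
  print(rbind(base = base_time, plyr = plyr_time)[, 1:3])
}
```

Either way, the result should contain exactly one row per distinct ID.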

Hadley


-- 
http://had.co.nz/



