[R] Performance of 'by' and 'ddply' on a large data frame

Tahir Butt tahir.butt at gmail.com
Fri Nov 20 21:04:07 CET 2009


A faster solution using tapply was sent to me via email:

testtapply <- function(p){
   df <- randomdf(p)
   system.time({
      # tapply() simplifies its result into a plain array, which drops
      # the Date class, so convert back via the 1970-01-01 origin;
      # indexing by as.character(df$x1) then maps each row's group to
      # its minimum through the names of the tapply result
      res <- tapply(df$x2, df$x1, min)
      res <- as.Date(res, origin = as.Date('1970-01-01'))
      df$mindate <- res[as.character(df$x1)]
   })
}
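
For what it's worth, base R's ave() looks like it should do the same
thing in one step while keeping the Date class intact. A sketch only,
not benchmarked here, so treat any speed expectation with care:

testave <- function(p){
   df <- randomdf(p)
   # ave() splits x2 by x1, applies min() within each group, and
   # writes the group minimum back onto every row; unlike tapply(),
   # it preserves the Date class, so no origin conversion is needed
   system.time(df$mindate <- ave(df$x2, df$x1, FUN = min))
}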

Thanks Phil!

Tahir

On Thu, Nov 19, 2009 at 5:19 PM, Tahir Butt <tahir.butt at gmail.com> wrote:
> I've only recently started using R. One of the problems I run into
> is that after extracting a large dataset (>5M rows) out of a
> database, I realize I need another variable. In this case I have a
> data frame with dates. I want to find the minimum date for each
> value of x1 and add that minimum date to my data frame.
>
>> randomdf <- function(p) {
>    # 10^p rows: x1 has ~10^4 possible group values, x2 is a date
>    # drawn from roughly the last three years, y1 is a filler column
>    data.frame(x1 = sample(1:10^4, 10^p, replace = TRUE),
>               x2 = sample(seq.Date(Sys.Date() - 365*3, Sys.Date(),
>                                    by = "day"), 10^p, replace = TRUE),
>               y1 = sample(1:100, 10^p, replace = TRUE))
> }
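
(A small aside from me: since randomdf() draws random data, putting a
set.seed() call before each run would make the timings below
reproducible, e.g.:

set.seed(42)        # any fixed seed will do
str(randomdf(3))    # a data.frame of 10^3 rows and 3 columns

The numbers below were taken without one, so expect some run-to-run
jitter.)
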
>> testby <- function(p) {
>    df <- randomdf(p)
>    # group-wise minimum date via by()
>    system.time(by(df, df$x1, function(dfi) min(dfi$x2)))
> }
>> lapply(c(1,2,3,4,5), testby)
> [[1]]
>   user  system elapsed
>  0.006   0.000   0.006
>
> [[2]]
>   user  system elapsed
>  0.024   0.000   0.025
>
> [[3]]
>   user  system elapsed
>  0.233   0.000   0.234
>
> [[4]]
>   user  system elapsed
>  1.996   0.026   2.022
>
> [[5]]
>   user  system elapsed
> 11.030   0.000  11.032
>
> Strangely enough, and I'm not sure why, the result of by with the
> min function is not Date objects but integers representing days
> since an origin. Is there a min function that would return a date
> instead of an integer? Or is this a result of using by?
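
(A note from me on the question above: min() by itself does keep the
Date class; the integers typically appear when the per-group results
are simplified into a plain vector or array, which strips the class
attribute. A minimal demo of the effect:

d <- as.Date("2009-01-01")
min(d)                # still a Date: "2009-01-01"
unlist(list(min(d)))  # plain numeric 14245, i.e. days since 1970-01-01

This is also why the tapply() solution at the top has to convert back
with as.Date(..., origin=...).)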
>
> I also wanted to see how ddply compares.
>
>> testddply <- function(p) {
>    pdf <- randomdf(p)
>    # same group-wise minimum via plyr's ddply (needs library(plyr))
>    system.time(ddply(pdf, .(x1),
>                      function(df) data.frame(mindate = min(df$x2))))
> }
>> lapply(c(1,2,3,4,5), testddply)
> [[1]]
>   user  system elapsed
>  0.020   0.000   0.021
>
> [[2]]
>   user  system elapsed
>  0.119   0.000   0.119
>
> [[3]]
>   user  system elapsed
>  1.008   0.000   1.008
>
> [[4]]
>   user  system elapsed
>  8.425   0.001   8.428
>
> [[5]]
>   user  system elapsed
>  23.070   0.000  23.075
>
> Once the data frame gets above 1M rows, the timings get too long (on
> a previous run the user time went up to 8000s). This seems quite a
> bit slower than I expected. Maybe there's a better and faster way to
> add variables like this, derived by aggregation, to a data frame.
>
> Also, ddply seems to take twice as long as by. Are these two
> operations not equivalent?
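
(My reading of the gap, for what it's worth: the two calls compute the
same minima but do different amounts of work. by() splits and applies,
returning a list, while ddply() additionally assembles the per-group
results into a new data frame. Roughly, and this is only a sketch of
the idea rather than plyr's actual internals:

pieces <- split(df, df$x1)          # one sub-data.frame per group
mins <- lapply(pieces, function(d)
               data.frame(x1 = d$x1[1], mindate = min(d$x2)))
result <- do.call(rbind, mins)      # the step by() never pays for

With 10^4 groups, rbind-ing 10^4 one-row data frames is expensive,
which would explain much of the difference.)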
>
> Thanks,
> Tahir
>



