[R] aggregate / collapse big data frame efficiently

jim holtman jholtman at gmail.com
Tue Dec 25 19:20:49 CET 2012


According to the way that you have used 'aggregate', you are taking
the column means.  Couple of suggestions for faster processing:


1. use matrices instead of data.frames ( i converted your example just
before using it)
2, use the 'colMeans'

I created a 120 x 100000 matrix with 10 levels and its does the
computation in less than 2 seconds:


>  n <- 100000
> nLevels <- 10
> nRows <- 120
> Cols <- list(rep(list(sample(nRows)), n))
> df <- data.frame(levels = sample(nLevels, nRows, TRUE), Cols)
> colnames(df)[-1] <- paste0('col', 1:n)
>
> # convert to matrix for faster processing
> df.m <- as.matrix(df[, -1])  # remove levels column
> str(df.m)
 int [1:120, 1:100000] 111 13 106 61 16 39 25 94 53 38 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:100000] "col1" "col2" "col3" "col4" ...
> system.time({
+ # split the indices of rows for each level
+ x <- split(seq(nrow(df)), df$levels)
+ result <- sapply(x, function(a) colMeans(df.m[a, ]))
+ })
   user  system elapsed
   1.33    0.00    1.35
> str(result)
 num [1:100000, 1:10] 57 57 57 57 57 57 57 57 57 57 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:100000] "col1" "col2" "col3" "col4" ...
  ..$ : chr [1:10] "1" "2" "3" "4" ...
>


On Tue, Dec 25, 2012 at 11:34 AM, Martin Batholdy
<batholdy at googlemail.com> wrote:
> Hi,
>
>
> I need to aggregate rows of a data.frame by computing the mean for rows with the same factor-level on one factor-variable;
>
> here is the sample code:
>
>
> x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52))
>
> aggregate(x, list(x[,1]), mean)
>
>
> Now my problem is, that the actual data-set is much bigger (120 rows and approximately 100.000 columns) – and it takes very very long (actually at some point I just stopped it).
>
> Is there anything that can be done to make the aggregate routine more efficient?
> Or is there a different approach that would work faster?
>
>
> Thanks for any suggestions!
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.




More information about the R-help mailing list