[R] Collapsing data frame; aggregate() or better function?

Fri Sep 14 20:49:21 CEST 2007

Here is a way that I have used when the data get big.  The 'trick' is
to create the key (in your case concatenating columns 1:8) and then
creating a list of indices (row numbers) of the dataframe that
correspond to the grouping (using split).  The you have the lapply
operate on the list of indices and index into the data to perform the
operations.

I created some test data of your size and here was the result:

> row <- 16468
> col <- 217
> x <- matrix(sample(1:4, row * col, TRUE), row, col)
> x.df <- as.data.frame(x)
> # create the indices by concatenating the fields
> y <- do.call('paste', x.df[1:8])
> z <- split(seq(nrow(x.df)), y)  # create a list of the indices
> system.time({
+ ans <- lapply(z, function(.rows){
+     colSums(x.df[.rows, c(11,12,17:217)])
+ })
+ })
   user  system elapsed
 147.57    1.15  197.42
>
> # combine back into a dataframe
> ans <- do.call('rbind', ans)
> ans[1:10, 1:7]
                V11 V12 V17 V18 V19 V20 V21
1 1 1 1 1 1 2 1   1   4   2   2   2   3   3
1 1 1 1 1 1 2 3   4   4   4   2   1   2   2
1 1 1 1 1 1 3 1   1   3   2   2   1   3   2
1 1 1 1 1 1 4 2   1   1   2   4   4   4   2
1 1 1 1 1 1 4 3   1   4   4   3   3   2   4
1 1 1 1 1 2 4 1   2   3   2   1   3   4   1
1 1 1 1 1 2 4 2   2   2   4   3   4   4   3
1 1 1 1 1 2 4 4   2   4   4   3   2   2   3
1 1 1 1 1 3 4 1   3   3   4   4   1   2   1
1 1 1 1 1 3 4 3   1   3   4   3   1   3   2
>

Printed out the first couple of rows.  The row labels are the
concatented values.

On 9/14/07, Tobin, Jared <TobinJR at dfo-mpo.gc.ca> wrote:
> Thanks for the quick reply Jim.
>
> I haven't had any success when I whittle down 'by' list even further
> though.  I believe I'm using the right command, but now it's just a
> matter of clear memory issues.
>
> > test <- aggregate(lf1.turbot[,17:217], list(lf1.turbot$vessel,
> lf1.turbot$trip, lf1.turbot$set), sum)
> Error: cannot allocate vector of size 237.4 Mb In addition: Warning
> messages:
> 1: Reached total allocation of 734Mb: see help(memory.size)
> 2: Reached total allocation of 734Mb: see help(memory.size)
> 3: Reached total allocation of 734Mb: see help(memory.size)
> 4: Reached total allocation of 734Mb: see help(memory.size)
>
> A fellow kindly emailed me directly and suggested trying Wickham's
> 'reshape' package, but again when using the melt() function in that
> package I run into memory problems.  A colleague suggested I 'create
> factors using as.factor() and feed this directly into the appropriate
> apply function', but I've had no success with this when using tapply().
>
> Any suggestions as to a less memory-intensive procedure would be greatly
> appreciated.
>
> Thanks,
>
> --
>
> jared tobin, student research assistant
> fisheries and oceans canada
> tobinjr at dfo-mpo.gc.ca
>
> -----Original Message-----
> From: jim holtman [mailto:jholtman at gmail.com]
> Sent: Thursday, September 13, 2007 6:49 PM
> To: Tobin, Jared
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] Collapsing data frame; aggregate() or better function?
>
> The second argument for aggregate is supposed to be a list, so try
> (notice the missing comma before "1:8"):
>
> test <- aggregate(lf1.turbot[,c(11, 12, 17:217)], lf1.turbot[1:8],sum)
>
>
> On 9/13/07, Tobin, Jared <TobinJR at dfo-mpo.gc.ca> wrote:
> > Hello r-help,
> >
> > I am trying to collapse or aggregate 'some' of a data frame.  A very
> > simplified version of my data frame looks like:
> >
> > > tester
> >  trip set num sex lfs1 lfs2
> > 1  313  15   5   M    2    3
> > 2  313  15   3   F    1    2
> > 3  313  17   1   M    0    1
> > 4  313  17   2   F    1    1
> > 5  313  17   1   U    1    0
> >
> > And I want to omit sex from the picture and just get an addition of
> > num, lfs1, and lfs2 for each unique trip/set combination.  Using
> > aggregate() works fine here,
> >
> > > test <- aggregate(tester[,c(3,5:6)], tester[,1:2], sum) test
> >  trip set num lfs1 lfs2
> > 1  313  15   8    3    5
> > 2  313  17   4    2    2
> >
> > But I'm having trouble getting the same function to work on my actual
> > data frame which is considerably larger.
> >
> > > dim(lf1.turbot)
> > [1] 16468   217
> > > test <- aggregate(lf1.turbot[,c(11, 12, 17:217)], lf1.turbot[,1:8],
> > sum)
> > Error in vector("list", prod(extent)) : vector size specified is too
> > large In addition: Warning messages:
> > 1: NAs produced by integer overflow in: ngroup * (as.integer(index) -
> > one)
> > 2: NAs produced by integer overflow in: group + ngroup *
> > (as.integer(index) - one)
> > 3: NAs produced by integer overflow in: ngroup * nlevels(index)
> >
> > I'm guessing that either aggregate() can't handle a data frame of this
>
> > size OR that there is an issue with 'omitting' more than one variable
> > (in the same way I've omitted sex in the above example).  Can anyone
> > clarify and/or recommend any relatively simple alternative procedure
> > to accomplish this?
> >
> > I plan on trying variants of by() and tapply() tomorrow morning, but
> > I'm about to head home for the day.
> >
> > Thanks,
> >
> > --
> >
> > jared tobin, student research assistant fisheries and oceans canada
> > tobinjr at dfo-mpo.gc.ca
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem you are trying to solve?
>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?