[R] how to make aggregation in R ?

Fri Mar 20 02:40:00 CET 2009

Try this technique.  I use it with large data objects since it is
sometime faster, and uses less memory, by using indices:

x <- read.table(textConnection("  v1 v2 n1 n2
1   a a1  1 21
2   a a1  2 22
3   a a1  3 23
4   a a2  4 24
5   a a3  5 25
6   b b1  6 26
7   b b1  7 27
8   b b2  8 28
9   b b2  9 29
10  b b2 10 30
11  c c1 11 31
12  c c2 12 32
13  c c2 13 33
14  c c2 14 34
15  c c3 15 35
16  d d1 16 36
17  d d2 17 37
18  d d3 18 38
19  d d4 19 39
20  d d4 20 40"), header=TRUE)
closeAllConnections()
# use indices to reduce memory
x.ind <- split(seq(nrow(x)), list(x$v1, x$v2), drop=TRUE)
# now aggregate using the indices
x.agg <- do.call(rbind, lapply(x.ind, function(.seg){
    data.frame(v1=x$v1[.seg[1]], v2=x$v2[.seg[1]],
        n1=sum(x$n1[.seg]), n2=sum(x$n2[.seg]))
}))

On Thu, Mar 19, 2009 at 9:09 PM, Ferry <fmi.mlist at gmail.com> wrote:
> Hi,
>
> I am trying to aggregate the sum of my test data.frame as follow:
>
> testDF <- data.frame(v1 = c("a", "a", "a", "a", "a", "b", "b", "b", "b",
> "b", "c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
>                     v2 = c("a1", "a1", "a1", "a2", "a3", "b1", "b1", "b2",
> "b2", "b2", "c1", "c2", "c2", "c2", "c3", "d1", "d2", "d3", "d4", "d4"),
>                     n1 = 1:20,
>                     n2 = 21:40 )
>
> testDF <- orderBy( ~ v1+v2, data = testDF)
> rownames(testDF) <- NULL
>
>> testDF
>   v1 v2 n1 n2
> 1   a a1  1 21
> 2   a a1  2 22
> 3   a a1  3 23
> 4   a a2  4 24
> 5   a a3  5 25
> 6   b b1  6 26
> 7   b b1  7 27
> 8   b b2  8 28
> 9   b b2  9 29
> 10  b b2 10 30
> 11  c c1 11 31
> 12  c c2 12 32
> 13  c c2 13 33
> 14  c c2 14 34
> 15  c c3 15 35
> 16  d d1 16 36
> 17  d d2 17 37
> 18  d d3 18 38
> 19  d d4 19 39
> 20  d d4 20 40
>>
>
> testDF.result <- aggregate(list(testDF$n1, testDF$n2), by = list(testDF$v1,
> testDF$v2), FUN = sum)
>
>> testDF.result
>   Group.1 Group.2 X1.20 X21.40
> 1        a      a1     6     66
> 2        a      a2     4     24
> 3        a      a3     5     25
> 4        b      b1    13     53
> 5        b      b2    27     87
> 6        c      c1    11     31
> 7        c      c2    39     99
> 8        c      c3    15     35
> 9        d      d1    16     36
> 10       d      d2    17     37
> 11       d      d3    18     38
> 12       d      d4    39     79
>>
>
>
> However, when I applied it to my real data, it failed. It seems that
> aggregate require more memory that I have currently (I am using WinXP,
> R2.8.0, 2GB RAM).
>
> Basically I want to perform aggregate sum on my numeric fields (in the above
> case, n1 and n2) based on condition of v1 and v2.
>
> Problem is, I have a lot more of than just two numerics and conditioning
> fields.
>
> In SQL, I would do:
> select v1, v2, sum(n1), sum(n2) from myData
> group by v1, v2;
>
> Am I using a wrong function / library (or even wrong approach)? If so, can
> you suggest which one?
>
> Any pointer is really appreciated.
>
> Thanks,
>
> Ferry
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?