[R] aggregate() runs out of memory

Sam Steingold sds at gnu.org
Mon Nov 26 22:57:52 CET 2012


hi Steve,

> * Steve Lianoglou <znvyvatyvfg.ubarlcbg at tznvy.pbz> [2012-11-26 16:08:59 -0500]:
> On Mon, Nov 26, 2012 at 3:13 PM, Sam Steingold <sds at gnu.org> wrote:
>>> * Steve Lianoglou <znvyvatyvfg.ubarlcbg at tznvy.pbz> [2012-11-19 13:30:03 -0800]:
>>>
>>> For instance, if you want the min and max of `delay` within each group
>>> defined by `share.id`, and let's assume `infl` is a data.frame, you
>>> can do something like so:
>>>
>>> R> infl <- as.data.table(infl)
>>> R> setkey(infl, share.id)
>>> R> result <- infl[, list(min=min(delay), max=max(delay)), by="share.id"]
>>
>> perfect, thanks.
>> alas, the resulting table does not contain the share.id column.
>> do I need to add something like "id=unique(share.id)" to the list?
>> also, if there is a field in the original table infl which only depends
>> on share.id, how do I add this unique value to the summary?
>> it appears that "count=unique(country)" in list() does what I need, but
>> it slows down the process.
>
> Hmm ... I think it should be there, but I'm having a hard time
> remembering what you want.
>
> Could you please copy paste the output of `(head(infl, 20))` as
> well as an approximation of what the result is that you want.

this prints all the levels for all the factor columns and takes
megabytes.

--8<---------------cut here---------------start------------->8---
> f <- data.frame(id=rep(1:3,4),country=rep(6:8,4),delay=1:12)
> f
   id country delay
1   1       6     1
2   2       7     2
3   3       8     3
4   1       6     4
5   2       7     5
6   3       8     6
7   1       6     7
8   2       7     8
9   3       8     9
10  1       6    10
11  2       7    11
12  3       8    12
> f <- as.data.table(f)
> setkey(f,id)
> delays <- f[,list(min=min(delay),max=max(delay),count=.N,country=unique(country)),by="id"]
> delays
   id min max count country
1:  1   1  10     4       6
2:  2   2  11     4       7
3:  3   3  12     4       8
--8<---------------cut here---------------end--------------->8---

this is still too slow, apparently because of unique.
how do I speed it up?
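
two possible ways to avoid the per-group unique() call, sketched on the toy
table above. both assume that country really is constant within each id
(true for the toy data, not checked here for the real infl), and neither
has been timed on the real table, so any speedup is a guess. the names
delays1 and delays2 are made up for this illustration.

```r
library(data.table)

# same toy data as above
f <- data.table(id = rep(1:3, 4), country = rep(6:8, 4), delay = 1:12)
setkey(f, id)

# option 1: take the first country in each group instead of unique();
# only correct under the assumption that country is constant within id
delays1 <- f[, list(min = min(delay), max = max(delay), count = .N,
                    country = country[1L]),
             by = "id"]

# option 2: make country part of the grouping; if the assumption were
# violated, this would produce more than one row per id instead of
# silently picking a value
delays2 <- f[, list(min = min(delay), max = max(delay), count = .N),
             by = list(id, country)]
```

on the toy data both give one row per id with the same min, max, count and
country values as the unique() version; option 2 just places the country
column right after id.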

Thanks.

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://iris.org.il
http://ffii.org http://pmw.org.il http://mideasttruth.com
Programming is like sex: one mistake and you have to support it for a lifetime.



