[R] aggregate() runs out of memory

David Winsemius dwinsemius at comcast.net
Mon Nov 19 23:30:57 CET 2012


On Nov 19, 2012, at 1:25 PM, Sam Steingold wrote:

> Thanks Steve,
> what is the analogue of .N for min and max?

?seq

> i.e., what is the data.table's version of
> aggregate(infl$delay,by=list(infl$share.id),FUN=min)

> aggregate(infl$delay,by=list(infl$share.id),FUN=max)

> DT[, list( max(v)), by=x]
   x V1
1: a  3
2: b  6
3: c  9
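
The same pattern extends to several summaries at once. Below is a minimal sketch, assuming a toy table with made-up values standing in for the poster's `infl` data (the result column names `n`, `min.delay` and `max.delay` are illustrative, not from the original post). It computes the count, minimum and maximum per group in one call, and runs the `aggregate()` form of the minimum for comparison.

```r
library(data.table)

# Toy stand-in for the poster's `infl` data; the values are made up.
infl <- data.table(
  share.id = c("a", "a", "b", "b", "b", "c"),
  delay    = c(5, 2, 7, 1, 4, 9)
)

# Count, minimum and maximum of delay for each share.id, in a single call.
res <- infl[, list(n = .N, min.delay = min(delay), max.delay = max(delay)),
            by = share.id]
print(res)

# The aggregate() form from the question, for comparison (minimum only).
agg <- aggregate(infl$delay, by = list(infl$share.id), FUN = min)
print(agg)
```

Any function that reduces a vector to a single value can be used inside `list()` in the same way; `.N` is special only in that it needs no column argument.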


> thanks!
> Sam.
> 
> On Fri, Sep 14, 2012 at 3:40 PM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
>> Hi,
>> 
>> On Fri, Sep 14, 2012 at 3:26 PM, Sam Steingold <sds at gnu.org> wrote:
>>> I have a large data.frame Z (2,424,185,944 bytes, 10,256,441 rows, 17 columns).
>>> I want to get the result of
>>> table(aggregate(Z$V1, FUN = length, by = list(id=Z$V2))$x)
>>> alas, aggregate has been running for ~30 minutes, RSS is 14G, VIRT is
>>> 24.3G, and no end in sight.
>>> both V1 and V2 are characters (not factors).
>>> Is there anything I could do to speed this up?
>>> Thanks.
>> 
>> You might find you'll get a lot of mileage out of data.table when
>> working with such large data.frames ...
>> 
>> To get something close to what you're after, you can try:
>> 
>> R> library(data.table)
>> R> Z <- as.data.table(Z)
>> R> setkeyv(Z, 'V2')
>> R> agg <- Z[, list(count=.N), by='V2']
>> 
>> From here you might do:
>> 
>> R> tab1 <- table(agg$count)
>> 
>> I think that'll get you where you want to be ... I'm ashamed to say
>> that I haven't really done much w/ aggregate since I mostly have used
>> plyr and data.table-like stuff, so I might be missing your end goal --
>> a reproducible example from you with a small data.frame would
>> help here (for me at least).
>> 
>> HTH,
>> -steve
>> 
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>> | Memorial Sloan-Kettering Cancer Center
>> | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
> 
> 
> 
> --
> Sam Steingold <http://sds.podval.org> <http://www.childpsy.net/>
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
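
The counting steps quoted above can be tried on a small table before running them on the full data. This is only a sketch: the contents of `Z` here are made up, with `V1` and `V2` as character columns as in the original question, and the base-R alternative at the end is an extra suggestion rather than something from the thread.

```r
library(data.table)

# Small made-up stand-in for Z; both columns are character, not factor.
Z <- data.table(
  V1 = letters[1:8],
  V2 = c("id1", "id2", "id1", "id3", "id1", "id2", "id4", "id4")
)

# Number of rows for each distinct V2 value.
agg <- Z[, list(count = .N), by = V2]

# How many ids occur once, twice, three times, ...
tab1 <- table(agg$count)
print(tab1)

# Base-R route that skips aggregate() altogether: tabulate the ids,
# then tabulate the resulting counts.
tab2 <- table(table(Z$V2))
print(tab2)
```

Both tables should agree; on data of the size described in the thread, either route is likely to be far lighter than `aggregate()` with `FUN = length`, though that is untested here.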

David Winsemius, MD
Alameda, CA, USA



