[R] aggregate() runs out of memory

William Dunlap wdunlap at tibco.com
Fri Sep 14 22:22:51 CEST 2012


Using data.table will probably speed lots of things up, but also note that
  aggregate(x, FUN=length, by)$x
is a slow way to compute
  table(by)
.
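
A minimal sketch of the equivalence, on a toy stand-in for Z (the column names V1 and V2 follow the original question; the values and the helper names agg_counts, tab_counts and size_dist are made up for illustration). It assumes the grouping column is a character vector with no NAs, as in the original post; with factors that have unused levels, table() would also report zero counts that aggregate() drops.

```r
# Toy stand-in for Z; values are invented for illustration only.
Z <- data.frame(V1 = c("a", "b", "c", "d", "e", "f"),
                V2 = c("x", "y", "x", "z", "x", "y"),
                stringsAsFactors = FALSE)

# Slow route: count rows per id with aggregate().
agg_counts <- aggregate(Z$V1, FUN = length, by = list(id = Z$V2))$x

# Fast route: table() counts rows per id directly.
tab_counts <- table(Z$V2)

# Both routes give the same per-id counts, in the same sorted order.
stopifnot(all(agg_counts == as.vector(tab_counts)))

# Distribution of group sizes, i.e. what the original
# table(aggregate(...)$x) call was after.
size_dist <- table(tab_counts)
print(size_dist)
```

On the real data this would presumably reduce to table(table(Z$V2)), which never builds the intermediate aggregate() result.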

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of Steve Lianoglou
> Sent: Friday, September 14, 2012 12:41 PM
> To: sds at gnu.org; r-help at r-project.org
> Subject: Re: [R] aggregate() runs out of memory
> 
> Hi,
> 
> On Fri, Sep 14, 2012 at 3:26 PM, Sam Steingold <sds at gnu.org> wrote:
> > I have a large data.frame Z (2,424,185,944 bytes, 10,256,441 rows, 17 columns).
> > I want to get the result of
> > table(aggregate(Z$V1, FUN = length, by = list(id=Z$V2))$x)
> > alas, aggregate has been running for ~30 minutes, RSS is 14G, VIRT is
> > 24.3G, and no end in sight.
> > both V1 and V2 are characters (not factors).
> > Is there anything I could do to speed this up?
> > Thanks.
> 
> You might find you'll get a lot of mileage out of data.table when
> working with such large data.frames ...
> 
> To get something close to what you're after, you can try:
> 
> R> library(data.table)
> R> Z <- as.data.table(Z)
> R> setkeyv(Z, 'V2')
> R> agg <- Z[, list(count=.N), by='V2']
> 
> From here you might run:
> 
> R> tab1 <- table(agg$count)
> 
> I think that'll get you where you want to be ... I'm ashamed to say
> that I haven't really done much w/ aggregate since I mostly have used
> plyr and data.table like stuff, so I might be missing your end goal --
> a small reproducible example with a toy data.frame would help here
> (for me at least).
> 
> HTH,
> -steve
> 
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



