[R] Using plyr::dply more (memory) efficiently?

Thu Apr 29 17:46:17 CEST 2010

"Steve Lianoglou" <mailinglist.honeypot at gmail.com> wrote in message 
news:t2ybbdc7ed01004290812n433515b5vb15b49c170f5a353 at mail.gmail.com...

> Thanks for directing me to the data.table package. I read through some
> of the vignettes, and it looks quite nice.
>
> While your sample code would provide an answer if I wanted to just
> compute some summary statistic/function of groups of my data.frame
> (using `by=symbol`), what's the best way to produce several pieces of
> info per subset?
>
> For instance, I see that I can do something like this:
>
> summaries[, list(counts=sum(counts), width=sum(exon.width)), by=symbol]

Yes, that's it.

> But what if I need to do some more complex processing within the
> subsets defined in `by=symbol` -- like several lines of programming
> logic for 1 result, say.
>
> I guess I can open a new block that just returns a data.table? Like:
>
> summaries[, {
>  cnts <- sum(counts)
>  ew <- sum(exon.width)
>  # ... some complex things
>  complex <- # .. result of complex things
>  data.table(counts=cnts, width=ew, cplx=complex)
>}, by=symbol]
>
> Is that right? (I mean, it looks like it's working, but maybe there's
> a more idiomatic way(?))

Yes, you got it. Rather than building a data.table at the end, though, just 
return a list; it's faster.
Shorter vectors will still be recycled to match any longer ones.
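
As a minimal sketch of that block form returning a list: the toy `summaries` 
table below is made up for illustration (only the column names come from this 
thread), and `cnts / ew` is just a stand-in for the "complex things". The 
second call shows a length-1 value being recycled against a longer vector 
within each group.

```r
library(data.table)

# Hypothetical toy data standing in for the real `summaries` table
summaries <- data.table(
    symbol     = c("A", "A", "B", "B", "B"),
    counts     = c(10, 20, 1, 2, 3),
    exon.width = c(100, 200, 50, 60, 70)
)

# Several lines of logic per group, returning a plain list at the end
res <- summaries[, {
    cnts <- sum(counts)
    ew   <- sum(exon.width)
    cplx <- cnts / ew   # placeholder for the real complex computation
    list(counts = cnts, width = ew, cplx = cplx)
}, by = symbol]
print(res)

# A length-1 item (total) is recycled to match the longer one (each)
rec <- summaries[, list(total = sum(counts), each = counts), by = symbol]
print(rec)
```

Here `res` has one row per symbol, while `rec` keeps one row per input row 
with the group total repeated alongside it.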

Or just this:

summaries[, list(
    counts = sum(counts),
    width = sum(exon.width),
    cplx = # .. result of complex things
), by=symbol]

Sounds like it's working, but could you give us an idea of whether it is quick 
and memory efficient?
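
One rough way to get a feel for that, sketched on simulated data since I 
don't have your table: the sizes, gene names and distributions below are all 
assumptions, so substitute the real `summaries`. It times the list-returning 
form against the data.table-returning form with `system.time` and reports the 
size of the input with `object.size`. It does not measure peak memory during 
the grouping itself.

```r
library(data.table)

set.seed(1)
n <- 1e5  # assumed size; adjust to resemble the real data

# Simulated stand-in for `summaries` (hypothetical values)
summaries <- data.table(
    symbol     = sample(sprintf("gene%04d", 1:2000), n, replace = TRUE),
    counts     = rpois(n, 5),
    exon.width = sample(50:500, n, replace = TRUE)
)

# Returning a list from j
t_list <- system.time(
    r_list <- summaries[, list(counts = sum(counts),
                               width  = sum(exon.width)), by = symbol]
)

# Returning a data.table from j
t_dt <- system.time(
    r_dt <- summaries[, data.table(counts = sum(counts),
                                   width  = sum(exon.width)), by = symbol]
)

# Elapsed seconds for each form
print(c(list = t_list[["elapsed"]], data.table = t_dt[["elapsed"]]))

# Size of the input table in memory
print(format(object.size(summaries), units = "Mb"))
```

Timings will vary with the machine and the number of groups, so treat the 
numbers as indicative only.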