[R] Using plyr::ddply more (memory) efficiently?

Matthew Dowle mdowle at mdowle.plus.com
Thu Apr 29 15:52:56 CEST 2010


I don't know about that, but try this:

install.packages("data.table", repos="http://R-Forge.R-project.org")
require(data.table)
summaries = data.table(summaries)
summaries[,sum(counts),by=symbol]
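
If you need the same shape as your ddply result (grouped by transcript,
with symbol carried along), something like the following should work --
just a sketch, using the column names from your example:

summaries[, list(symbol=symbol[1], counts=sum(counts)), by=transcript]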

Please let us know if that returns the correct result, and whether its 
memory use and speed are acceptable.

Matthew

"Steve Lianoglou" <mailinglist.honeypot at gmail.com> wrote in message 
news:w2kbbdc7ed01004290606lc425e47cs95b36f6bf0ab3d at mail.gmail.com...
> Hi all,
>
> In short:
>
> I'm running ddply on an admittedly (somewhat) large data.frame (not
> that large, really). It runs fine until it reaches the "collating"
> part, where all the subsets of my data.frame have been summarized and
> are being reassembled into the final summary data.frame (sorry, I
> don't know the correct plyr terminology). During collation, my R
> workspace's RAM usage goes from about 1.5 GB up to 20 GB, at which
> point I kill it.
>
> Running a similar piece of code that iterates manually without ddply,
> using a combination of lapply and do.call(rbind, ...), uses
> considerably less RAM (it tops out at about 8 GB).
>
> How can I use ddply more efficiently?
>
> Longer:
>
> Here's more info:
>
> * The data.frame itself is ~15.8 MB when loaded.
> * ~400,000 rows, 8 columns
>
> It looks like this:
>
>    exon.start exon.width exon.width.unique exon.anno counts symbol transcript  chr
> 1        4225        468                 0       utr      0 WASH5P     WASH5P chr1
> 2        4833         69                 0       utr      1 WASH5P     WASH5P chr1
> 3        5659        152                38       utr      1 WASH5P     WASH5P chr1
> 4        6470        159                 0       utr      0 WASH5P     WASH5P chr1
> 5        6721        198                 0       utr      0 WASH5P     WASH5P chr1
> 6        7096        136                 0       utr      0 WASH5P     WASH5P chr1
> 7        7469        137                 0       utr      0 WASH5P     WASH5P chr1
> 8        7778        147                 0       utr      0 WASH5P     WASH5P chr1
> 9        8131         99                 0       utr      0 WASH5P     WASH5P chr1
> 10      14601        154                 0       utr      0 WASH5P     WASH5P chr1
> 11      19184         50                 0       utr      0 WASH5P     WASH5P chr1
> 12       4693        140                36    intron      2 WASH5P     WASH5P chr1
> 13       4902        757                36    intron      1 WASH5P     WASH5P chr1
> 14       5811        659               144    intron     47 WASH5P     WASH5P chr1
> 15       6629         92                21    intron      1 WASH5P     WASH5P chr1
> 16       6919        177                 0    intron      0 WASH5P     WASH5P chr1
> 17       7232        237                35    intron      2 WASH5P     WASH5P chr1
> 18       7606        172                 0    intron      0 WASH5P     WASH5P chr1
> 19       7925        206                 0    intron      0 WASH5P     WASH5P chr1
> 20       8230       6371               109    intron     67 WASH5P     WASH5P chr1
> 21      14755       4429                55    intron     12 WASH5P     WASH5P chr1
> ...
>
> I'm "ply"-ing over the "transcript" column and the function transforms
> each such subset of the data.frame into a new data.frame that is just
> 1 row / transcript that basically has the sum of the "counts" for each
> transcript.
>
> The code would look something like this (`summaries` is the data.frame
> I'm referring to):
>
> rpkm <- ddply(summaries, .(transcript), function(df) {
>   data.frame(symbol=df$symbol[1], counts=sum(df$counts))
> })
>
> (It actually calculates 2 more columns that are returned in the
> data.frame, but I'm not sure that's really important here).
>
> To test some things out, I've written another function to manually
> iterate/create subsets of my data.frame to summarize.
>
> I'm using sqldf to dump the data.frame into a db, then I lapply over
> subsets of the db (`where transcript=x`) to summarize each subset of
> my data into a list of single-row data.frames (like ddply is doing),
> and finish with a `do.call(rbind, the.dfs)` on this list.
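>
> For reference, here is a stripped-down sketch of that manual approach
> (using a plain split() on the data.frame rather than the sqldf
> round-trip, but it's the same idea):
>
> # split by transcript, summarize each piece, then rbind the pieces
> pieces <- split(summaries, summaries$transcript)
> the.dfs <- lapply(pieces, function(df) {
>   data.frame(symbol=df$symbol[1], counts=sum(df$counts))
> })
> rpkm.manual <- do.call(rbind, the.dfs)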
>
> This returns the exact same result ddply would return, and by the
> time `do.call` finishes, my RAM usage hits about 8 GB.
>
> So, what am I doing wrong with ddply that makes the difference in RAM
> usage during the last step ("collation" -- the equivalent of my final
> `do.call(rbind, my.dfs)`) more than 12 GB?
>
> Thanks,
> -steve
>
> -- 
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>


