[R] Processing large datasets

Steve Lianoglou mailinglist.honeypot at gmail.com
Wed May 25 16:49:34 CEST 2011


Hi,

On Wed, May 25, 2011 at 10:18 AM, Roman Naumenko <roman at bestroman.com> wrote:
[snip]
> I don't think data.table is fundamentally different from data.frame type, but thanks for the suggestion.
>
> http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf
> "Just like data.frames, data.tables must fit inside RAM"

Yeah, I know -- I only mentioned it in the context of manipulating
data.frame-like objects -- sorry if I wasn't clear.

If you've got data.frame-like data that fits in RAM AND you find
yourself wanting to do summary calculations over different subgroups
of it, you might find that data.table is a quicker way to get that
done -- the larger your data.frame/table, the more noticeable the
speedup.
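
For instance, here's a minimal (untested) sketch of the kind of
grouped summary I mean, with made-up column names:

library(data.table)

## toy data: one million rows, then a grouped mean per letter
dt <- data.table(grp = sample(letters, 1e6, replace = TRUE),
                 val = rnorm(1e6))
setkey(dt, grp)              # keyed grouping is where data.table shines
dt[, mean(val), by = "grp"]  # mean of val within each group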

To give you an idea of the scenarios I'm talking about, other
packages you'd use to do the same thing would be plyr and sqldf.
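
For comparison, the same grouped mean with those packages might look
something like this (again just a sketch, same made-up columns, df an
ordinary data.frame):

df <- data.frame(grp = sample(letters, 1e6, replace = TRUE),
                 val = rnorm(1e6))

library(plyr)
ddply(df, "grp", summarise, avg = mean(val))

library(sqldf)
sqldf("SELECT grp, AVG(val) AS avg FROM df GROUP BY grp")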

For out-of-memory datasets, you're in a different realm -- hence the
HPC Task View link.

> The ff package by Adler, listed in "Large memory and out-of-memory data" is probably most interesting.

Cool.
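
For what it's worth, the basic trick in ff is that the data live in a
file on disk and get paged into RAM in chunks, so you can work with
vectors longer than memory. Something like this (untested sketch):

library(ff)

## a double vector of 1e8 elements backed by a file, not by RAM
x <- ff(vmode = "double", length = 1e8)
x[1:5] <- rnorm(5)   # chunks are swapped in/out transparently
x[1:5]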

I've had some luck using the bigmemory package (and friends) in the
past as well.
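
A file-backed big.matrix looks something like this (sketch only; the
file names here are made up):

library(bigmemory)

## a matrix stored in a memory-mapped file, so it can exceed RAM
x <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                           backingfile = "big.bin",
                           descriptorfile = "big.desc")
x[1, ] <- rnorm(10)

The companion packages (biganalytics, bigtabulate) handle summaries
over such matrices.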

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


