[R] Processing large datasets

Roman Naumenko roman at bestroman.com
Wed May 25 16:18:48 CEST 2011


> Hi,

> On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko
> <roman at bestroman.com> wrote:
> > Hi R list,
> >
> > I'm new to R, so I'd like to ask about its capabilities.
> > What I'm looking to do is run some statistical tests on quite big
> > tables, which are aggregated quotes from a market feed.
> >
> > This is a typical set of data.
> > Each day contains millions of records (up to 10 million unfiltered).
> >
> > 2011-05-24 750 Bid DELL 14130770 400 15.4800 BATS 35482391 Y 1 1 0 0
> > 2011-05-24 904 Bid DELL 14130772 300 15.4800 BATS 35482391 Y 1 0 0 0
> > 2011-05-24 904 Bid DELL 14130773 135 15.4800 BATS 35482391 Y 1 0 0 0
> >
> > I'll need to filter it first based on some criteria.
> > Since I keep it in a MySQL database, this can be done with a query.
> > Not super efficient; I've checked that already.
> >
> > Then I need to aggregate the dataset into different time frames
> > (time is represented in ms from midnight, like 35482391).
> > Again, this can be done through a database query; I'm not sure
> > which will be faster.
> > The aggregated tables will be much smaller, on the order of
> > thousands of rows per observation day.
> >
> > Then I'll calculate basic statistics: mean, standard deviation,
> > sums, etc. After the stats are calculated, I need to perform some
> > statistical hypothesis tests.
> >
> > So, my question is: which tool is faster for aggregating and
> > filtering big datasets, MySQL or R?

> Why not try a few experiments and see for yourself -- I guess the
> answer will depend on what exactly you are doing.

> If your datasets are *really* huge, check out some packages listed
> under the "Large memory and out-of-memory data" section of the
> "HighPerformanceComputing" task view at CRAN:

> http://cran.r-project.org/web/views/HighPerformanceComputing.html

> Also, if you find yourself needing to do lots of
> "grouping/summarizing" type of calculations over large data
> frame-like objects, you might want to check out the data.table package:

> http://cran.r-project.org/web/packages/data.table/index.html
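For the grouping/summarizing step Steve mentions, a minimal data.table sketch (toy data in the same assumed shape as above; column names are illustrative, not from the original feed) might look like:

```r
library(data.table)

## Toy data; in practice this would be the day's filtered quotes
dt <- data.table(
  symbol = "DELL",
  minute = c(591L, 591L, 595L),   # ms-from-midnight bucketed to minutes
  price  = c(15.48, 15.48, 15.50),
  size   = c(400L, 300L, 135L)
)

## Grouped summary in one pass; data.table's `by` grouping is usually
## much faster than split()/aggregate() on large tables
stats <- dt[, .(mean_price = mean(price),
                sd_price   = sd(price),
                total_size = sum(size)),
            by = .(symbol, minute)]
```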

> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact

I don't think data.table is fundamentally different from the data.frame type, but thanks for the suggestion.

http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf
"Just like data.frames, data.tables must fit inside RAM"

The ff package by Adler, listed in "Large memory and out-of-memory data" is probably most interesting.
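A minimal sketch of what ff provides, assuming the package is installed; a real dataset would be loaded from disk (e.g. with read.csv.ffdf) rather than built in code as here:

```r
library(ff)

## ff vectors live in memory-mapped files on disk, so they can exceed
## RAM; only the chunks you index are pulled into memory.
x <- ff(vmode = "double", length = 10)  # disk-backed vector of 10 doubles
x[1:10] <- (1:10) * 1.5
total <- sum(x[])  # x[] materializes the data as an ordinary R vector
```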

--Roman Naumenko


