[R] Processing large datasets/ non answer but Q on writing data frame derivative.

Mike Marchywka marchywka at hotmail.com
Wed May 25 16:55:08 CEST 2011

----------------------------------------
> Date: Wed, 25 May 2011 09:49:00 -0400
> From: roman at bestroman.com
> To: biomathjdaily at gmail.com
> CC: r-help at r-project.org
> Subject: Re: [R] Processing large datasets
>
> Thanks Jonathan.
>
> I've already been using RMySQL to load data for a couple of days.
> I wanted to know what the relevant R capabilities are if I want to process much bigger tables.
>
> R always reads the whole set into memory, and this might be a limitation in the case of big tables, correct?

OK, now I'll ask: perhaps for my first R effort I will try to find the source
code for data.frame and write a paging or streaming derivative. That is, at
least for fixed-size data, it could still supply things like the total row
count while paging blocks in and out of memory. Presumably all users of a data
frame work through a limited interface, which I guess could be extended with
various hints such as "prefetch this". I haven't looked at this idea in a
while, but the issue keeps coming up. Maybe one for the dev list?
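In the meantime, a poor man's version of that streaming idea can be written in
plain R by reading a flat file in fixed-size blocks over an open connection and
doing the per-block work as you go, so the full table never sits in memory at
once. This is only a sketch: the file name "quotes.txt" and the block size of
100000 lines are made-up placeholders, and the per-block work here just counts
rows.

```r
# Sketch: stream a large flat file through R in fixed-size blocks.
# "quotes.txt" and the block size are hypothetical.
con <- file("quotes.txt", open = "r")
total_rows <- 0
repeat {
  lines <- readLines(con, n = 100000)      # next block of raw lines
  if (length(lines) == 0) break            # end of file
  tc <- textConnection(lines)
  chunk <- read.table(tc)                  # one block as a small data frame
  close(tc)
  total_rows <- total_rows + nrow(chunk)   # replace with real per-block work
}
close(con)
```

Each `chunk` is an ordinary data frame, so any filtering or aggregation that
works on a small table works unchanged inside the loop; only whole-table
operations (sorting, joins) need more thought.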

Anyway, for your immediate issue of computing a few statistics, you could
probably write a simple C++ program that ultimately becomes part of an R
package. It is a good idea to see what is already available, but these
questions come up here a lot and the usual suggestion is "use a DB", which
is exactly the opposite of what you want if you have predictable
access patterns (although even there, prefetch could probably be implemented).
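Since you already have the data in MySQL via RMySQL, one middle ground is to
pull the result set incrementally rather than all at once: dbSendQuery() plus
fetch(n = ...) are standard DBI calls for exactly this. The connection
parameters, table, and query below are invented for illustration, and the
one-pass variance formula is numerically fragile for very large n; it is a
sketch, not a drop-in solution.

```r
library(RMySQL)
# Connection parameters and the query are placeholders.
con <- dbConnect(MySQL(), dbname = "quotes", user = "roman")
res <- dbSendQuery(con, "SELECT price FROM ticks WHERE symbol = 'DELL'")
n <- 0; s <- 0; ss <- 0
while (!dbHasCompleted(res)) {
  chunk <- fetch(res, n = 50000)    # next block of rows only
  n  <- n  + nrow(chunk)
  s  <- s  + sum(chunk$price)
  ss <- ss + sum(chunk$price^2)
}
dbClearResult(res)
dbDisconnect(con)
mean_price <- s / n
sd_price   <- sqrt((ss - n * mean_price^2) / (n - 1))
```

This keeps only one 50000-row block in R at a time, which is the same trade-off
a paging data frame would make, just done by hand.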

> Doesn't it use temporary files or something similar to deal with such amounts of data?
>
> As an example I know that SAS handles sas7bdat files up to 1TB on a box with 76GB memory, without noticeable issues.
>
> --Roman
>
> ----- Original Message -----
>
> > In cases where I have to parse through large datasets that will not
> > fit into R's memory, I will grab relevant data using SQL and then
> > analyze said data using R. There are several packages designed to do
> > this, like [1] and [2] below, that allow you to query a database
> > using
> > SQL and end up with that data in an R data.frame.
>
> > [1] http://cran.cnr.berkeley.edu/web/packages/RMySQL/index.html
> > [2] http://cran.cnr.berkeley.edu/web/packages/RSQLite/index.html
>
> > On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko wrote:
> > > Hi R list,
> > >
> > > I'm new to R software, so I'd like to ask about its capabilities.
> > > What I'm looking to do is run some statistical tests on quite big
> > > tables of aggregated quotes from a market feed.
> > >
> > > This is a typical set of data.
> > > Each day contains millions of records (up to 10 non filtered).
> > >
> > > 2011-05-24 750 Bid DELL 14130770 400 15.4800 BATS 35482391 Y 1 1 0 0
> > > 2011-05-24 904 Bid DELL 14130772 300 15.4800 BATS 35482391 Y 1 0 0 0
> > > 2011-05-24 904 Bid DELL 14130773 135 15.4800 BATS 35482391 Y 1 0 0 0
> > >
> > > I'll need to filter it first based on some criteria.
> > > Since I keep it in a MySQL database, this can be done with a query,
> > > though that's not super efficient; I've checked already.
> > >
> > > Then I need to aggregate the dataset into different time frames
> > > (time is represented in ms from midnight, like 35482391).
> > > Again, this can be done with a database query; I'm not sure which
> > > will be faster.
> > > The aggregated tables are going to be much smaller, like thousands
> > > of rows per observation day.
> > >
> > > Then calculate basic statistics: mean, standard deviation, sums, etc.
> > > After stats are calculated, I need to perform some statistical
> > > hypothesis tests.
> > >
> > > So, my question is: which tool is faster for aggregating and
> > > filtering big datasets: MySQL or R?
> > >
> > > Thanks,
> > > --Roman N.
> > >
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide
> > > http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> > >
>
> > --
> > ===============================================
> > Jon Daily
> > Technician
> > ===============================================
> > #!/usr/bin/env outside
> > # It's great, trust me.
>