[R] Processing large datasets

Hugo Mildenberger Hugo.Mildenberger at web.de
Wed May 25 17:55:56 CEST 2011


With PostgreSQL at least, R can also be used as the implementation
language for stored procedures, so data transfers between
processes can be avoided altogether.

   http://www.joeconway.com/plr/

Implementation of such a procedure in R appears to be straightforward:
 
   CREATE OR REPLACE FUNCTION overpaid (emp) RETURNS bool AS '
      if (200000 < arg1$salary) {
          return(TRUE)
      }
      if (arg1$age < 30 && 100000 < arg1$salary) {
          return(TRUE)
      }
      return(FALSE)
    ' LANGUAGE 'plr';

   CREATE TABLE emp (name text, age int, salary numeric(10,2));
   INSERT INTO emp VALUES ('Joe', 41, 250000.00);
   INSERT INTO emp VALUES ('Jim', 25, 120000.00);
   INSERT INTO emp VALUES ('Jon', 35, 50000.00);
 

   SELECT name, overpaid(emp) FROM emp;

    name | overpaid
   ------+----------
    Joe  | t
    Jim  | t
    Jon  | f
   (3 rows)
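
From the R side, such a server-side function can then be used like any
other SQL function; only the small result set crosses the process
boundary. A minimal sketch, assuming the RPostgreSQL package and a
database named "testdb" (both are assumptions, not part of the PL/R
example):

   library(DBI)
   library(RPostgreSQL)

   ## connect to the (hypothetical) database holding the emp table
   con <- dbConnect(PostgreSQL(), dbname = "testdb")

   ## overpaid() is evaluated by PL/R inside the server;
   ## R only receives the two-column result
   res <- dbGetQuery(con, "SELECT name, overpaid(emp) FROM emp")
   print(res)

   dbDisconnect(con)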


Best 



On Wednesday 25 May 2011 14:12:23 Jonathan Daily wrote:
> In cases where I have to parse through large datasets that will not
> fit into R's memory, I will grab relevant data using SQL and then
> analyze said data using R. There are several packages designed to do
> this, like [1] and [2] below, that allow you to query a database using
> SQL and end up with that data in an R data.frame.
> 
> [1] http://cran.cnr.berkeley.edu/web/packages/RMySQL/index.html
> [2] http://cran.cnr.berkeley.edu/web/packages/RSQLite/index.html
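
A minimal sketch of that workflow, assuming the RSQLite package and a
hypothetical table "quotes" stored in quotes.db (the column names are
made up for illustration):

   library(DBI)
   library(RSQLite)

   con <- dbConnect(SQLite(), "quotes.db")

   ## push the filtering into SQL; only the reduced set reaches R
   bids <- dbGetQuery(con,
       "SELECT * FROM quotes WHERE symbol = 'DELL' AND side = 'Bid'")

   str(bids)   # an ordinary data.frame, ready for analysis
   dbDisconnect(con)
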
> 
> On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko <roman at bestroman.com> wrote:
> > Hi R list,
> >
> > I'm new to R, so I'd like to ask about its capabilities.
> > What I'm looking to do is to run some statistical tests on quite big
> > tables which are aggregated quotes from a market feed.
> >
> > This is a typical set of data.
> > Each day contains millions of records (up to 10 million unfiltered).
> >
> > 2011-05-24  750  Bid  DELL  14130770  400  15.4800  BATS  35482391  Y  1  1  0  0
> > 2011-05-24  904  Bid  DELL  14130772  300  15.4800  BATS  35482391  Y  1  0  0  0
> > 2011-05-24  904  Bid  DELL  14130773  135  15.4800  BATS  35482391  Y  1  0  0  0
> >
> > I'll need to filter it first based on some criteria.
> > Since I keep it in a MySQL database, this can be done with a query.
> > It's not super efficient; I've checked that already.
> >
> > Then I need to aggregate the dataset into different time frames (time is
> > represented in ms from midnight, like 35482391).
> > Again, this can be done with a database query; I'm not sure which will be faster.
> > The aggregated tables are going to be much smaller, on the order of thousands
> > of rows per observation day.
> >
> > Then I calculate basic statistics: mean, standard deviation, sums, etc.
> > After the stats are calculated, I need to perform some statistical
> > hypothesis tests.
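
Once the aggregated table is in R, the binning and summary statistics
are straightforward. A minimal sketch, assuming a data.frame "quotes"
with hypothetical columns time_ms (milliseconds from midnight) and
price:

   ## 1-minute buckets: 60000 ms per minute
   quotes$minute <- quotes$time_ms %/% 60000

   ## per-bucket mean, standard deviation and count
   agg <- aggregate(price ~ minute, data = quotes,
                    FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x)))
   head(agg)
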
> >
> > So, my question is: which tool is faster for data aggregation and filtering
> > on big datasets: MySQL or R?
> >
> > Thanks,
> > --Roman N.
> >


