[R] R for large data sets

Wed Jan 16 18:03:00 CET 2002

On Wed, 16 Jan 2002, José Ernesto Jardim wrote:

> Something like
>
> select 'column' from mat where 'column' > 0
>
> The difference is that in sql tables, columns must have names, so
> instead of using a relative reference like mat[,1] you should use the
> name of that column. The rest is very intuitive. The SQL language is
> more "human like" than R/S so it becomes easier to work with subsets.

The difference, I believe, is that this is not the same operation as the S
indexing asked for (mat[mat[,1]>0,] keeps every column of the selected
rows), which shows that something is not intuitive!  I think you want

select * from mat where 'column' > 0

but beware that whereas S will do the necessary coercions for ` > ', many
SQL dialects will not.
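
To make the distinction concrete, here is a minimal sketch in R. The matrix
and its column names are made up for illustration; they are not from the
original question.

```r
# Hypothetical 4 x 2 matrix; the column names are assumptions.
mat <- cbind(x = c(-1, 2, 0, 5), y = c(10, 20, 30, 40))

# Roughly "select * from mat where x > 0":
# every column of the rows that pass the test.
rows <- mat[mat[, 1] > 0, ]

# Roughly "select x from mat where x > 0":
# only the tested column of those rows.
col_only <- mat[mat[, 1] > 0, 1]
```

Here rows is still a two-column matrix, whereas col_only is a plain vector.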

> I think that everyone that works with the S language will learn the
> basics of SQL very fast and will gain a lot in working with large datasets.

Possibly.  I find S indexing a lot more powerful, and a lot faster for
datasets that fit into S's memory (and that is pretty large: we regularly
use 100 MB datasets).
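
As a rough illustration of the kinds of indexing meant here, the sketch
below shows a few forms on a small made-up matrix (the data and column
names are hypothetical):

```r
# Hypothetical data, for illustration only.
mat <- cbind(x = c(-1, 2, 0, 5), y = c(10, 20, 30, 40))

# Negative index: everything except the first row.
dropped <- mat[-1, ]

# Index by a permutation: rows sorted by column x.
sorted <- mat[order(mat[, "x"]), ]

# Combined logical conditions; drop = FALSE keeps the matrix shape
# even when a single row is selected.
both <- mat[mat[, "x"] > 0 & mat[, "y"] < 40, , drop = FALSE]

# Index by a two-column matrix: picks elements (1,2) and (3,1).
picked <- mat[cbind(c(1, 3), c(2, 1))]
```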

> Take a look at http://www.sqlcourse.com
>
> Regards
>
> EJ
>
> Agustin Lobo wrote:
>
> >This is really elegant, Ernesto. The only problem is
> >getting used to the database language as well. Do you
> >have some sort of small R-MySQL dictionary
> >for the (few) subsetting procedures that we
> >commonly use in R? For example,
> >how would you say mat[mat[,1]>0,] in
> >MySQL?
> >
> >Agus
> >
> >On 16 Jan 2002, Ernesto Jardim wrote:
> >
> >>Hi
> >>
> >>I'm using some large datasets and I found the ROracle package to be of
> >>great help.
> >>
> >>If you have the chance to create a database in Oracle or MySQL with one
> >>single table for your dataset, you can then use the ROracle package to
> >>access the dataset. I found several advantages in that.
> >>
> >>I don't import the data into my environment. I use a small function (see
> >>below) to access the dataset, and because the result is a data.frame you
> >>can use it as usual.
> >>
> >>Your environment will not be too large and less of your RAM will be
> >>in use.
> >>
> >>It's easier to select subsets with SQL than with the S/R language.
> >>
> >>Hope it helps
> >>
> >>Regards
> >>
> >>EJ
> >>
> >>--//--
> >>
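> >>ora.fun <- function(){
> >>
> >>        library(ROracle)
> >>        # load the Oracle driver and open a connection
> >>        m <- dbManager("Oracle")
> >>        con <- dbConnect(m,user="user",password="password")
> >>        # run the query; the result comes back as a data.frame
> >>        dat <- quickSQL(con,"select ...")
> >>        # release the connection and the driver, then return the data
> >>        close(con)
> >>        unload(m)
> >>        dat
> >>
> >>}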
> >>
> >>--//--
> >>
> >>On Tue, 2002-01-15 at 19:43, Prof Brian Ripley wrote:
> >>
> >>>On Tue, 15 Jan 2002, wei, xiaoyan wrote:
> >>>
> >>>>As a part of our regular data analysis, I have to read in large data sets
> >>>>with six columns and about a million rows. In Splus, this usually takes a
> >>>>couple of minutes. I just tried R, and it seems to take forever to use
> >>>>read.table() to read in the data frame! It did not help much even though
> >>>>I specified colClasses and nrows in read.table().
> >>>>
> >>>>How well does R handle large data sets? I used R on Solaris 2.6 and I
> >>>>used all default compilation flags when building the R package. Will it help
> >>>>if I use some compilation flags with a higher optimization level?
> >>>>
> >>>It will help to use R-patched, since I guess you are using 1.4.0.
> >>>Also, look in the list archives, as I answered this more fully earlier
> >>>today.
> >>>
> >>>In either S-PLUS or R, scan would be a better choice for such a dataset.
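
A minimal sketch of the scan() approach for a six-column numeric file
follows. The file contents and the column names V1..V6 are assumptions made
only so that the example is self-contained; a real file would be read
directly by name.

```r
# Write a small hypothetical six-column file so the sketch can run as is.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1 2 3 4 5 6", "7 8 9 10 11 12"), tmp)

# 'what' is a list with one numeric template per column, so scan()
# returns a list of six numeric vectors, one per column.
cols <- scan(tmp, what = rep(list(0), 6), quiet = TRUE)
names(cols) <- paste0("V", 1:6)   # hypothetical column names

# Convert to a data frame only once all the columns have been read.
dat <- as.data.frame(cols)
unlink(tmp)
```

Supplying the column types through 'what' means scan() does not have to
guess them from the data.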
> >>>
> >>>--
> >>>Brian D. Ripley,                  ripley at stats.ox.ac.uk
> >>>Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> >>>University of Oxford,             Tel:  +44 1865 272861 (self)
> >>>1 South Parks Road,                     +44 1865 272860 (secr)
> >>>Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> >>>
> >>>-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> >>>r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> >>>Send "info", "help", or "[un]subscribe"
> >>>(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
> >>>_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
> >>>
> >>
> >>
>
>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595