[R] R/S and large datasets - Database access (also Re: SAS and S/R)

David James dj at research.bell-labs.com
Tue Nov 27 19:05:34 CET 2001

The Rdbi (or perhaps simply DBI, for database interface, since it is
meant to include both R and Splus) is a simple interface to any database
management system or DBMS (so far only *relational* databases have been
considered) very similar in spirit to Java's Database Connectivity (JDBC),
Perl's Database Independent (DBI), Python's Database API.  It deals
primarily with a common set of function to interface R and Splus to
databases (PostgreSQL, Oracle, Access, MySQL, mSQL, etc.)  But we should
think of this DBI only as a first step, or the infrastructure on which
we can build more sophisticated tools.  The proxy table/variable is a
good example of such a tool.  But if it's good for PostgreSQL tables,
why not for Microsoft SQL tables? Or MySQL tables?  By having a common
interface, we hope to be able to build this sort of advanced tools
independent of the underlying DBMS.

Other applications may include the ability to attach() any database
to the search() path (together with the idea of proxy objects,
it could be helpful in some cases);  also, the possibility to do
"database apply" where we apply R functions to chunks on remote
tables.  (Roger Koenker and his colleague have an LM example, see
http://www.econ.uiuc.edu/~roger/research/rq/LM.html).  There has also
been some interest of approximating quantiles, applying GLM's, etc., to
very large datasets, but techniques like these will most likely require
new algorithms to work sequentially.

And of course, some also have pointed out (Brian Ripley, among others)
that sampling has been used quite successfully before by statisticians:-)
and thus could be quite useful in some of these cases.  I'm not aware
of any tools available yet to do this on remote DBMSes, but one would
hope that if such a tool were to be developed, it would be done on top
of the DBI so that it could be used with any DBMS.

Obviously, there's a lot to be done...


Emmanuel Charpentier wrote:
> A consensus seems to emerge : R would excel to exploratory work on 
> small/middle-sized datasets, while SAS would be able to munch much 
> larger datasets.
> However, I see the "size" problem as a red herring. The objects that 
> have to stay "in core" are usually much smaller than the dataset. For 
> example, for problems involving fixed-effects linear models, you need 
> only some matrices whose size is proportional to the square of the 
> number of *variables* and the (admittedly large) vector of residues 
> (whose size is equl to the number of observations). Other cases 
> (nonlinear mixed effects models come to mind) are not as easily tamed 
> (any iterative process (shuch as ML estimation) has to get back  to 
> original data), but at least, the time penalty involved in the use of 
> such an interface pays back by allowing you to treat problems otherwise 
> untractable.
> I am aware of at least one database access package that allows to access 
> data without dragging a whole table in memory : the RPgSql package 
> offers what it calls a "proxy variable", which is an objet that behaves, 
> for all practical purposes, as a dataframe, but is an interface to 
> database tables. I see this kind of interface as a way to avoid 
> overloading core memory with data scarcely used.
> Unfortunately, the said package is now officially orphaned by its 
> developper, which states that he now focuses on the next database access 
> standard : the Rdbi interface, which is currently under development, and 
> which I don't know a thing about.
> So the question is : do the Rdbi interface offers such a proxy to data 
> still residing in databases ?
> Or am I barking up the wrong tree and trying to (re-)invent an 
> oversophisticated virtual memory manager ?  SShould the use of a 
> suficiently large swapfile be enough for these "large dataset" problems ?
> --
>                                         Emmanuel Charpentier
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

David A. James
Statistics Research, Room 2C-253            Phone:  (908) 582-3082       
Bell Labs, Lucent Technologies              Fax:    (908) 582-3340
Murray Hill, NJ 09794-0636
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch

More information about the R-help mailing list