[R] R/S and large datasets - Database access (also Re: SAS and S/R)

Timothy H. Keitt tklistaddr at keittlab.bio.sunysb.edu
Wed Nov 28 19:27:36 CET 2001


Emmanuel Charpentier wrote:

> A consensus seems to be emerging: R excels at exploratory work on 
> small to mid-sized datasets, while SAS can munch through much 
> larger datasets.
>
> However, I see the "size" problem as a red herring. The objects that 
> have to stay "in core" are usually much smaller than the dataset. For 
> example, for problems involving fixed-effects linear models, you need 
> only some matrices whose size is proportional to the square of the 
> number of *variables* and the (admittedly large) vector of residuals 
> (whose size is equal to the number of observations). Other cases 
> (nonlinear mixed effects models come to mind) are not as easily tamed 
> (any iterative process, such as ML estimation, has to go back to the 
> original data), but at least the time penalty involved in the use of 
> such an interface pays off by allowing you to treat problems that 
> would otherwise be intractable.
>
> I am aware of at least one database access package that lets you 
> access data without dragging a whole table into memory: the RPgSQL 
> package offers what it calls a "proxy variable", which is an object 
> that behaves, for all practical purposes, as a data frame, but is an 
> interface to database tables. I see this kind of interface as a way to 
> avoid overloading core memory with rarely used data.
>
> Unfortunately, this package is now officially orphaned by its 
> developer, who states that he now focuses on the next database 
> access standard: the Rdbi interface, which is currently under 
> development, and which I know nothing about.
>
> So the question is: does the Rdbi interface offer such a proxy to data 
> still residing in databases?
>
> Or am I barking up the wrong tree and trying to (re-)invent an 
> oversophisticated virtual memory manager? Should the use of a 
> sufficiently large swapfile be enough for these "large dataset" problems?
>
The problem with proxy data frames is that you can't pass them to 
functions like 'lm' (at least when I tried it long ago), because the 
functions that make the proxy object look like a data frame only exist 
at the R level. When you drop down to internal C code, you call a 
different set of (non-overloadable) functions, so the proxy just appears 
as a scalar object. Duncan's news about the generic "attach" interface 
may soon make this possible, however. Actually, having learned some SQL, 
I now find it indispensable. As you say, generally you only work with a 
small subset of your data, and SQL queries are the best way I've found 
to do the subsetting.
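
To make the subsetting workflow concrete, here is a minimal sketch. It 
is written in Python with the standard sqlite3 module purely as a 
stand-in; the same pattern should apply from R through whatever 
database interface is available. The table `obs`, its columns, the 
sample rows, and the helpers `fetch_subset` and `fit_line` are all 
hypothetical names invented for this illustration. The point is only 
that the WHERE clause runs inside the database, so just the rows you 
need ever reach memory.

```python
import sqlite3

# In-memory database with a hypothetical table of observations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (site TEXT, year INTEGER, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?, ?, ?)",
    [
        ("A", 2000, 1.0, 2.1),
        ("A", 2001, 2.0, 3.9),
        ("A", 2001, 3.0, 6.2),
        ("B", 2001, 1.0, 5.0),
        ("B", 2001, 2.0, 7.0),
    ],
)


def fetch_subset(conn, site, year):
    """Let the database do the subsetting; return only the matching (x, y) rows."""
    cur = conn.execute(
        "SELECT x, y FROM obs WHERE site = ? AND year = ? ORDER BY x",
        (site, year),
    )
    return cur.fetchall()


def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x on in-memory pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Only the two rows for site "A" in 2001 are pulled into memory and fitted.
subset = fetch_subset(conn, "A", 2001)
intercept, slope = fit_line(subset)
```

The full table never leaves the database; the model-fitting code sees 
an ordinary in-memory object, which sidesteps the proxy problem 
described above.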

Also, there has been some recent discussion of a proposed generic DBI 
interface for R/S. Rdbi was my attempt at one (actually what I originally 
set out to do with RPgSQL, but some necessary internal functions were not 
yet documented or in some cases not yet implemented). We more or less 
settled on David James' proposal, but I do not know if anyone is 
actually implementing it. It would be nice to have a reference 
implementation so we can try it out and see what we do or don't like. I 
hope to see all of this resolved soon, as I have less and less time to 
put into it and my interests are moving elsewhere (e.g., more GIS 
capabilities).

T.

-- 
Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/



