[Rd] generic database access methods

David James David James <dj@research.bell-labs.com>
Mon, 2 Apr 2001 13:24:08 -0400 (EDT)

Hi Tim,

Thanks for putting together the Rdbi package.  Attached are my (belated) 
comments and some further suggestions/questions.


> Date: Mon, 19 Mar 2001 16:19:39 -0500
> From: "Timothy H. Keitt" <Timothy.Keitt@StonyBrook.Edu>
> User-Agent: Mozilla/5.0 (X11; U; Linux 2.2.18 i686; en-US; 0.8) Gecko/20010215
> X-Accept-Language: en
> MIME-Version: 1.0
> To: r-devel@hypatia.math.ethz.ch
> Subject: [Rd] generic database access methods
> Content-Transfer-Encoding: 7bit
> I've been putting together a package that defined generic methods for 
> database access.  The packages is called "Rdbi."  It borrows as much as 
> possible from existing database packages / proposals.  I'd like to start 
> a discussion about the proposed interface.  Here's what I've come up 
> with so far:
> #
> # Rdbi: connectionMethods.R
> #
> dbConnect <- function(dbObj, ...) UseMethod("dbConnect")
> #
> # dbObj should be generated by an implementation defined function,
> # e.g., RPgSQL(), which returns a object with an approprite class
> # name, e.g., "PostgreSQL" or "MySQL".  The implementation then
> # provides a method that actually opens the connection.
> #
> dbConnect.default <- function(dbObj, ...) stop("Invalid database class")
> #
> # User passed something other than a defined database class
> #
> dbDisconnect <- function(conn) UseMethod("dbDisconnect")
> #
> # Close connection and free resources
> #
> dbDisconnect.default <- function(conn) stop("Invalid connection object")
> dbReconnect <- function(conn) UseMethod("dbReconnect")
> #
> # The conn object should store the options passed to dbConnect so that
> # it can reestablish the connection.
> #
> dbReconnect.default <- function(conn) stop("Invalid connection object")
> #
> # User passed something other than a defined connection class
> #
> dbConnectionOK <- function(conn) UseMethod("dbConnectionOK")
> #
> # Returns TRUE if the connection object is valid and the connection
> # can be used to communicate with the database backend.
> #
> dbConnectionOK.default <- function(conn) FALSE
> #
> # User passed something other than a defined connection class
> #

I think that implicitly there's a class "dbConnection"
that the various implementations RPgSQL, RMySQL, RODBC, ROracle, etc.,
would inherit from (extend).  Right?  Having the various implementations
inherit from a common base class, say, "dbConnection", could be
useful for grouping common "database connection" functionality --
for instance, imagine we want to implement some functionality
on top of the common Rdbi, and thus for *any* DBMS.  I could easily
imagine moving your idea of proxy data.frames from the RPgSQL
package to, say, proxyTable, so that it can be used with ROracle, RODBC,
etc.  Another example would be the ability to attach(conn) and then
have R manage the reading/writing of tables,  e.g.,

  (See John M. Chambers "Data Management in S" in
  http://stat.bell-labs.com/stat/doc/doc/93.15.ps.gz and
  and "Database Classes" http://stat.bell-labs.com/stat/Sbook/DatabaseClasses)

  > con <- dbConnect("RPgSQL", ...)
  > attach(con)
  > ls(pos=2)
  [1] table1  table2 ...
  > median(table1$fieldX)
  > exists("table2", where=2)
  > remove("table4", where=2)
  > assign("newTable4", my.results, where = 2)

We could implement this for any DBMS *provided* we define 
the various R methods read/write/exists methods in terms of the
base class "dbConnection" and not in terms of the individual
RMySQL, RODBC, RPgSQL (otherwise each new R<dbms> driver would be required
to implement the attached methods, the proxyTable methods, ....)

Following the approach above, it would be very simple to attach
a DBMS and treat *all* the tables as db.proxy independently of the
actual DBMS (provided the methods are, again, defined in terms of
the common base "dbConnection").

> #
> # Rdbi: ioMethods.R
> #
> dbSendQuery <- function(conn, ...) UseMethod("dbSendQuery")
> #
> # This function submits a query string to the backend.  If the query
> # generates any results, they should be returned in a data frame;
> # otherwise the function returns NULL.  Printing error messgages is
> # left to the implementation, but this function should not issue a
> # call to error() or stop().  Error messages may be retrievable from
> # the connection object depending on implementation.
> #
> dbSendQuery.default <- function(conn, ...) stop("Invalid connection object")
> #
> # This means that the method was called on something other than a
> # connection object.
> #

I would suggest that, if the the query generates a result, dbSendQuery 
retuns a "result set" or "cursor" object.  Imagine the result is 
huge -- that could easily kill the R process.  This object ("cursor" or
"result set") could have some methods to fetch(cur, n = num.rows),
plus some self-description to show its state -- e.g. to show the
query being procesed, plus methods to tell us how many rows we've
fetched so far, etc.  Even if dbSendQuery() does not generate output,
we may still want to have some way of querying what happen with the
operation;  for instance, how many rows were affected by the operation

    > rs <- dbSendQuery(conn, "delete from TABLE1 where id = 1234")
    > if(rowsAffected(rs) != 1){
         ## oops, something isn't quite right!
    +    rollback(conn)
    + }

In many situation, users may not want to bother with this level
of detail -- all they may want is to get the data.  For this I'd follow
the ODBC(?) "execute immediate" idea and implement a very similar 
method for those cases when you know want to do the "dbSendQuery" 
and "fetch" in one scoop, say "dbExec" or something like it.

> dbListTables <- function(conn, ...) UseMethod("dbListTables")
> #
> # Output a list of table names associated with the connection object.
> # Output should look like output from ls().
> #
> dbListTables.default <- function(conn, ...) stop("Invalid connection 
> object")
> dbReadTable <- function(conn, ...) UseMethod("dbReadTable")
> #
> # Analog of the read.table() function.  Returns a data frame.  Some
> # facility for converting DB types to R types is helpfull.  Generally,
> # this will just pass a "select * from mytable" to dbSendQuery() which
> # returns the data frame.
> #
> dbReadTable.default <- function(conn, ...) stop("Invalid connection object")
> dbWriteTable <- function(conn, ...) UseMethod("dbWriteTable")
> #
> # Analog of write.table().  It should assemble input into a data frame
> # and write the frame into a table in database.
> #
> dbWriteTable.default <- function(conn, ...) stop("Invalid connection 
> object")

Yes. This would work nicely, specially for packages that use the Rdbi.


The following are some issues that have been brought up
either through private conversations and/or during the DSC 2001

1. Data Type Conversions:

Another set of generic functions that we may need to consider 
could include functions to specify data conversions: for instance,
a generic getDataType(dbObj, s.atomic.object) that would return the
appropriate data type on the remote DBMS associated with "conn".
For instance, if "ora" and "pg" are Oracle and PostgreSQL connections,
respectively, the expressions

  > x.int <- as.integer(x)
  > ora.int <- getDataType(ora, x.int)
  > pg.int <- getDataType(pg, x.int)	

would return possibly different DBMS-specific SQL data type for the same   
integer (and possibly different for the same DBMS, in the case
of R and Splus, as Brian Ripley has pointed out, I believe in the 
R Data Import/Export manual(?)).

We would also need the reverse DBMS type -> R/S object.

Finally, there'll be times that we cannot automatically convert
non-primitive objects (images? sound?) to/from DBMS and R/S -- we
should have a mechanism to allow user to specify their own conversion
functions.  We need to think of a "user-friendly" mechanism to do this.

2. DBMS vs R/S identifiers:

Do we need/want to handle for the user the mapping between
remote identifiers (e.g., "TABLE_ABC") and R/S?  This could
be easily implemented using the ideas in data mapping ideas above.

3. Transactions:

Do we want to define generic functions to manage transactions? 

4. Remote Objects:

If we decide that having "cursor/result set" objects is desirable,
then we should note that they are similar to "dbConnection" objects
in the sense that they are "remote object," i.e., R objects that reference
remote objects, (remote db object, to be more precise).  All these remote
objects may share some common functionality:  all remote objects need 
to be validated before they are passed to the DBMS (e.g., dbConnectionOk),
we need to ensure that we free the (remote) resources even in the
case of errors/exceptions, perhaps through judicious use of on.exit()
(see also Luke Tierney's "Simple References with Finalization"
http://developer.r-project.org/simpleref.html, but keep in mind that 
this may/may not be portable to S-plus.)

5. Naming convention:

We could use (as you are suggesting) a naming convention consisting
of concatenated words with capitalization for non-starting 
words, e.g., dbDisconnect).  The traditional R/S way is to separate 
words with ".", e.g., "read.table".  I don't have too strong a preference,
but perhaps we should be explicit about it?

> T.
> -- 
> Timothy H. Keitt
> Department of Ecology and Evolution
> State University of New York at Stony Brook
> Phone: 631-632-1101, FAX: 631-632-7626
> http://life.bio.sunysb.edu/ee/keitt/
> r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch

Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Phone: 631-632-1101, FAX: 631-632-7626

r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch