[R] RMySQL vs. Rdbi

Sun Jan 16 00:22:16 CET 2005

Hello,

I know that the topics of using large datasets in R vs. SAS, using
PostgreSQL vs. MySQL, and using databases with R have been discussed
extensively on this list and elsewhere. However, I hope that I have a
slightly new combination of those questions here.

I am doing my PhD research on a large dataset and trying to decide whether
to use PostgreSQL or MySQL with R, or simply to use SAS, to which I also
have access. My database experience is currently limited to reading a few
chapters of a textbook, but I have some general programming experience in
C, C++, and Perl. I am leaning towards PGSQL at the moment because it seems
to have more core functions (i.e., less writing for me to do), but on the
other hand, our sysadmin already has MySQL installed, and I hear that it's
faster.

Of PostgreSQL vs. MySQL, which has the more mature interface with R? Are
there any issues with RMySQL or Rdbi.PGSQL (or .MySQL) that I should be
aware of, and should they influence my decision among MySQL, PGSQL, and the
SAS integrated database?

My dataset is about 26 GB, currently split up into files of 260 MB... about
540,000 records with 40 "explanatory" variables, many of which are probably
redundant, but I just don't know at the moment. It was way too slow to
work with in R on Red Hat Linux machines with 500 MB to 1 GB of RAM,
especially when producing plots. Preprocessing with Perl scripts every time
I wanted to look at a different subset of the data became too tedious. I
hope to create exploratory graphics such as sunflower plots, and also to try
some lattice plots to help me get a feel for the data. Then I'm interested
in trying some stepwise ANOVA, and finally searching for patterns using
discriminant analysis and/or classification trees.

I would greatly appreciate any advice you might have on choosing a
database software environment.

Thank you,
Jean Chung