[R] Re : Large database help

Rogerio Porto rdporto1 at terra.com.br
Thu May 18 01:17:29 CEST 2006


Thank you all for the discussion.

I'll try to summarize the suggestions and give some partial conclusions
for the sake of completeness of this thread.

First, I had read the I/O manual but had forgotten the function read.fwf
suggested by Roger Peng. I'm sorry. However, following the manual's
guidance, this function is not recommended for large files, and I still
need to discover how to read fixed-width-format files with the scan
function, since there is no such example either in that manual or in
?scan. At a glance, it seems read.fwf inserts blank separators at the
column boundaries so that the file can then be read with a simple
scan() call.
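
For the record, one approach that seems feasible (the file name and
column positions below are just invented for illustration) is to read
the raw lines in chunks with scan() and cut the fields out with
substr():

  con <- file("bigfile.txt", open = "r")
  repeat {
    ## read the next 10000 raw lines as character strings
    lines <- scan(con, what = "", sep = "\n", n = 10000,
                  quote = "", quiet = TRUE)
    if (length(lines) == 0) break
    id <- substr(lines, 1, 8)                # field in columns 1-8
    x  <- as.numeric(substr(lines, 9, 14))   # field in columns 9-14
    ## ... process or accumulate summaries for this chunk here ...
  }
  close(con)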

I've also read the I/O manual, mainly chapter 4 on using relational
databases. This suggestion was put forward by Uwe Ligges and Justin Bem,
who advocated the use of MySQL with the RMySQL package. I'm still
installing MySQL to try to convert my fixed-width-format file to that
database but, from the I/O manual, it seems SQL itself offers only five
descriptive statistics (aggregate functions), so I couldn't calculate
medians or run more advanced analyses, like a cluster analysis, inside
the database. This point was raised by Robert Citek and thus I'm not
sure that working with MySQL alone will solve my problem. RMySQL does,
however, have a dbApply function that applies R functions to groups
(chunks) of database rows.
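
Just to make the idea concrete, here is a rough sketch (database, table
and column names invented) of pulling rows into R in chunks through
RMySQL, after which any R function can be applied to them:

  library(RMySQL)
  con <- dbConnect(dbDriver("MySQL"), dbname = "mydb")  # details invented
  res <- dbSendQuery(con, "SELECT grp, x FROM bigtable ORDER BY grp")
  repeat {
    chunk <- fetch(res, n = 50000)   # bring 50000 rows at a time into R
    if (nrow(chunk) == 0) break
    ## any R function can run on the chunk, e.g. a median per group
    ## (a group may straddle two chunks; dbApply does that bookkeeping)
    print(tapply(chunk$x, chunk$grp, median))
  }
  dbClearResult(res)
  dbDisconnect(con)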

There was also a suggestion, by Roger Peng, to subset the file.
Almost all participants in this thread noted the need for lots of RAM
even when working with just a few variables, as pointed out by
Prof. Brian Ripley.
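
As an aside, if the file (or a delimited export of it) can be read with
read.table, the colClasses argument lets one load only the few
variables that are actually needed, which keeps the RAM requirement
down (the column layout here is invented):

  ## keep columns 1 and 3, skip the other two by declaring them "NULL"
  dat <- read.table("export.csv", sep = ",", header = TRUE,
                    colClasses = c("integer", "NULL", "numeric", "NULL"))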

The future looks promising through a collection of *big* packages
specially designed to handle big data files on almost any hardware and
OS configuration, although time-demanding in some cases. It seems the
first one in this collection is the biglm package by Thomas Lumley,
cited by Greg Snow. The obvious drawback is that one has to re-write
every package that can't handle big data files or, at least, their most
memory-demanding operations. This last point could be implemented by an
option like big.file=TRUE incorporated into some functions. This point
of view is one of *scaling up* the methods.
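
For illustration, a small sketch of that chunked approach with biglm
(file and variable names invented): the model is fitted on the first
piece of the data and then updated with the remaining pieces.

  library(biglm)
  chunk <- read.table("part1.txt", header = TRUE)
  fit <- biglm(y ~ x1 + x2, data = chunk)   # fit on the first chunk
  chunk <- read.table("part2.txt", header = TRUE)
  fit <- update(fit, chunk)                 # update with the next chunk
  summary(fit)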

Another promising way is to *scale down* the dataset. Statisticians
know these techniques from non-hierarchical cluster analysis and
principal component analysis, among others (mainly sampling). Engineers
and signal-processing people know them from data compression. Computer
scientists work with training sets and data mining, which use methods
to scale down datasets. An example was given by Richard M. Heiberger,
who cited a paper by William DuMouchel et al. on Squashing Flat Files.
Maybe there could be some R functions specialized in these methods
that, using a DBMS, retrieve a significant subset of the data (records
and variables) small enough to be handled by R.
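
As one crude example of scaling down (plain random sampling rather than
DuMouchel's squashing; all names invented), the DBMS itself could draw
the sample and R would analyse only the reduced data set:

  library(RMySQL)
  con <- dbConnect(dbDriver("MySQL"), dbname = "mydb")
  ## let MySQL draw a simple random sample of 100000 records
  samp <- dbGetQuery(con,
            "SELECT x, y FROM bigtable ORDER BY RAND() LIMIT 100000")
  dbDisconnect(con)
  ## the reduced data set now fits in memory, so ordinary methods apply
  cl <- kmeans(samp[, c("x", "y")], centers = 5)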

That's all, for a while!

Rogerio.



