[Rd] R as analysis server for very large data sets

George Ostrouchov ostrouchovg@ornl.gov
Wed Feb 19 02:07:02 2003

At ORNL, we are building a system, ASPECT (Adaptive Simulation Product 
Exploration and Control Toolkit), for analyzing output from massive 
simulations. It is essentially a client-server setup that reads 
netCDF and HDF files and uses MPI for some distributed tasks. The total 
output of a simulation can be terabytes, but an individual variable may be 
only a gigabyte, and some relevant subsets are even smaller. In theory, a 
single variable can be handled on a 64-bit machine with a few gigabytes 
of memory, say 10 GB. I understand that some folks have had some success 
running R on 64-bit machines.

In addition to some home-grown distributed data analysis codes, we have 
included a facility for calling a limited subset of R functions from 
ASPECT. Simple use of R on a large data set did not work well. For 
example, computing a simple histogram consumed several times (I think it 
was 3 times) the memory required for the data itself. Some 
editing of the hist.default function fixed the problem but reduced the 
generality of the function. The default seemed to generate a dimnames 
attribute that became as large as the data. It may be that our initial 
data matrix had some attributes we were not aware of.
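
A minimal sketch of how one might check for such hidden attributes and 
compare memory footprints. The matrix and its row names are made-up 
stand-ins (not ASPECT data), and the bare bin count at the end is only an 
illustration of a reduced-generality histogram, not the actual edit made 
to hist.default:

```r
# Hypothetical stand-in for one simulation variable.
set.seed(1)
x <- matrix(rnorm(1e5), nrow = 1e4, ncol = 10)
rownames(x) <- paste0("cell", seq_len(nrow(x)))  # assumed metadata

# Which attributes does the object carry?
attr_names <- names(attributes(x))

# Footprint with and without the dimnames attribute.
size_with <- object.size(x)
y <- x
dimnames(y) <- NULL
size_without <- object.size(y)

# Bare bin counts: no labels, no plot object, just integers.
nbins  <- 20L
breaks <- seq(min(y), max(y), length.out = nbins + 1L)
counts <- tabulate(findInterval(y, breaks, rightmost.closed = TRUE),
                   nbins = nbins)
```

Printing attr_names, size_with, and size_without shows whether the 
metadata is a significant share of the total. How much is saved will 
depend on the data and on the R version in use.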

It seems that generality and metadata generation in R run counter to R's 
ability to handle large data sets. Can someone comment on this?

Are there functions in R that will strip a variable of all its 
attributes except those that define its structure (vector, matrix, or 
array)? Or are there options to prevent some functions from generating 
more attributes? Or perhaps an attribute that prevents further 
attributes from being added?
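
To make the first question concrete, here is a sketch of the kind of 
helper meant. The name strip_attributes is hypothetical, not an existing 
R function. It uses only base R and keeps nothing but the dim attribute:

```r
# Hypothetical helper: drop every attribute except 'dim'.
strip_attributes <- function(x) {
  d <- dim(x)             # NULL for a plain vector
  attributes(x) <- NULL   # removes dim, dimnames, names, class, ...
  dim(x) <- d             # restore the shape; no-op when d is NULL
  x
}

# Usage on a small matrix with dimnames and an extra attribute.
m <- matrix(1:6, nrow = 2, ncol = 3,
            dimnames = list(c("a", "b"), NULL))
attr(m, "units") <- "K"
s <- strip_attributes(m)
```

Base R's as.vector() and unname() cover parts of this: as.vector() 
drops all attributes of an atomic vector (including dim), and unname() 
removes names and dimnames only. Whether any of these avoids copying the 
data is not something this sketch guarantees.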

Does it make sense to propose building (assuming that someone has time 
to do it) a "large data" subset of R?

Thanks for your help,

George Ostrouchov
Statistics and Data Sciences Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory