[R] Performance & capacity characteristics of R?

Karsten M. Self kmself at ix.netcom.com
Wed Aug 4 08:21:39 CEST 1999


Prof Brian D Ripley wrote:
> 
> On Tue, 3 Aug 1999, Karsten M. Self wrote:

> > I'm exploring R's capabilities and limitations.  I'd be very interested
> > in having a deeper understanding of its capacity and performance
> > limitations in dealing with very large datasets, which I would classify
> > as tables with 1 million to 100s of millions of rows and 2 to 100+
> > fields (variables), generally of 8 bytes each -- call it a 16 to 800
> > byte record length.
> 
> Can you tell us what statistical procedures need 1 million to 100s of
> millions of rows (observations)?  Some of us have doubted that there are
> even datasets of 100,000 examples that are homogeneous and for which a
> small subsample would not give all the statistical information. (If they
> are not homogeneous, one could/should analyse homogeneous subsets and do a
> meta-analysis.)

Thank you for your response.

Fair question, and honestly I'm a bit pressed to answer it.  Most of my
own programming is data management, categorizing, and business
reporting type work, much of which is better suited to SQL or a
procedural data language such as Perl, the SAS data step, or
DBMS/Program (a new product from Conceptual Software, maker of
DBMS/Copy).  Typically the intent when working with such large datasets
is to summarize the data, subset it, or both, down to something which can
be meaningfully worked with.  I'd generally agree with your assessment:
I rarely see complex statistical analysis performed on more than
10,000 to 100,000 observations, and very frequently much less.

Where statistics are computed for very large datasets, they are
typically:

 - Descriptive univariate statistics (min, max, mean, median, mode,
percentiles, skew, kurtosis, etc.).  Simple descriptive statistics can
often be computed in an SQL package, or by tools which read only one or
a few lines of data at a time (e.g. the SAS data step, AWK); a sketch of
the same chunked approach in R follows this list.

 - GLM regressions.  In one case involving engineering data, the
underlying dataset was 200 Hz data collected over 2-12 hour periods, or
up to 8,640,000 observations.  Regressions in this instance were
typically performed on a 30 to 600 second timeslice of the data; a
sketch of such a timeslice fit also follows this list.  I'm not
generally familiar with time series analysis, but I could imagine
similar cases arising there with comparable data volumes.

 - Neural nets.  A credit-card risk modeling application I am familiar
with used multiple months of transactions data from 100m accounts in
"training" the parameters for a neural net model.  AFAIK, this was a
special-purpose application.  It was not written in SAS, though portions
of supporting analysis were.

 - Data mining.  Largely a subset of the first item, univariate
statistics.  A data cube is usually a dataset with precomputed summary
statistics at several levels of aggregation.  Again, R is probably
neither necessary, appropriate, nor particularly well suited to this
sort of application.
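As a rough illustration of the chunked approach in R (the file name,
column layout, and chunk size below are all invented for the example,
and I haven't tried this on anything truly huge), running summaries can
be accumulated without ever holding the full dataset in memory:

    # Running count/sum/min/max over a large file of one numeric column,
    # read 10,000 lines at a time so the whole file never sits in memory.
    con <- file("big.dat", open = "r")   # "big.dat" is a made-up name
    n <- 0; s <- 0; lo <- Inf; hi <- -Inf
    repeat {
      chunk <- scan(con, what = double(), nlines = 10000, quiet = TRUE)
      if (length(chunk) == 0) break      # end of file
      n  <- n + length(chunk)
      s  <- s + sum(chunk)
      lo <- min(lo, chunk)
      hi <- max(hi, chunk)
    }
    close(con)
    cat("n =", n, " mean =", s / n, " min =", lo, " max =", hi, "\n")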
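For the regression case, the working pattern was simply to pull one
timeslice and fit to that subset; in R terms, with entirely invented
column names, that might look like:

    # Fit a (Gaussian) GLM to a 60-second timeslice of 200 Hz data.
    # 'eng' is assumed to be a data frame with a time column 't' (seconds)
    # plus 'response' and 'load'; all of these names are hypothetical.
    slice <- subset(eng, t >= 3600 & t < 3660)  # 60 s x 200 Hz = 12,000 rows
    fit   <- glm(response ~ load, data = slice)
    summary(fit)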



My interest at present is more one of identifying the limits of what R
can and cannot do, rather than lobbying for very large dataset support.
I also suspect that sampling and/or subsetting methods are more
appropriate for most of the statistical operations performed on very
large datasets.  Again, most processing on such data is the subsetting,
reduction, or summarization required to get a reasonably sized dataset
on which to run the "real" analysis.
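By way of a sketch (the data frame name and sample size are invented),
in R that often amounts to nothing more than:

    # Draw a 10,000-row random sample from a large data frame 'big'
    # (hypothetical) and run the "real" analysis on the sample instead.
    set.seed(1)                        # make the draw reproducible
    idx  <- sample(nrow(big), 10000)   # row indices, without replacement
    samp <- big[idx, ]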


> Your datasets appear to be (taking a mid-range value) around 1Gbyte
> in size.

Anywhere from several MB to 40 GB, in my experience.  Finance,
commercial transaction systems, engineering data, seismic studies, and
genomics are areas in which such large datasets are likely to occur.
Aggregate data sizes in the terabyte range are becoming common, though
working set sizes are usually a subset.  I'm not fully familiar with the
types of analysis performed in all of these areas.  ROOT
(http://root.cern.ch/) was designed in part to handle such very large
datasets for advanced physics analysis, though I'm not very familiar
with that tool either.
 
> > Can R handle such large datasets (tables)?  What are the general
> 
> R has a workspace size limit of 2048Mb, and on 32-bit machines this cannot
> be raised more than a tiny amount. I have only run R on a machine with
> 512Mb of RAM, and on that using objects of more than 100Mb or so slowed it
> down very considerably.

Is the 2048 MB limit an artifact of the 32-bit address space?  Is R
64-bit capable on 64-bit architectures (SPARC, Alpha, etc.)?
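My guess is that it is, since that figure is exactly what a signed
32-bit offset can span:

    2^31 bytes = 2,147,483,648 bytes = 2048 MB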
 

-- 
Karsten M. Self (kmself at ix.netcom.com)
    What part of "Gestalt" don't you understand?

SAS for Linux: http://www.netcom.com/~kmself/SAS/SAS4Linux.html
Mailing list:  "subscribe sas-linux" to
mailto:majordomo at cranfield.ac.uk    
  8:38am  up 70 days,  9:44,  1 user,  load average: 0.44, 0.24, 0.07