[R] Question about the use of large datasets in R

Thomas Lumley tlumley at u.washington.edu
Thu Mar 5 08:52:29 CET 2009


On Wed, 4 Mar 2009, Vadlamani, Satish {FLNA} wrote:

> Hi:
> Sorry if this is a double post. I posted the same thing this morning and did not see it.
>
> I just started using R and am asking the following questions so that I can plan for the future when I may have to analyze volume data.
>
> 1) What are the limitations of R when it comes to handling large datasets?
> Say for example something like 200M rows and 15 columns data frame (between 1.5 to 2 GB in size)? Will the limitation be based on the specifications of
> the hardware or R itself?

It depends a lot on what you want to do.  The default situation in R is that all the data are loaded into memory, and many operations make temporary copies of the objects they work on, so the rule of thumb is that you want data sets no larger than 1/3 of memory. If you have, say, a system with 8 GB of memory and a 64-bit version of R, you should be OK.
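As a rough check of that rule of thumb, the in-memory size of an all-numeric data frame can be estimated from its dimensions. The sketch below assumes every column is stored as an 8-byte double and ignores per-object overhead; the helper names `estimate_gb` and `fits_in_memory` are made up for illustration and are not part of R.

```r
# Rough in-memory size of an all-numeric table, in GB (10^9 bytes).
# Assumption: 8 bytes per value (R doubles); integer columns use 4 bytes.
estimate_gb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 1e9
}

# Rule of thumb from the text: data should be at most 1/3 of memory.
fits_in_memory <- function(data_gb, ram_gb) {
  data_gb <= ram_gb / 3
}

size_gb <- estimate_gb(200e6, 15)    # 200M rows, 15 numeric columns
size_gb
fits_in_memory(size_gb, ram_gb = 8)
fits_in_memory(2, ram_gb = 8)        # a 2 GB data set on an 8 GB machine
```

If all 15 columns really are doubles, the loaded object could be considerably larger than the file size quoted in the question, so it is worth measuring a loaded sample with object.size() rather than relying on the size on disk.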

It is often possible to work with much larger data sets than this; you just need to arrange for the whole thing not to be loaded simultaneously.  The right strategy depends on the problem.

For example, linear and generalized linear models on large data sets can be fitted with the biglm package.  The various database interface packages and the packages for netCDF and HDF5 allow subsets of a data set to be loaded easily. Packages such as bigmemory and ff allow at least some operations to be carried out on file-backed data objects.
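To illustrate the idea of not loading everything at once, here is a minimal base-R sketch that fits a linear model by reading a CSV file in chunks and accumulating the cross products X'X and X'y. It does not use biglm or any of the other packages mentioned above, and it is not necessarily how they work internally; solving the normal equations this way is also less numerically stable than a QR-based fit. The synthetic data, the temporary file, the column names and the chunk size are all assumptions made for the example.

```r
# Write a small CSV to stand in for a file too large to load at once.
set.seed(1)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 0.5 * d$x2 + rnorm(n, sd = 0.1)
path <- tempfile(fileext = ".csv")
write.csv(d, path, row.names = FALSE)

# Read the file in fixed-size chunks through an open connection,
# accumulating X'X and X'y so only one chunk is in memory at a time.
con <- file(path, open = "r")
col_names <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])
xtx <- matrix(0, 3, 3)
xty <- matrix(0, 3, 1)
chunk_size <- 250
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break          # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
  X <- cbind(1, chunk$x1, chunk$x2)      # design matrix with intercept
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, chunk$y)
}
close(con)

# Solve the normal equations for the coefficients.
beta <- drop(solve(xtx, xty))
names(beta) <- c("(Intercept)", "x1", "x2")
beta
```

On data this small the result can be compared with coef(lm(y ~ x1 + x2, data = d)); for real work a dedicated package is the better choice.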


> 2) Is R 32 bit compiled or 64 bit (on say Windows and AIX)

On AIX, 64 bit. On Windows, currently only 32-bit although there is work towards a 64-bit version.


> 4) Should I be looking at SAS also only for this reason (we do have SAS
> in-house but the problem is that I am still not sure what we have license for,
> etc.)

I would guess that it would be cheaper to buy hardware on which the problem can be solved in R than to buy a SAS license (last time I looked, suitable rack-mount Linux boxes were under USD 3000). If you already have SAS available, it would be worth looking at. For some large-data problems it will be faster or easier to use, but not for all.


      -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle



