[R] Re: Need advice on using R with large datasets
sunnyho at ust.hk
Wed Apr 14 14:15:21 CEST 2004
Thank you guys for sharing your experiences with 64-bit R. They have been very helpful in planning my work.
I wonder whether anyone has experience using a database interface with R. Is there a "preferred" choice, or are there any "hidden catches"? In our setup, we may be using MS SQL Server or Oracle to keep the data. I know that RODBC can talk to them directly. Could there be any 32-bit/64-bit compatibility issues, say, when a 32-bit Oracle client is talking to a 64-bit R?
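For concreteness, the kind of RODBC usage I have in mind is sketched below; the DSN name, credentials, and query are placeholders, not our real configuration:

    library(RODBC)

    # open a connection through an ODBC data source name (DSN)
    ch <- odbcConnect("mydsn", uid = "user", pwd = "secret")

    # pull a result set into an ordinary in-memory data frame
    dat <- sqlQuery(ch, "SELECT id, value FROM measurements")

    # release the connection when done
    odbcClose(ch)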
Performance-wise, when used with R, how do MySQL and PostgreSQL compare to MS SQL Server and Oracle?
Any comments will be helpful. Thanks.
Sunny Ho
(Hong Kong University of Science & Technology)
> Hello everyone,
> I would like to get some advice on using R with some really large datasets.
> I'm using R 1.8.1 on RH9 Linux for research involving a lot of numerical data. The datasets total around 200Mb (as shown by memory.size). During my data manipulation, system memory usage grew to 1.5Gb, which caused a lot of swapping activity on my 1Gb PC. This is just a small-scale experiment; the full-scale one will use data 30 times as large (on a 4Gb machine). I can see that I'll need to deal with memory usage problems very soon.
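> Roughly how I've been checking usage, for what it's worth (a small sketch, nothing platform-specific):
>
>     gc()  # trigger a garbage collection and report memory in use
>     # total the sizes of all objects in the workspace, in bytes
>     sum(sapply(ls(), function(nm) object.size(get(nm))))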
> I notice that R keeps all datasets in memory at all times. I wonder whether there is any way to instruct R to push some of the less-frequently-used data tables out of main memory, so as to free up room for those that are actively in use. It would be even better if R could keep part of a table in memory only when that part is needed. Using save & load could help (see the sketch below), but I just wonder whether R is intelligent enough to do this by itself, so that I don't need to keep track of memory usage at all times.
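> By save & load I mean roughly the following (the object and file names are just examples):
>
>     # push a table I won't need for a while out to disk
>     save(big.table, file = "big_table.RData")
>     rm(big.table)
>     gc()  # return the freed memory
>
>     # ... later, when it's needed again ...
>     load("big_table.RData")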
> Another thought is to use a 64-bit machine (AMD64). I find there is a pre-compiled R for Fedora Linux on AMD64. Does anyone know whether this version of R runs as 64-bit? If so, will R be able to go beyond the 4Gb memory limit of 32-bit systems?
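> I assume one could check this from within R itself; my understanding is that the following reports 8 on a true 64-bit build and 4 on a 32-bit one, though I haven't been able to verify it on AMD64:
>
>     # pointer size in bytes: 8 suggests a 64-bit build
>     .Machine$sizeof.pointer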
> Also, from the manual, I find that the RPgSQL package (for PostgreSQL databases) supports a feature called a "proxy data frame". Does anyone have experience with this? Can a proxy data frame handle memory efficiently for very large datasets? Say, if I have a 6Gb database table defined as a proxy data frame, will R & RPgSQL be able to handle it with just 4Gb of memory?
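> From my reading of the RPgSQL manual, usage would look roughly like the sketch below; the function names (db.connect, bind.db.proxy) and the table name are taken from the docs rather than tested code, so please correct me if I have the API wrong:
>
>     library(RPgSQL)
>     db.connect(dbname = "mydb")   # open a PostgreSQL connection
>     bind.db.proxy("big.table")    # bind a proxy data frame to the table
>     # indexing the proxy should turn into a query on the backend,
>     # rather than pulling the whole table into memory
>     big.table[1:10, ]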
> Any comments will be useful. Many thanks.
> Sunny Ho
> (Hong Kong University of Science & Technology)