[R] Can R handle medium and large size data sets?

Philippe Grosjean phgrosjean at sciviews.org
Tue Jan 24 16:03:22 CET 2006


Hello,

It is not true that R cannot handle matrices with hundreds of thousands 
of observations... but:
- Importation (typically with read.table() and the like) "saturates" 
memory much sooner. Solution: use scan() and fill a preallocated matrix 
(see the sketch after this list), or better, use a database.

- Data frames are very convenient objects, but if you handle only numeric 
data, prefer matrices: they consume less memory. Also, avoid using 
row/column names for very large matrices/data frames (a quick 
object.size() comparison appears further below).

- Finally, of course, your mileage varies greatly depending on the 
calculations you perform on your data.
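
For the importation point above, here is a minimal sketch of reading with 
scan() and reshaping into a matrix. The file name, the dimensions and the 
column layout are assumptions for illustration only:

n.rows <- 150000
n.cols <- 20
# scan() reads straight into a numeric vector, without the per-column
# type guessing and conversion overhead of read.table()
x <- scan("big.txt", what = double(), n = n.rows * n.cols, quiet = TRUE)
# reshape the scanned vector into the matrix in one step; byrow = TRUE
# because the (hypothetical) file "big.txt" stores one observation per line
a <- matrix(x, nrow = n.rows, ncol = n.cols, byrow = TRUE)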

In general, the fairly widespread idea that R cannot handle large 
datasets comes from a combination of read.table(), data frames and 
non-optimized code.
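
To see the memory point about matrices versus data frames for yourself, 
here is a quick, purely illustrative check (exact sizes depend on the R 
version and platform):

a <- matrix(runif(150000 * 20), ncol = 20)   # one contiguous numeric block
d <- as.data.frame(a)                        # a list of 20 numeric columns
object.size(a)
object.size(d)
# character row names are where a large data frame really starts to pay
rownames(d) <- paste("obs", 1:nrow(d), sep = "")
object.size(d)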

As an example, I can create a matrix of 150 000 observations (you don't 
tell us how many variables, so I took 20 columns) filled with random 
numbers, and calculate the mean of each variable very easily. Here it is:

 > gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells 168994  4.6     350000  9.4   350000  9.4
Vcells  62415  0.5     786432  6.0   290343  2.3
 > system.time(a <- matrix(runif(150000 * 20), ncol = 20))
[1] 0.48 0.05 0.55   NA   NA
 > # Just a little more than half a second to create a table of
 > # 3 million entries filled with random numbers (P IV, 3 GHz, Win XP)
 > dim(a)
[1] 150000     20

 > system.time(print(colMeans(a)))
  [1] 0.4998859 0.5004760 0.4994155 0.5000711 0.5005029
  [6] 0.4999672 0.5003233 0.5000419 0.4997827 0.5004858
[11] 0.5004905 0.4993428 0.4991187 0.5000143 0.5016212
[16] 0.4988943 0.4990586 0.5009718 0.4997235 0.5001220
[1] 0.03 0.00 0.03   NA   NA
 > # 30 milliseconds to calculate the mean of all 20
 > # variables over 150 000 observations

 > gc()
           used (Mb) gc trigger (Mb) max used (Mb)
Ncells  169514  4.6     350000  9.4   350000  9.4
Vcells 3062785 23.4    9317558 71.1  9062793 69.2
 > # Less than 30 Mb used (with a peak at 80 Mb)

Isn't that manageable?
Best,

Philippe Grosjean

Gueorgui Kolev wrote:
> Dear R experts,
> 
> Is it true that R generally cannot handle medium-sized data sets (a
> couple of hundred thousand observations) and therefore large
> data sets (a couple of million observations)?
> 
> I googled and found lots of questions regarding this issue, but
> curiously there were no straightforward answers about what can be done
> to make R capable of handling such data.
> 
> Is there something inherent in the structure of R that makes it
> impossible to work with, say, 100 000 observations or more? If so, is
> there any hope that R can be fixed in the future?
> 
> My experience is rather limited---I tried to load a Stata data set of
> about 150 000 observations (which Stata handles instantly) using the
> library "foreign". After half an hour R was still "thinking", so I
> stopped trying.
> 
> Thank you in advance,
> 
> Gueorgui Kolev
> 
> Department of Economics and Business
> Universitat Pompeu Fabra