[R] Enormous Datasets

Vadim Ogranovich vograno at evafunds.com
Thu Nov 18 22:13:12 CET 2004


R is very unlikely to handle this comfortably. The problems are:

* the data set may simply not fit into memory (see the rough estimate
below)
* reading it from the ASCII file will take forever
* any meaningful analysis of a dataset in R typically requires 5 - 10
times more memory than the dataset itself (unless you are a real
insider and know all the knobs)
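For a rough sense of scale, using the dimensions from your message and
assuming mostly numeric columns stored as 8-byte doubles (an assumption;
PUMS has many categorical fields), the raw data alone is on the order of:

  rows <- 7e6
  cols <- 50
  rows * cols * 8 / 2^30    # roughly 2.6 GB before any working copies

and the 5 - 10 times factor above applies on top of that.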


Your best strategy is probably to partition the file into meaningful
sub-categories and work with those. To save time on conversion from
ASCII, you can read each sub-file into a data frame and then save the
data frame to an .rda file using save(). Loading the .rda files later is
much faster than re-reading the ASCII.
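Roughly, for each sub-file (the file names here are placeholders):

  ## one-time, slow: read one sub-category from ASCII
  ## (specifying colClasses speeds up read.table considerably)
  x <- read.table("pums_part1.txt", header = TRUE)

  ## save the data frame in R's binary format
  save(x, file = "pums_part1.rda")

  ## in later sessions, loading is much faster than read.table()
  load("pums_part1.rda")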

Another strategy, often advocated on this list, is to put the data into
a database and draw random samples of manageable size from it. I have no
experience with this approach.
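A minimal sketch of that idea, assuming the DBI and RSQLite packages and
a made-up table name "pums" (not something I have tried myself):

  library(DBI)
  library(RSQLite)

  ## connect to an SQLite database holding the full table
  ## (the ASCII file would be loaded into it once, e.g. via dbWriteTable())
  con <- dbConnect(dbDriver("SQLite"), dbname = "pums.sqlite")

  ## pull a random sample of manageable size into R
  samp <- dbGetQuery(con,
      "SELECT * FROM pums ORDER BY RANDOM() LIMIT 100000")

  dbDisconnect(con)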

HTH,
Vadim

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Thomas 
> W Volscho
> Sent: Thursday, November 18, 2004 12:11 PM
> To: r-help at stat.math.ethz.ch
> Subject: [R] Enormous Datasets
> 
> Dear List,
> I have some projects where I use enormous datasets.  For 
> instance, the 5% PUMS microdata from the Census Bureau.  
> After deleting cases I may have a dataset with 7 million+ 
> rows and 50+ columns.  Will R handle a datafile of this size? 
>  If so, how?
> 
> Thank you in advance,
> Tom Volscho
> 
> ************************************        
> Thomas W. Volscho
> Graduate Student
> Dept. of Sociology U-2068
> University of Connecticut
> Storrs, CT 06269
> Phone: (860) 486-3882
> http://vm.uconn.edu/~twv00001
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html
>



