[R] Large Datasets

David Winsemius dwinsemius at comcast.net
Fri Feb 11 18:04:32 CET 2011


On Feb 11, 2011, at 7:51 AM, John Filben wrote:

> I have recently been using R - more specifically the GUI packages
> Rattle and Rcmdr.
>
> I like these products a lot and want to use them for some projects - the
> problem that I run into is when I start to try to run large datasets
> through them.  The datasets are 10-15 million records and usually have
> 15-30 fields (both numerical and categorical).

You could instead just buy memory. Back of the envelope: 15 million rows  
by 30 double-precision columns is about 3.6 GB per copy, and R routinely  
makes several copies while fitting, so 32 GB ought to be sufficient for  
descriptives and regression. You might even get away with 24 GB.

>
> I saw that there were some packages that could deal with large datasets
> in R - bigmemory, ff, ffdf, biganalytics.  My problem is that I am not
> much of a coder (which is the reason I use the above-mentioned GUIs).
> These GUIs do show the executable R code in the background - my thought
> was to run a small sample through the GUI, copy the code, and then
> incorporate some of the large-data packages mentioned above - has anyone
> ever tried to do this, and would you have working examples?  In terms of
> what I am trying to do to the data - really simple stuff - descriptive
> statistics,

Should be fine here.
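
For what it's worth, a minimal sketch of that workflow with bigmemory and  
biganalytics (the filenames and backing-file names below are placeholders,  
and a big.matrix holds a single numeric type, so categorical fields would  
first need numeric coding before this applies):

library(bigmemory)
library(biganalytics)

## File-backed big.matrix: the data live on disk, not in RAM,
## and the backing files persist between sessions
x <- read.big.matrix("mydata.csv", header = TRUE, type = "double",
                     backingfile = "mydata.bin",
                     descriptorfile = "mydata.desc")

## Column-wise descriptive statistics from biganalytics
colmean(x, na.rm = TRUE)
colsd(x, na.rm = TRUE)
colrange(x, na.rm = TRUE)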

> k-means clustering, and possibly some decision trees.

Not sure how well those scale to tasks as large as what you propose,  
especially since you don't mention which packages or functions you'd use.  
Not sure that they don't scale, either.
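
For k-means specifically, biganalytics does provide bigkmeans(), which  
operates directly on a big.matrix. For decision trees I know of no drop-in  
big-data equivalent in these packages, so one compromise - an assumption,  
not a tested recipe - is to fit rpart() on a random subsample that fits in  
RAM. A sketch reusing the x from above (the "V1" response is hypothetical):

fit <- bigkmeans(x, centers = 5, iter.max = 20, nstart = 3)
fit$centers         ## cluster centers
table(fit$cluster)  ## cluster sizes

library(rpart)
idx  <- sample(nrow(x), 1e5)        ## subsample that fits in memory
samp <- as.data.frame(x[idx, ])     ## big.matrix subset returns a plain matrix
tree <- rpart(V1 ~ ., data = samp)  ## hypothetical response column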

-- 
David.
>   Any help would be greatly appreciated.
>
> Thank you - John
> John Filben
-- 

David Winsemius, MD
West Hartford, CT


