[R] R: machine for moderately large data

Steve Lianoglou mailinglist.honeypot at gmail.com
Fri Oct 5 19:59:45 CEST 2012


Hi,

On Fri, Oct 5, 2012 at 1:41 PM, Ista Zahn <istazahn at gmail.com> wrote:
> On Fri, Oct 5, 2012 at 12:09 PM, PIKAL Petr <petr.pikal at precheza.cz> wrote:
[snip]
>> If I compute correctly, such a big matrix (20e6*1000) needs about 160 GB just to be in memory. Are you prepared for this?
>
> This is not as outrageous as one might think -- you can get a mac pro
> with 32 gigs of memory for around $3,500


And even so, I suspect the matrices that will be worked with are
sparse, so you can get more savings there (although I'm not sure which
of the packages the OP had listed work with sparse input).
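
To put rough numbers on the dense-vs-sparse difference, here is a
back-of-the-envelope sketch (in Python, just for the arithmetic). The
8-byte doubles match Petr's 160 GB figure; the 4-byte indices, the
compressed sparse column (CSC) layout, and the 1% fill rate are my
assumptions for illustration, not something the OP stated.

```python
# Rough memory estimate for the 20e6 x 1000 matrix discussed in the thread.
# Assumptions (not from the thread): 4-byte integer indices, a compressed
# sparse column (CSC) layout, and a hypothetical 1% fill rate.

def dense_bytes(nrow, ncol, value_bytes=8):
    """Bytes needed to hold every cell as an 8-byte double."""
    return nrow * ncol * value_bytes

def csc_bytes(nrow, ncol, density, value_bytes=8, index_bytes=4):
    """Bytes for a CSC layout: one value and one row index per nonzero,
    plus one column pointer per column (and one extra)."""
    nnz = round(nrow * ncol * density)
    return nnz * (value_bytes + index_bytes) + (ncol + 1) * index_bytes

nrow, ncol = 20_000_000, 1000
print(f"dense:  {dense_bytes(nrow, ncol) / 1e9:.0f} GB")
print(f"sparse: {csc_bytes(nrow, ncol, density=0.01) / 1e9:.1f} GB")
```

Under those assumptions the dense matrix needs about 160 GB, while the
sparse one needs roughly 2.4 GB. The real saving depends entirely on
how sparse the data actually is.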

That having been said, if you don't want to sample from your data,
sometimes R isn't the best solution.

There are projects being developed to specifically deal with such big data.

For one, you might consider looking at the graphlab/graphchi stuff:

http://graphlab.org

(Graphchi is meant to process big data on a "modest" machine). If you
go to the "Toolkits" menu, you'll see they have an implementation of
kmeans++ clustering that might be suitable for your clustering
analysis. Perhaps some matrix factorizations are useful here, too:
your "market basket" data might be viewed as some type of
collaborative filtering problem, in which case their collaborative
filtering toolkit is right up your alley ;-)
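
In case kmeans++ is unfamiliar: it is ordinary k-means with a smarter
choice of starting centers. The sketch below shows only that seeding
step, in plain Python. It is my own illustration of the general idea,
not GraphLab's implementation, and the function names are made up for
the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_seeds(points, k, rng=None):
    """Pick k starting centers: the first uniformly at random, each later
    one with probability proportional to its squared distance from the
    nearest center chosen so far."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    # Squared distance from every point to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to uniform.
            new = rng.choice(points)
        else:
            new = rng.choices(points, weights=d2, k=1)[0]
        centers.append(new)
        d2 = [min(old, sq_dist(p, new)) for old, p in zip(d2, points)]
    return centers

# Hypothetical toy data: two tight groups far apart.
toy = [(0.0, 0.0), (0.0, 1.0), (100.0, 100.0), (100.0, 101.0)]
print(kmeanspp_seeds(toy, k=2))
```

The distance weighting makes it likely, though not guaranteed, that the
starting centers land in different groups, which is the whole point of
the "++" part. The usual k-means iterations then run from those seeds.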

The OP also mentioned classification trees. Perhaps rf-ace might be useful:
http://code.google.com/p/rf-ace/

From their website:

"""
RF-ACE implements both Random Forest (RF) and Gradient Boosting Tree
(GBT) algorithms, and is strongly related to ACE, originally outlined
in http://jmlr.csail.mit.edu/papers/volume10/tuv09a/tuv09a.pdf
"""

If you scroll down to the "case study" section of their main page, you
can see there is some talk about how they used this in a distributed
manner ... perhaps it is applicable in your case as well (in which
case you might be able to rig up AWS to help you).

HTH,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



