[R] problems with large data II

Liaw, Andy andy_liaw at merck.com
Fri Jan 9 15:53:58 CET 2004


If you have a large enough machine, you'll be able to run randomForest on
data of that size (we have done that regularly).  One thing that many people
don't seem to realize is that the "formula interface" has significant
overhead.  For large data sets, try running randomForest without the
formula, passing the predictors and the response directly as x and y
instead.  Other tips are: If you don't need to predict future data, set
keep.forest to FALSE.  Storing the forest takes lots of memory.  If you
already have the test set data, give it to randomForest along with the
training data, instead of using predict() afterward.  If you have a
classification problem, try using the sampsize option to reduce the number
of cases used to grow each tree.
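
A minimal sketch of the tips above put together, in case it helps.  The
data here are small made-up stand-ins (not your 5000 x 2000 set), and the
particular values of ntree and sampsize are only assumptions for
illustration, not recommendations:

```r
library(randomForest)

set.seed(1)
# Hypothetical stand-in data: 200 cases, 20 integer predictors, 2 classes.
x <- matrix(sample(1:5, 200 * 20, replace = TRUE), nrow = 200,
            dimnames = list(NULL, paste0("v", 1:20)))
y <- factor(sample(c("a", "b"), 200, replace = TRUE))
train <- 1:150
test  <- 151:200

fit <- randomForest(x = x[train, ], y = y[train],        # x/y, no formula
                    xtest = x[test, ], ytest = y[test],  # test set given up front
                    keep.forest = FALSE,                 # do not store the trees
                    sampsize = 50,                       # cases drawn for each tree
                    ntree = 100)

# Test-set predictions come back with the fit, so predict() is not needed.
pred <- fit$test$predicted
table(pred, y[test])
```

Since keep.forest is FALSE, the returned object has no forest component,
so it cannot be used with predict() later; that is the trade-off for the
memory saved.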

As to the problem of having categorical predictors with more than 32
categories:  Prof. Breiman's new version can deal with categorical
predictors with an (IMHO) obscene number of categories.  However, I have
chosen to give that a very low priority for adding to the R package.  The
reason is that, IMHO, such variables need some massaging
(collapsing/merging/whatever) before they will be somewhat meaningful in a
model anyway.  (And personally I have no need for such a feature.)
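
One possible way to do that kind of collapsing, as a rough sketch in base
R: keep the most frequent categories and lump the rest into a single
level.  collapse_levels() is a hypothetical helper written for this
example (it is not part of randomForest), and lumping by frequency is only
one assumption about what a sensible merge looks like; merging by meaning
is usually better when you know the variable.

```r
# Hypothetical helper: keep the (max_levels - 1) most frequent levels of a
# factor and merge all remaining levels into one catch-all level.
# Assumes the catch-all label is not already one of the kept levels.
collapse_levels <- function(f, max_levels = 32, other = "Other") {
  f <- factor(f)
  if (nlevels(f) <= max_levels) return(f)
  counts <- sort(table(f), decreasing = TRUE)
  keep <- names(counts)[seq_len(max_levels - 1)]
  out <- as.character(f)
  # Leave missing values missing; relabel only the rare, non-missing levels.
  out[!is.na(out) & !(out %in% keep)] <- other
  factor(out, levels = c(keep, other))
}

set.seed(1)
# Made-up factor with 40 categories, as a stand-in for a real predictor.
f <- factor(sample(sprintf("c%02d", 1:40), 5000, replace = TRUE))
g <- collapse_levels(f)
nlevels(g)
```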

HTH,
Andy

> From: PaTa PaTaS
> 
> Thank you all for your help. The problem is not only with 
> reading the data (5000 cases times 2000 integer variables, 
> imported either from SPSS or TXT file) into my R 1.8.0 but 
> also with the procedure I would like to use = "randomForest" 
> from library "randomForest". It is not possible to run it 
> with such a data set (because of the insufficient memory 
> exception). Moreover, my data has factors with more than 32 
> classes, which causes another error.
> 
> Could you suggest any solution for my problem? Thank you a lot. 



