[R] Half Million features Selection (Random Forest)

Prof Brian Ripley ripley at stats.ox.ac.uk
Sat Jul 3 07:58:15 CEST 2004

How many cases do you have?  Since you apparently expect the dataset to be 
usable in R, you only have room to store a dataset with 200 cases or so 
(let alone space to analyse it).
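
A back-of-envelope version of that estimate, as a sketch only. It assumes a dense matrix, 8 bytes per double or 4 bytes per integer, and the 512 MB of RAM quoted below. It ignores R's own overhead and any copies made during analysis, so the real limit is lower. The variable names are illustrative.

```r
# Rough storage estimate for a dense design matrix (assumptions, not measurements)
n_features    <- 5e5         # half a million binary predictors
bytes_double  <- 8           # storage per element as R numeric
bytes_integer <- 4           # storage per element as R integer
ram_bytes     <- 512 * 2^20  # 512 MB, as reported by the poster

# Memory needed for one case (one row), in MB
mb_per_case_double  <- n_features * bytes_double  / 2^20
mb_per_case_integer <- n_features * bytes_integer / 2^20

# Upper bound on the number of cases that fit in RAM at all
max_cases_double  <- floor(ram_bytes / (n_features * bytes_double))
max_cases_integer <- floor(ram_bytes / (n_features * bytes_integer))

print(c(double = max_cases_double, integer = max_cases_integer))
# double: 134, integer: 268
```

Either way the bound is in the low hundreds of cases, before any modelling is attempted.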

Even selecting *one* variable is statistically nonsensical with fewer than
millions of cases (as otherwise the possibility of chance agreement of
predictors is too high -- and I don't know enough about your problem to 
do even a rough calculation with any confidence).
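
A small simulation can illustrate one reading of the chance-agreement point. This is a sketch under stated assumptions, not a calculation for the actual problem: the predictors are independent Bernoulli(0.5) columns (a real sparse design would differ), the response is pure noise, and the sizes n and p are scaled down so that it runs quickly.

```r
# Predictors carry no information about y, yet the best one looks "good"
set.seed(1)
n <- 50      # number of cases (assumed, for illustration)
p <- 20000   # number of binary predictors (far fewer than 500,000)

X <- matrix(rbinom(n * p, size = 1, prob = 0.5), nrow = n)  # 0/1 design
y <- rnorm(n)                                               # pure-noise response

# Correlation of each predictor with y; cor() returns a p x 1 matrix here
r <- drop(cor(X, y))

max_abs_r <- max(abs(r))  # strength of the "best" predictor found by chance
best_col  <- which.max(abs(r))
print(max_abs_r)
```

With so many candidates and so few cases, the largest absolute correlation is substantial even though no predictor is related to the response, and it grows further as p increases.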

On Fri, 2 Jul 2004, daisy wrote:

> I have about half a million binary features, and would like to find a
> model to estimate the continuous response. According to the inference, I
> can express predictors and response by a linear model. (i.e. Design matrix:
> large sparse matrix with 0/1. Response: continuous number) Since it is
> not a classification problem, someone suggested that I try random forest
> in R. However, the randomForest help page points out "For large
> data sets, especially those with large number of variables, calling
> 'randomForest' via the formula interface is not advised: There may be
> too much overhead in handling the formula." I also tried it on 300
> variables, and R either gave me an error message or did not respond. (OS:
> Windows XP; R: 1.9.0; RAM: 512MB) Is there any way to implement random
> forest on this big dataset? Any suggestion is welcome! Many thanks!

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
