[R] Random Forest & Cross Validation

Max Kuhn mxkuhn at gmail.com
Sun Feb 20 20:48:40 CET 2011


> I am using the randomForest package for a prediction task on GWAS data. I
> first split the data into training and test sets (70% vs. 30%), then used
> the training set to grow the trees (ntree = 100000). The OOB error on the
> training set looks good (<10%), but performance on the test set is poor,
> with an AUC of only about 50%.

Did you do any feature selection in the training set? If so, you also
need to include that step in the cross-validation to get realistic
performance estimates (see Ambroise and McLachlan. Selection bias in
gene extraction on the basis of microarray gene-expression data.
Proceedings of the National Academy of Sciences (2002) vol. 99 (10)
pp. 6562-6566).
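To see the size of this selection bias, here is a minimal sketch using only base R plus the recommended `class` package (the data, filter, and classifier are all illustrative, not from the original post): with pure-noise predictors, filtering on the full data before cross-validation gives an optimistically low error, while filtering inside each fold gives the honest answer of roughly 50% error.

```r
library(class)  # for knn()

set.seed(1)
n <- 60; p <- 500
x <- matrix(rnorm(n * p), n, p)          # pure noise predictors
y <- factor(rep(c("a", "b"), each = n / 2))

## univariate filter: two-sample t-test p-value for each predictor
filter_pvals <- function(x, y) {
  apply(x, 2, function(col) t.test(col ~ y)$p.value)
}

## 5-fold CV error of 1-NN on the 10 "best" predictors
cv_error <- function(x, y, select_inside) {
  folds <- sample(rep(1:5, length.out = nrow(x)))
  if (!select_inside)
    keep <- order(filter_pvals(x, y))[1:10]    # WRONG: filter sees all rows
  errs <- sapply(1:5, function(k) {
    tr <- folds != k
    if (select_inside)
      keep <- order(filter_pvals(x[tr, ], y[tr]))[1:10]  # filter inside the fold
    pred <- knn(x[tr, keep], x[!tr, keep], y[tr])
    mean(pred != y[!tr])
  })
  mean(errs)
}

err_outside <- cv_error(x, y, select_inside = FALSE)  # optimistic
err_inside  <- cv_error(x, y, select_inside = TRUE)   # near 0.5, as it should be
```

Since the predictors carry no signal, any apparent accuracy in the first estimate is entirely an artifact of letting the filter see the held-out rows.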

In the caret package, train() can be used to get cross-validation
estimates for RF and the sbf() function (for selection by filter) can
be used to include simple univariate filters in the CV procedure.
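A hedged sketch of those two caret calls on simulated data (the data and control settings here are illustrative; only `train()`, `sbf()`, `rfSBF`, and the control functions are from caret itself):

```r
library(caret)

set.seed(2)
predictors <- matrix(rnorm(50 * 20), 50, 20)
colnames(predictors) <- paste0("x", 1:20)
outcome <- factor(rep(c("a", "b"), each = 25))

## cross-validated performance estimates for a random forest
rf_fit <- train(x = predictors, y = outcome,
                method = "rf",
                trControl = trainControl(method = "cv", number = 10))

## the same, but with a univariate filter applied inside each resample
rf_sbf <- sbf(x = predictors, y = outcome,
              sbfControl = sbfControl(functions = rfSBF,
                                      method = "cv", number = 10))
```

Because `sbf()` re-runs the filter within every resample, the resulting performance estimates account for the selection step rather than being biased by it.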

> Although some people have said that no cross-validation is necessary for
> RF, I still felt unsafe and thought a test set was important. I felt
> really frustrated with the results.

CV is needed when you want an assessment of performance on a test set.
In this sense, RF is like any other method.

-- 

Max
