[R] Random Forest & Cross Validation

mxkuhn mxkuhn at gmail.com
Wed Feb 23 01:17:19 CET 2011


If you want honest estimates of accuracy, you should repeat the feature selection within the resampling (not on the test set). You will get different lists each time, but that's the point. Right now you are not capturing that uncertainty, which is why the OOB and test set results differ so much.

The list you get in the original training set is still the real list. The resampling results help you understand how much you might be overfitting the *variables*.
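In code, the idea might look something like the sketch below. This is just an illustration of the resampling loop, not your exact setup: x (a matrix of SNPs), y (the outcome factor), the number of folds, and the cutoff of 1000 are all placeholders.

library(randomForest)

set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(x)))
acc <- numeric(k)

for (i in 1:k) {
  ## Redo the selection from scratch inside each fold,
  ## using only the training portion of the data
  train_x <- x[folds != i, ]
  train_y <- y[folds != i]
  rf0 <- randomForest(train_x, train_y, importance = TRUE)
  top <- order(importance(rf0, type = 1), decreasing = TRUE)[1:1000]

  ## Refit on the selected SNPs and score the held-out fold
  rf1 <- randomForest(train_x[, top], train_y)
  pred <- predict(rf1, x[folds == i, top])
  acc[i] <- mean(pred == y[folds == i])
}
mean(acc)  ## accuracy estimate that includes the selection variability

The top-1000 lists will differ from fold to fold; the spread in acc is exactly the uncertainty that a single training-set list hides.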

Max

On Feb 22, 2011, at 4:39 PM, ronzhao <yzhaohsph at gmail.com> wrote:

> 
> Thanks, Max.
> 
> Yes, I did some feature selection in the training set. Basically, I
> selected the top 1000 SNPs based on OOB error and grew the forest on the
> training set, then used the test set to validate the forest.
> 
> But if I do the same thing in the test set, the top SNPs would be
> different from those in the training set. That may be difficult to
> interpret.


