[R] random forest mtry and mse

Kuhn, Max Max.Kuhn at pfizer.com
Fri Oct 12 15:10:23 CEST 2007


Dave,

> I have been using random forest on a data set with 226 sites and 36
> explanatory variables (continuous and categorical). When I use
> "tune.randomForest" to determine the best value to use in "mtry" there
> is a fairly consistent and steady decrease in MSE, with the optimum of
> "mtry" usually equal to 1. Why would that occur, and what does it
> signify? What I would assume is that most of my explanatory variables
> have little to no explanatory power. Does that sound about right?

I'm not sure that it means anything (I've seen this happen too). 

Essentially, this would indicate that, for this particular dataset, the
random forest model does best when the trees are as uncorrelated as
possible. If it were to "like" mtry = # predictors instead, that would
indicate that bagging was the optimal model. The no free lunch theorem
applies to the possible random forest sub-models too: without
information about the specifics of the problem at hand, there is no
reason to believe that any one of them is uniformly best across
problems. Did you have any subject-specific reason to think larger
values of mtry would be optimal?

What was the difference in performance across all of the candidate
values of mtry? I don't usually see a huge effect from altering mtry
(a change in accuracy or R-squared of 5% or less for classification and
regression models, respectively) relative to the variation in
resampling.
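If you are using tune.randomForest() from e1071, the performances
component of the returned tune object already has the cross-validated
error and its dispersion for each candidate mtry, so you can judge the
size of the mtry effect against the resampling noise. Something along
these lines (again with the made-up x and y from above, and my guess at
a reasonable grid of mtry values):

library(e1071)

## By default tune() uses 10-fold cross-validation; 'dispersion' is the
## standard deviation of the cross-validated error, which gives a sense
## of the resampling variation.
obj <- tune.randomForest(x, y, mtry = c(1, 2, 4, 9, 18, 36))
obj$performances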

Max


