[R] Question about randomForest

Sun Nov 27 10:26:59 CET 2011

I am pretty sure that when each tree is fitted, the error rate for tree i is its performance on the data that was not used to fit the ith tree (the out-of-bag, or OOB, data). In this way cross-validation is performed for each tree, but I do not think that all the trees fitted before it are involved in computing that error. The idea (I think) is that if enough trees are fitted to randomly selected data, overfitting will die out once you use the voting system, and the out-of-sample error will approximate the average of the out-of-sample performance of the individual trees.

When plotting the error by number of trees, for example, I think the function plots the average of the per-tree OOB errors up to tree i. I suspect this because, if you were handed a forest along with the data used to create it, there would be no meaningful way to tell how well it would perform out of sample without getting at the out-of-sample error for each tree. This seems the most 'honest' option, since using all trees up to tree i could be misleading (and outlier-prone) because of uneven sampling with a small number of replications, not to mention that earlier trees would get more votes on the error.
 Please correct me if I am wrong, 
         Hopefully a specialist will come along and clear this up,
          Ken Hutchison 
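
The competing readings in this thread can be checked empirically. The sketch below is not from the original posts: it assumes the randomForest package is installed and uses the built-in iris data purely as a stand-in, and all variable names are illustrative. It compares the stored OOB predictions with predictions made by passing the training data back through predict(), and sets the final err.rate entry next to the average of the individual trees' OOB errors.

```r
library(randomForest)

set.seed(42)
x <- iris[, 1:4]
y <- iris$Species

# keep.inbag = TRUE records which observations each tree was fitted on
rf <- randomForest(x, y, ntree = 200, keep.inbag = TRUE)

# Stored predictions: each observation is voted on only by trees
# for which it was out-of-bag
oob_err <- mean(rf$predicted != y)

# Last row of err.rate, OOB column (error with all trees in the forest)
final_oob <- rf$err.rate[rf$ntree, "OOB"]

# Passing the training data back in: every tree votes on every
# observation, including the trees that were fitted on it
resub_err <- mean(predict(rf, newdata = x) != y)

# OOB error of each tree on its own, for comparison with the ensemble
all_trees <- predict(rf, newdata = x, predict.all = TRUE)$individual
oob_mask <- rf$inbag == 0
per_tree_oob_err <- sapply(seq_len(rf$ntree), function(i) {
  mean(all_trees[oob_mask[, i], i] != as.character(y[oob_mask[, i]]))
})

print(c(oob = oob_err, err.rate = unname(final_oob),
        resubstitution = resub_err,
        mean_per_tree = mean(per_tree_oob_err)))
```

Calling predict(rf) with no newdata returns the stored OOB predictions, so it is only predict(rf, newdata = x) that scores each observation with trees that saw it during fitting. That would account for the much lower error on the training data reported in the original question.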

On Nov 27, 2554 BE, at 3:21 AM, Matthew Francis <mattjamesfrancis at gmail.com> wrote:

> Thanks for the help. Let me explain in more detail how I think that
> randomForest works so that you (or others) can more easily see the
> error of my ways.
> 
> The function first takes a random sample of the data, of the size
> specified by the sampsize argument. With this it fully grows a tree
> resulting in a horribly over-fitted classifier for the random sub-set.
> It then repeats this again with a different sample to generate the
> next tree and so on.
> 
> Now, my understanding is that after each tree is constructed, a test
> prediction for the *whole* training data set is made by combining the
> results of all trees (so e.g. for classification the majority votes of
> all individual tree predictions). From this an error rate is
> determined (applicable to the ensemble applied to the training data)
> and reported in the err.rate member of the returned randomForest
> object. If you look at the error rate (or plot it using the default
> plot method) you see that it starts out very high when only 1 or a few
> over-fitted trees are contributing, but once the forest gets larger
> the error rate drops since the ensemble is doing its job. It doesn't
> make sense to me that this error rate is for a sub-set of the data,
> since the sub-set in question changes at each step (i.e. at each tree
> construction)?
> 
> By running cross-validation tests, splitting the data I have into
> 'training' and 'test' sets, I find that the error rates on the test
> sets are comparable to the error rate obtained from the predicted
> member of the returned randomForest object. So that does seem to be
> the 'correct' error.
> 
> By my understanding the error reported for the ith tree is that
> obtained using all trees up to and including the ith tree to make an
> ensemble prediction. Therefore the final error reported should be the
> same as that obtained using the predict.randomForest function on the
> training set, because by my understanding that should return an
> identical result to that used to generate the error rate for the final
> tree constructed??
> 
> Sorry that is a bit long winded, but I hope someone can point out
> where I'm going wrong and set me straight.
> 
> Thanks!
> 
> On Sun, Nov 27, 2011 at 11:44 AM, Weidong Gu <anopheles123 at gmail.com> wrote:
>> Hi Matthew,
>> 
>> The error rate reported by randomForest is the prediction error based
>> on out-of-bag (OOB) data. It therefore differs from the prediction
>> error on the original data, since each tree was built using a
>> bootstrap sample (containing about 63% of the distinct original
>> observations), and the OOB error rate is likely to be higher than the
>> prediction error on the original data, as you observed.
>> 
>> Weidong
>> 
>> On Sat, Nov 26, 2011 at 3:02 PM, Matthew Francis
>> <mattjamesfrancis at gmail.com> wrote:
>>> I've been using the R package randomForest but there is an aspect I
>>> cannot work out the meaning of. After calling the randomForest
>>> function, the returned object contains an element called predicted,
>>> which is the prediction obtained using all the trees (at least that's
>>> my understanding). I've checked that this prediction set has the error
>>> rate as reported by err.rate.
>>> 
>>> However, if I send the training data back into the
>>> predict.randomForest function I find I get a different result to the
>>> stored set of predictions. This is true for both classification and
>>> regression. I find the predictions obtained this way also have a much
>>> lower error rate and perform very well (suspiciously well...) on
>>> measures such as AUC.
>>> 
>>> My understanding is that the two predictions above should be the same.
>>> Since they are not, I must not be understanding something properly.
>>> Any ideas what's going on?
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>> 
>> 
> 