[R] Question about randomForest

Weidong Gu anopheles123 at gmail.com
Sun Nov 27 16:56:02 CET 2011


Matthew,

Your interpretation that the error rates are calculated on the
training data is incorrect.

From Andy Liaw's help file: "err.rate: (classification only) vector
error rates of the prediction on the input data, the i-th element
being the (OOB) error rate for all trees up to the i-th."

My understanding is that the i-th error rate is calculated by running
each case through the trees among the first i for which that case is
out-of-bag and taking the majority vote. (After a few trees, every
case in the original data will have been OOB for some trees.) The
plot of a randomForest object shows the OOB error dropping quickly
once the ensemble becomes sizable; the increased variation among
trees is doing its work. If the errors were based on the training
set, you would not see such a drop, since each individual tree
overfits the training set.
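
For example, with the built-in iris data (any small classification
data set would do; exact numbers vary with the seed):

library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

## one row per tree: the "OOB" column is the cumulative out-of-bag
## error rate using all trees up to that row
head(rf$err.rate)
rf$err.rate[rf$ntree, "OOB"]   # final OOB error rate

plot(rf)   # shows the early drop as the ensemble grows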

Weidong


On Sun, Nov 27, 2011 at 3:21 AM, Matthew Francis
<mattjamesfrancis at gmail.com> wrote:
> Thanks for the help. Let me explain in more detail how I think that
> randomForest works so that you (or others) can more easily see the
> error of my ways.
>
> The function first takes a random sample of the data, of the size
> specified by the sampsize argument. With this it fully grows a tree,
> resulting in a horribly over-fitted classifier for that random
> sub-set. It then repeats this with a different sample to generate the
> next tree, and so on.
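>
> In code terms, I mean a call like this (iris just as a stand-in for
> my data, with sampsize picked arbitrarily):
>
>   library(randomForest)
>   set.seed(1)
>   fit <- randomForest(Species ~ ., data = iris,
>                       ntree = 500, sampsize = 100)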
>
> Now, my understanding is that after each tree is constructed, a test
> prediction for the *whole* training data set is made by combining the
> results of all trees (e.g. for classification, the majority vote of
> all individual tree predictions). From this an error rate is
> determined (applicable to the ensemble applied to the training data)
> and reported in the err.rate member of the returned randomForest
> object. If you look at the error rate (or plot it using the default
> plot method) you see that it starts out very high when only 1 or a few
> over-fitted trees are contributing, but once the forest gets larger
> the error rate drops since the ensemble is doing its job. It doesn't
> make sense to me that this error rate could be for a sub-set of the
> data, since the sub-set in question changes at each step (i.e. with
> each tree constructed).
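>
> In code, my assumed err.rate is roughly this (a sketch of my mental
> model, using predict.all to get the individual tree predictions):
>
>   ind <- predict(fit, iris, predict.all = TRUE)$individual
>
>   # majority vote of trees 1..i over the WHOLE training set
>   my.err <- sapply(seq_len(fit$ntree), function(i) {
>     vote <- apply(ind[, 1:i, drop = FALSE], 1,
>                   function(v) names(which.max(table(v))))
>     mean(vote != iris$Species)
>   })
>
>   plot(fit$err.rate[, "OOB"], type = "l")
>   lines(my.err, col = "red")   # I expected these two curves to match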
>
> By doing cross-validation tests, making 'training' and 'test' sets
> from the data I have, I do find that the error rates on the test sets
> are comparable to the error rate obtained from the predicted member
> of the returned randomForest object. So that does seem to be the
> 'correct' error.
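>
> Something like this (iris again as a stand-in):
>
>   set.seed(2)
>   idx  <- sample(nrow(iris), 100)
>   fit2 <- randomForest(Species ~ ., data = iris[idx, ])
>   mean(predict(fit2, iris[-idx, ]) != iris$Species[-idx])  # test error
>   fit2$err.rate[fit2$ntree, "OOB"]                         # similar OOB error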
>
> By my understanding, the error reported for the i-th tree is that
> obtained using all trees up to and including the i-th to make an
> ensemble prediction. Therefore the final error reported should be the
> same as that obtained by running predict.randomForest on the
> training set, since that should return a result identical to the one
> used to generate the error rate for the final tree constructed?
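>
> Concretely, I expected these two numbers to agree, and they don't:
>
>   fit$err.rate[fit$ntree, "OOB"]             # final reported error
>   mean(predict(fit, iris) != iris$Species)   # much lower, near zero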
>
> Sorry that is a bit long-winded, but I hope someone can point out
> where I'm going wrong and set me straight.
>
> Thanks!
>
> On Sun, Nov 27, 2011 at 11:44 AM, Weidong Gu <anopheles123 at gmail.com> wrote:
>> Hi Matthew,
>>
>> The error rate reported by randomForest is the prediction error based
>> on out-of-bag (OOB) data. It is therefore different from the
>> prediction error on the original data, since each tree was built on a
>> bootstrap sample (containing about 63% of the unique cases in the
>> original data), and the OOB error rate is likely to be higher than
>> the prediction error on the original data, as you observed.
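>>
>> The 63% figure is the expected fraction of unique cases in a
>> bootstrap sample, 1 - (1 - 1/n)^n; a quick simulation confirms it:
>>
>>   n <- 150   # any sample size; the limit is 1 - 1/e, about 0.632
>>   mean(replicate(1000,
>>        length(unique(sample(n, n, replace = TRUE))) / n))
>>   # ~0.63, so roughly a third of the cases are out-of-bag per tree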
>>
>> Weidong
>>
>> On Sat, Nov 26, 2011 at 3:02 PM, Matthew Francis
>> <mattjamesfrancis at gmail.com> wrote:
>>> I've been using the R package randomForest but there is an aspect I
>>> cannot work out the meaning of. After calling the randomForest
>>> function, the returned object contains an element called predicted,
>>> which is the prediction obtained using all the trees (at least that's
>>> my understanding). I've checked that this set of predictions has the
>>> same error rate as reported by err.rate.
>>>
>>> However, if I send the training data back into the
>>> predict.randomForest function, I get a different result from the
>>> stored set of predictions. This is true for both classification and
>>> regression. I find the predictions obtained this way also have a much
>>> lower error rate and perform very well (suspiciously well...) on
>>> measures such as AUC.
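>>>
>>> To make the check concrete (iris standing in for my data):
>>>
>>>   library(randomForest)
>>>   set.seed(1)
>>>   rf <- randomForest(Species ~ ., data = iris)
>>>   all.equal(rf$predicted, predict(rf))       # TRUE: the same thing
>>>   mean(rf$predicted != predict(rf, iris))    # > 0: these differ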
>>>
>>> My understanding is that the two predictions above should be the same.
>>> Since they are not, I must be not understanding something properly.
>>> Any ideas what's going on?
>>>