[R] Random forests

Wed Dec 19 10:39:17 CET 2007

On Tue, 2007-12-18 at 16:27 -0600, Naiara Pinto wrote:
> Dear all,
> 
> I would like to use a tree regression method to analyze my dataset. I
> am interested in the fact that random forests creates in-bag and
> out-of-bag datasets, but I also need an estimate of support for each
> split. That seems hard to do in random forests since each tree is
> grown using a subset of the predictor variables.
> 
> I was thinking of setting mtry = number of predictor variables,
> growing several trees, and computing the support for each node as the
> number of times that a certain predictor variable was chosen for that
> node. Can this be implemented using random forests?

Hi Naiara,

I'm not an expert here, but what you propose, with mtry = number of
predictors, will give you a procedure known as bagging.
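
As a minimal sketch of that point, assuming the randomForest package is
installed and using the built-in airquality data purely as a stand-in
for your own dataset, setting mtry to the number of predictors makes
every predictor a candidate at every split, which is bagging:

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(42)

# Built-in data used only for illustration; rows with NAs are dropped
aq <- na.omit(airquality)
p <- ncol(aq) - 1  # number of predictors (every column except Ozone)

# mtry = p: all predictors are candidates at every split, i.e. bagging
bag <- randomForest(Ozone ~ ., data = aq, mtry = p, ntree = 500)

# For comparison: a forest with the default mtry for regression,
# which is max(floor(p / 3), 1)
rf <- randomForest(Ozone ~ ., data = aq, ntree = 500)

print(bag$mtry)
print(rf$mtry)
```

The object names (aq, p, bag, rf) and ntree = 500 are just choices made
for this example.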

You talk about support for the split and then for the node. Is this just
a typo or are you interested in the two different things?

I'm not aware of how you would do the latter in bagging or random
forests, as the whole point is to grow large trees, not pruned ones. As
to the former, trees are unstable: change the data used to train them
just a little and you can get a very different fitted tree.

Bagging and random forests exploit this instability to produce a better
predictor / classifier by combining n poor trees rather than relying on
one best tree. They do this by adding randomness to the procedure:
bootstrap sampling the training data and, in the case of random forests,
randomly sampling a small number, mtry, of the available predictors as
candidates at each split. As such there is no correspondence between the
splits of one tree and the splits of another, so trying to count how
many times a certain split in one or more trees is formed by the same
predictor is not meaningful. So it doesn't make sense (to me; it may to
others) to focus on individual splits in the n trees.
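
If you want to see this for yourself, here is a rough sketch (again
assuming the randomForest package and the built-in airquality data as
placeholder data). The root node is the only position that lines up
across trees, so it tabulates which predictor forms the root split in
each bagged tree, and uses varUsed() to count how often each predictor
is used anywhere in the forest. I would not treat either as an
established "support" measure; it is only a way to look at how much the
trees vary.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(42)
aq <- na.omit(airquality)
p <- ncol(aq) - 1

# Bagged trees: all predictors are candidates at every split
bag <- randomForest(Ozone ~ ., data = aq, mtry = p, ntree = 200)

# Predictor used for the root split (row 1 of getTree) in each tree
root_var <- sapply(seq_len(bag$ntree), function(k) {
  tr <- getTree(bag, k = k, labelVar = TRUE)
  as.character(tr[1, "split var"])
})
print(table(root_var))

# Number of times each predictor is used for a split, over all trees
used <- varUsed(bag, count = TRUE)
names(used) <- setdiff(names(aq), "Ozone")
print(used)
```

Below the root, "node 5" in one tree and "node 5" in another cover
different parts of the data, so the same tabulation has no clear
meaning there.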

I don't know exactly what you mean by "support", but if you are trying
to get a measure of how important each of your predictors is in
explaining variance in your response, then take a look at the
importance() function in the randomForest package. For regression this
produces two measures that let you determine which predictors
contribute most: the increase in out-of-bag MSE when a predictor is
permuted, and the total decrease in node impurity from splits on that
predictor.
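
A short sketch of how that looks in practice, assuming the randomForest
package and using the built-in airquality data only as an example
(substitute your own response and predictors):

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(42)
aq <- na.omit(airquality)

# importance = TRUE is needed to get the permutation-based measure
rf <- randomForest(Ozone ~ ., data = aq, importance = TRUE, ntree = 500)

# Matrix with one row per predictor; for regression the columns are
# "%IncMSE" (permutation measure) and "IncNodePurity" (node impurity)
imp <- importance(rf)
print(imp)

# A single measure can be requested with 'type':
#   type = 1 is the permutation measure, type = 2 is node impurity
imp_perm <- importance(rf, type = 1)
print(imp_perm[order(imp_perm[, 1], decreasing = TRUE), , drop = FALSE])
```

varImpPlot(rf) draws the same information as a dot chart if you prefer
a picture.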

HTH

G

> 
> Thanks!
> 
> Naiara.
> 
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%