[R] Rpart and bagging - how is it done?

apjaworski at mmm.com
Thu Mar 6 22:08:43 CET 2008


Hi there.

I was wondering if somebody knows how to perform a bagging procedure on a
classification tree without running the classifier with weights.

Let me first explain why I need this and then give some details of what I
have found out so far.

I am thinking about implementing the bagging procedure in Matlab.
Matlab has a simple classification tree function (in its Statistics
Toolbox), but it does not accept weights, and modifying the Matlab
function to accommodate weights would be very complicated.

The rpart function in R accepts weights, which seems to allow for a
rather simple implementation of bagging.  In fact, Everitt and Hothorn
describe such a procedure in chapter 8 of "A Handbook of Statistical
Analyses Using R".  The procedure consists of generating several
samples with replacement from the original data set, which has N rows.
The implementation described in the book first fits a non-pruned tree
to the original data set.  It then generates several (say, 25)
multinomial samples of size N with cell probabilities 1/N.  Each
sample is used in turn as the weight vector to update the original
tree fit.  Finally, all the updated trees are combined to produce
"consensus" class predictions.
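
For concreteness, here is a rough sketch of that procedure as I
understand it.  The data frame mydata and factor response Class are
placeholders (not the book's example), and the control settings are
just one plausible choice, not necessarily what the book uses.

library(rpart)

n <- nrow(mydata)
B <- 25

## B multinomial weight vectors of size n with equal cell
## probabilities 1/n; each column plays the role of one bootstrap sample
w <- rmultinom(B, size = n, prob = rep(1, n) / n)

## tree on the original data (xval = 0 just skips cross-validation)
base_fit <- rpart(Class ~ ., data = mydata,
                  control = rpart.control(xval = 0))

## refit the same tree with each weight vector
trees <- vector(mode = "list", length = B)
for (b in seq_len(B))
    trees[[b]] <- update(base_fit, weights = w[, b])

## one way to combine the trees: majority vote over class predictions
pred <- sapply(trees, function(tr)
    as.character(predict(tr, newdata = mydata, type = "class")))
consensus <- apply(pred, 1, function(p) names(which.max(table(p))))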

Now, a typical realization of a multinomial sample consists of small
integers and several 0's.  I thought that the weighting worked like
this: observations with weight 0 are omitted, and observations with
weight greater than 1 are essentially replicated according to their
weight.  So I thought that instead of running the rpart procedure with
weights, say, (1, 0, 2, 0, 1, ...), I could simply generate a sample
data set by retaining row 1, omitting row 2, including row 3 twice,
omitting row 4, retaining row 5, and so on.  However, this does not
seem to work as I expected.  Instead of getting identical trees (from
running weighted rpart on the original data set and running unweighted
rpart on the sample data set described above), I get trees that are
completely different (different threshold values and a different order
of variables entering the splits).  Moreover, the predictions from
these trees can differ, so the misclassification rates usually differ
as well.
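
To make the comparison concrete, here is roughly the experiment, using
the same placeholder names as in the sketch above; it just sets up the
two fits side by side for a single weight vector.

library(rpart)

n <- nrow(mydata)
w1 <- rmultinom(1, size = n, prob = rep(1, n) / n)[, 1]  # e.g. (1, 0, 2, 0, 1, ...)

## (a) weighted fit on the original data
fit_w <- rpart(Class ~ ., data = mydata, weights = w1,
               control = rpart.control(xval = 0))

## (b) unweighted fit on a data set in which row i appears w1[i] times
##     (weight-0 rows dropped, weight-2 rows appearing twice, and so on)
mydata_rep <- mydata[rep(seq_len(n), times = w1), ]
fit_r <- rpart(Class ~ ., data = mydata_rep,
               control = rpart.control(xval = 0))

## comparing the printed trees shows different split variables and thresholds
print(fit_w)
print(fit_r)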

This finally brings me to my question: is there a way to mimic the
workings of the weighting in rpart by, for example, modifying the data
set or, perhaps, by some other means?

Thanks in advance for your time,

Andy

__________________________________
Andy Jaworski
518-1-01
Process Laboratory
3M Corporate Research Laboratory
-----
E-mail: apjaworski at mmm.com
Tel:  (651) 733-6092
Fax:  (651) 736-3122


