[R] pruning trees using rpart
Prof Brian Ripley
ripley at stats.ox.ac.uk
Wed Dec 17 10:59:28 CET 2008
On Wed, 17 Dec 2008, Tom Cattaert wrote:
> I am using the packages tree and rpart to build a classification tree to
> predict a 0/1 outcome. The package rpart has the advantage that the function
> plotcp gives a visual representation of the cross-validation results with a
> horizontal line indicating the 1 standard error rule, i.e. the
> recommendation to select the most parsimonious model (the smallest tree)
> whose error is not more than one standard error above the error of the best
> model.
> However, in the rpart package I am not getting trees of all sizes: in one
> example I am working with, the sizes are 1, 2 and 5, while cv.tree in
> package tree gives 1, 2, 3, 4, 5, as I would guess it should
> (weakest link pruning successively collapses the internal nodes that
> contribute the least). What is the reason for this?
How are we to know without the reproducible example you were asked for?
The pruning sequence need not cover all sizes; which sizes appear depends
on the inputs and the tuning parameters.
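
As a rough illustration of the point above, the sketch below prints the
pruning sequence that rpart stores in the cptable component of a fit. The
kyphosis data set shipped with rpart is only a stand-in for the poster's
data, which was not supplied, and the variable names are invented for the
example. Collapsing one internal node removes its whole branch, which may
contain several splits, so consecutive rows can differ by more than one
split; the cp value in rpart.control also stops splits that do not improve
the fit enough.

```r
library(rpart)

# kyphosis is bundled with rpart; it is an assumption standing in for the
# poster's unavailable data.
set.seed(1)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# Each row of cptable is one subtree in the cost-complexity sequence.
cp_tab <- fit$cptable
sizes  <- cp_tab[, "nsplit"] + 1   # terminal nodes = splits + 1
print(cbind(cp_tab, size = sizes))

# 1-SE rule: the smallest tree whose cross-validated error is within one
# standard error of the minimum (rows are ordered from small to large trees).
best   <- which.min(cp_tab[, "xerror"])
limit  <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]
chosen <- which(cp_tab[, "xerror"] <= limit)[1]
pruned <- prune(fit, cp = cp_tab[chosen, "CP"])
```

Lowering cp or minsplit in rpart.control() grows a larger initial tree and
so may add rows to the table, but gaps in the size column can remain.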
> A second problem I am having in both packages is that the cross-validation
> results are highly variable between different runs of the programs. This is
> not unexpected as cross-validation means that the dataset is randomly
> divided into 10 equal subsets, which can be done in a lot of different ways.
> One then hopes that the results do not depend on this very much, but I
> have observed that they often do. Should one then do this many times, e.g. 100, each
> time select the model using the 1 standard error rule, and in the end count
> which model got selected most often? Or rather do it many times and average
> the means and standard errors of the prediction error? Or does a very high
> variability in cross-validation results mean that the dataset is too small
> to reach conclusions?
MASS (the book) covers this.
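
For readers who want to see how much the selection moves between runs, the
first option raised in the question (repeat the cross-validation and count
which size the 1-SE rule picks) might be sketched as follows. This is not
taken from MASS; the kyphosis data, the helper name one_se_size and the 100
seeds are all assumptions made for the example.

```r
library(rpart)

# Hypothetical helper: refit with a fresh random 10-fold partition and
# return the tree size (terminal nodes) chosen by the 1-SE rule.
one_se_size <- function(seed) {
  set.seed(seed)                       # a different fold assignment per seed
  fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               method = "class", control = rpart.control(xval = 10))
  tab    <- fit$cptable
  best   <- which.min(tab[, "xerror"])
  limit  <- tab[best, "xerror"] + tab[best, "xstd"]
  chosen <- which(tab[, "xerror"] <= limit)[1]
  unname(tab[chosen, "nsplit"]) + 1
}

# Repeat over 100 seeds and tabulate how often each size is selected.
sizes <- vapply(1:100, one_se_size, numeric(1))
print(table(sizes))
```

A table concentrated on one size suggests the choice is fairly stable; a
wide spread is one sign that the data may not support a firm choice.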
> Kind regards and thanks for your help,
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595