[R] Some questions on Rpart algorithm

Tue Oct 17 21:17:44 CEST 2006

With regard to your first question, here's a function I've used a couple
of times to get plots similar to those you're looking for. (Search the
list archives for how to find the source code of hidden methods. Also,
the ?rpart help page gives a reference other than MASS.)
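
For reference, one common way to look at the source of a method that is
not exported is sketched below. It is only an illustration, not the only
route; it assumes the rpart package is installed, and the name txt_src is
made up for the example.

```r
library(rpart)

# List the S3 methods known for the text() generic; the rpart method
# shows up here even though it cannot be called by name directly.
methods(text)

# getAnywhere() searches registered methods and package namespaces,
# so it finds the function whether or not it is exported.
getAnywhere("text.rpart")

# Retrieve the method itself so it can be printed, copied, or modified.
txt_src <- getS3method("text", "rpart")
txt_src
```

Typing rpart:::text.rpart at the prompt should show the same function.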

#bogdan romocea 2006-06
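#adapted source code from
#  - text.rpart() from package mvpart
#  - functions$text from rpart()
#  to get acceptable plots of classification trees
#needs rpartco() and sub.barplot() to be available (presumably by loading
#  mvpart, the package the code was adapted from)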
#the tweaked tree plots show the following:
#  - size of each node (counts and percentages)
#  - splitting rules
#  - % cases in each node, or counts
#  - targets with more than 3 categories are properly labelled through colors
#    (unlike in text.rpart() from mvpart)
#example:
#  x <- rpart(...,method="class")
#  plot(x,uniform=TRUE,margin=0.02)
#  my.tree.text(x,ncomp.offset=4)

my.tree.text <- function(x,percent=TRUE,pct.decimals=0,ncomp.offset=2,
   clr=c("red","yellow","blue","green","brown","purple","navy"))
{
frame <- x$frame ; col <- names(frame)
method <- x$method ; ylevels <- attr(x, "ylevels")
xy <- rpartco(x) ; node <- as.numeric(row.names(x$frame))
leaves <- rep(TRUE, nrow(frame))
bar.vals <- x$functions$bar(yval2 = frame$yval2)
node.size <- rowSums(bar.vals)
node.title <- paste(node.size," / ",
   round(100*node.size/node.size[1]),"%",sep="")
#---the node barplots
sub.barplot(xy$x,xy$y,bar.vals,leaves,xadj=1,yadj=1,bord=TRUE,line=TRUE,col=clr)
rx <- range(xy$x) ; ry <- range(xy$y)
#---the legend
if (!is.null(ylevels)) bar.labs <- ylevels else bar.labs <- dimnames(x$y)[[2]]
legend(min(xy$x) - 0.1 * rx, max(xy$y) + 0.05 * ry, bar.labs,
   col = clr, pch = 15, bty = "n")
text(xy$x[leaves],xy$y[leaves],labels=node.title,pos=3,cex=1.5,offset=1)
#---the splitting rules
cxy <- par("cxy")
left.child <- match(2 * node, node)
right.child <- match(node * 2 + 1, node)
rows <- labels(x, pretty = pretty)
text(xy$x,xy$y + 0.5 * cxy[2],rows[left.child],pos=2,col="navy")
text(xy$x,xy$y + 0.5 * cxy[2],rows[right.child],pos=4,col="navy")
#---target composition per node (% or counts)
if (is.null(frame$yval2)) yval <- frame$yval[leaves] else
   yval <- frame$yval2[leaves,]
nclass <- (ncol(yval) - 1)/2
counts <- yval[, 1 + (1:nclass)]
group <- yval[, 1]
if (!is.null(bar.labs)) group <- bar.labs[group]
if (percent) {
   #identical(counts / rowSums(counts),prop.table(counts,1))
   nbr <- round(100*prop.table(counts,1),pct.decimals)
   #t1 <- apply(matrix(nbr,ncol=nclass),2,paste,"%",sep="")
   #t2 <- apply(matrix(t1,ncol=nclass),1,paste,collapse="/")
   t2 <- apply(matrix(nbr,ncol=nclass),1,paste,collapse="|")
   nlab <- paste(format(group,justify="left"),"\n%: ",t2,sep = "")
} else {
   t2 <- apply(matrix(counts,ncol=nclass),1,paste,collapse="|")
   nlab <- paste(format(group,justify="left"),"\nN: ",t2,sep = "")
}
text(xy$x[leaves],xy$y[leaves],labels=nlab,pos=1,offset=ncomp.offset)
}
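
The percentages in the labels are just the per-class counts of each node
divided by the node total. A minimal sketch of that step on its own, using
only rpart and the kyphosis data that ships with it (the names fit, ylev,
counts and pct are made up for the example). It assumes, as the function
above does, that the count columns of frame$yval2 come right after the
predicted-class column.

```r
library(rpart)

# Small classification tree on the kyphosis data from rpart.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

ylev   <- attr(fit, "ylevels")   # class labels of the target
nclass <- length(ylev)

# Column 1 of yval2 is the predicted class; the next nclass columns
# are assumed to hold the per-class counts for each node.
counts <- fit$frame$yval2[, 1 + seq_len(nclass), drop = FALSE]

# Row-wise percentages, one row per node and one column per class.
pct <- round(100 * prop.table(counts, 1))
colnames(pct) <- ylev
rownames(pct) <- row.names(fit$frame)
pct
```

Each row of pct is one node of the tree, labelled by its node number.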

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Marcus, Jeffrey
> Sent: Tuesday, October 17, 2006 10:03 AM
> To: r-help at stat.math.ethz.ch
> Subject: [R] Some questions on Rpart algorithm
>
> Hello:
>   I am using rpart and would like more background on how the splits
> are made and how to interpret results - also how to properly use
> text(.rpart). I have looked through Venables and Ripley and through
> the rpart help and still have some questions. If there is a source
> (say, Breiman et al.) on decision trees that would clear this all up,
> please let me know. The questions below pertain to a classification
> task (i.e., I'm using the "class" method). Many thanks in advance.
>
>
> (1)  I'd like text(.rpart) to print percentages of each class rather
> than counts. I don't see an option for this so would like to modify
> text.rpart. However, I can't find the source since it is a method
> that's "hidden". How can I find the source?
>
> (2) printcp prints a table with columns cp, nsplit, rel error,
> xerror, xstd. I am guessing that cp is complexity, nsplit is the
> number of the split, rel error is the error on test set, xerror is
> cross-validation error and xstd is standard deviation of error across
> the cross-validation sets. Is there any documentation on this? For
> instance, how exactly is complexity computed?
>
> (3)  What's a "loss matrix"? Is it the cost placed on each type of
> misclassification?
>
> (4) [More of a methodology question] In practice, when would one use
> different costs on different splitting variables?
>
> Thanks for any help on this.
>
>   Jeff