[R] Does rpart package have some requirements on the original data set?

Fri Feb 16 00:51:05 CET 2007

Hi,
  try setting minsplit=2 and cp=0 in rpart.control(), so that rpart
keeps splitting instead of stopping at the root. After training you can
prune the tree with different values of cp and plot how the accuracy
changes.
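
As a rough sketch of that idea: the data frame d, its columns, and the
class proportions below are made up for illustration (they are not your
data), and method="class" is my assumption to force a classification
tree when y is coded 0/1.

```r
library(rpart)

set.seed(1)
n <- 1000
# hypothetical unbalanced data: roughly 10% of the cases have y == 0
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- factor(rbinom(n, 1, ifelse(d$x1 > 0.8, 0.6, 0.98)))

# grow the tree without the usual stopping rules
full <- rpart(y ~ x1 + x2, data = d, method = "class",
              control = rpart.control(minsplit = 2, cp = 0))

# complexity table: one row per candidate subtree
printcp(full)

# prune back to a smaller tree; cp = 0.01 is only an example value
pruned <- prune(full, cp = 0.01)
c(full = nrow(full$frame), pruned = nrow(pruned$frame))  # node counts
```

The fully grown tree will overfit, so treat it only as the starting
point for pruning.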

Try this code (which I'm sure can be improved):

library(rpart)

# Prune the unpruned tree at each value of cp and record the test-set
# accuracy and the number of nodes of each pruned tree.
rpart.prune.stats <- function(unpruned.tree, testset, class.index.name, cp) {
    acc <- numeric(length(cp))
    nnodes <- integer(length(cp))

    for (i in seq_along(cp)) {
        print(paste("cp =", cp[i]))

        # always prune the full tree, so that the result does not
        # depend on the order of the cp values
        pruned <- prune(unpruned.tree, cp = cp[i])
        pred <- predict(pruned, testset, type = "class")
        # compare as character so factor and 0/1 codings both work
        acc[i] <- mean(as.character(pred) ==
                       as.character(testset[[class.index.name]]))
        nnodes[i] <- nrow(pruned$frame)
    }

    list(acc = acc, nnodes = nnodes)
}

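# Grow a full tree on the training set, prune it at each value of cp and
# plot the test-set accuracy against cp. Every 5th point is labelled
# with the number of nodes of the pruned tree.
plot.rpart.prune.results <- function(formula, trainingset, testset,
                                     class.index.name, dataset.name, cp,
                                     add = FALSE, ylim = NULL) {
    # method = "class" forces a classification tree even if the
    # response is coded as numeric 0/1
    rpart.unpruned <- rpart(formula, data = trainingset, method = "class",
                            control = rpart.control(minsplit = 2, cp = 0))
    res <- rpart.prune.stats(rpart.unpruned, testset, class.index.name, cp)

    acc <- unlist(res$acc)
    nnodes <- unlist(res$nnodes)

    print(acc)
    print(nnodes)

    # draw on top of the existing plot when add = TRUE
    if (add)
        par(new = TRUE)
    plot(cp, acc, type = "l", col = "blue", ylim = ylim, ann = FALSE)

    lab <- seq(1, length(cp), by = 5)
    text(cp[lab], acc[lab], paste("(", nnodes[lab], ")", sep = ""),
         pos = 3, cex = 0.5)
    title(main = dataset.name, xlab = "cp", ylab = "Accuracy",
          font = 3, cex = 0.5)
}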

and call it with something like:
plot.rpart.prune.results(Class~.,DatasetX.train,DatasetX.test,"Class","DatasetX",cp=seq(0,0.005,by=0.0001))

You can also oversample the minority class (sampling with replacement)
or undersample the majority class.  These are two very simple
techniques used in machine learning when dealing with unbalanced
datasets (there are more complicated techniques which produce better
results, though).
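
A minimal sketch of both resampling ideas, on a made-up training set
(the data frame train, its columns and the 90/10 split are assumptions
for illustration only). Resample only the training data, never the test
set, otherwise the accuracy estimate is distorted.

```r
set.seed(42)
# hypothetical unbalanced training set: 90 cases of class 1, 10 of class 0
train <- data.frame(x1 = rnorm(100),
                    y  = factor(rep(c(1, 0), times = c(90, 10))))

minority <- which(train$y == "0")
majority <- which(train$y == "1")

# oversampling: draw minority rows with replacement until the classes match
over.idx   <- c(majority, sample(minority, length(majority), replace = TRUE))
train.over <- train[over.idx, ]

# undersampling: keep only a random subset of the majority rows
under.idx   <- c(minority, sample(majority, length(minority)))
train.under <- train[under.idx, ]

table(train.over$y)
table(train.under$y)
```

Either train.over or train.under can then be passed to rpart in place of
the original training set.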

hope this helps,
cheers,
Roberto

On 2/15/07, Liu, Ningwei <ningwei.liu at countryfinancial.com> wrote:
> Hi,
>
> I am currently studying decision trees using the rpart package in R. I
> artificially created a data set which includes the dependent variable
> (y) and a few independent variables (x1, x2...). The dependent variable
> y only comprises 0 and 1. 90% of y are 1 and 10% of y are 0. When I
> apply rpart to it, there is no splitting at all.
>
> I am wondering whether this is because of the "special" distribution of
> y. Since the majority of y is 1 (information in the data set is small),
> rpart automatically regards it as already a single class and therefore
> won't proceed any further. If this understanding is correct, what
> should I do if I still want rpart to do something on this data set?
>
> Thanks a lot!
>
> Ningwei