[R] cross-validation in rpart

Terry Therneau therneau at mayo.edu
Mon Jul 7 15:00:40 CEST 2008


-- begin included message
I'm having a problem with custom functions in rpart, and before I tear my
hair out trying to fix it, I want to make sure it's actually a problem.  It
seems that, when you write custom functions for rpart (init, split and eval)
then rpart no longer cross-validates the resulting tree to return errors.  A
simple test is to use the usersplits.R function to get a simple, custom
rpart function, and then change fit1 and fit2 so that they both have xval set to
10.  The problem is that the cptable for fit1 doesn't have xerror or
xstd, despite the fact that the cross-validation is set to 10-fold.

I guess I just need confirmation that cross-validation doesn't work with
custom functions, and if someone could explain to me why that is the case it
would be greatly appreciated.

Thanks,
Sam Stewart

---- end inclusion

  You are right, cross-validation does not happen automatically with 
user-written split functions.  We can think of cross-validation as having two 
steps:

   1. Get the predicted values for each observation, when that obs (or a group) 
is left out of the data set.  There is actually a vector of predicted values, 
one for each level of model complexity.  This step can be done using 
xpred.rpart, which does work for user-defined splits.  It returns a matrix with 
n rows (one per obs) and one column for each of the target cp values.  Call this 
matrix "yhat".

   2. Summarize each column of the above matrix yhat into a single "goodness" 
value.  For anova fitting, for instance, this is just colMeans((y-yhat)^2).  For 
classification models it is a bit more complex: we have to add up the expected 
loss L(y, yhat) for each column using the loss matrix and the priors. 
   The reason that rpart does not do this step for a user-written function is 
that rpart does not know what summary is appropriate.  For some splitting rules, 
e.g. survival data split using a log-rank test, I'm not sure that \italics{I} 
know what summation is appropriate.  
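
   As a rough sketch of the two steps, here they are done by hand.  This 
assumes an ordinary anova fit on the car.test.frame data that ships with rpart, 
used only so that the example is self-contained; for a user-written method the 
fit would come from rpart(..., method = <your list of functions>), and the 
summary in step 2 is whatever you decide is appropriate.  The names fit, yhat, 
y and xerr are illustrative only.

```r
library(rpart)

# Stand-in fit: a built-in anova tree (assumption; substitute your own
# user-written method list here).
fit <- rpart(Mileage ~ Weight, data = car.test.frame, method = "anova")

# Step 1: cross-validated predictions, one row per observation and one
# column per target cp value.
set.seed(1)                      # the fold assignment is random
yhat <- xpred.rpart(fit, xval = 10)

# Step 2: summarize each column into a single "goodness" value.
# For anova this is the mean squared prediction error.
y <- car.test.frame$Mileage
xerr <- colMeans((y - yhat)^2)

print(xerr)
```

   The values in xerr are on the raw squared-error scale, one per column of 
yhat.  Dividing by mean((y - mean(y))^2) would put them on a relative scale, 
if that is what you want to compare against.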
   
   Terry Therneau


