[R] User defined split function in Rpart

Wed Jan 3 17:56:18 CET 2007

Dear all,
I'm trying to work with the user-defined split functions in rpart
(file rpart\tests\usersplits.R in
http://cran.r-project.org/src/contrib/rpart_3.1-34.tar.gz; see the bottom of
this email).
Suppose we have the following data.frame (note that the values of x are
already sorted):
> D
   y     x
1  7 0.428
2  3 0.876
3  1 1.467
4  6 1.492
5  3 1.703
6  4 2.406
7  8 2.628
8  6 2.879
9  5 3.025
10 3 3.494
11 2 3.496
12 6 4.623
13 4 4.824
14 6 4.847
15 2 6.234
16 7 7.041
17 2 8.600
18 4 9.225
19 5 9.381
20 8 9.986

Running rpart with minbucket=1 and maxdepth=1 gives the following tree
(which, by default, splits on deviance):
> rpart(D$y~D$x,control=rpart.control(minbucket=1,maxdepth=1))
n= 20

node), split, n, deviance, yval
      * denotes terminal node

1) root 20 84.80000 4.600000
  2) D$x< 9.6835 19 72.63158 4.421053 *
  3) D$x>=9.6835  1  0.00000 8.000000 *

This means that the first 19 observations have been sent to the left side of
the tree and one observation to the right.
This agrees with the goodness vector returned by the split function, whose
maximum is its last element.
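
The agreement can be checked by hand. What follows is only a minimal sketch in base R, assuming the y and x values listed above and unit weights; the names dev, reduction, best and cut are introduced here for illustration and are not part of rpart. For each candidate split of observations 1:i versus (i+1):n it computes the drop in deviance relative to the root deviance, which should coincide with the goodness defined in the split function at the bottom of this email, and then locates the best split.

```r
# y and x copied from the data frame D above (x is already sorted)
y <- c(7, 3, 1, 6, 3, 4, 8, 6, 5, 3, 2, 6, 4, 6, 2, 7, 2, 4, 5, 8)
x <- c(0.428, 0.876, 1.467, 1.492, 1.703, 2.406, 2.628, 2.879, 3.025, 3.494,
       3.496, 4.623, 4.824, 4.847, 6.234, 7.041, 8.600, 9.225, 9.381, 9.986)
n <- length(y)

# deviance = sum of squares about the mean
dev <- function(v) sum((v - mean(v))^2)
root.dev <- dev(y)

# relative deviance reduction for splitting obs 1:i vs (i+1):n
reduction <- sapply(1:(n - 1), function(i) {
  (root.dev - dev(y[1:i]) - dev(y[(i + 1):n])) / root.dev
})

best <- which.max(reduction)          # index of the best candidate split
cut  <- (x[best] + x[best + 1]) / 2   # midpoint between neighbouring x values
print(c(best = best, cut = cut))
```

With these data the best index is 19 and the midpoint is 9.6835, the same split point that appears in the tree above.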

The thing I really don't understand is the direction vector:
# direction= -1 = send "y< cutpoint" to the left side of the tree
#             1 = send "y< cutpoint" to the right

What does it mean?
In the example considered here we have
> sign(lmean)
[1] 1 1 -1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1
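
For reference, this vector can be reproduced outside rpart. A minimal sketch, assuming the same y values and unit weights as above (yc, left.mean, right.mean and idx are names introduced here for illustration): because y is centered before the cumulative sums are taken, sign(lmean) at position i is +1 when the mean of observations 1:i lies above the overall mean and -1 when it lies below, so it seems only to record which side of each candidate split has the larger mean y.

```r
y  <- c(7, 3, 1, 6, 3, 4, 8, 6, 5, 3, 2, 6, 4, 6, 2, 7, 2, 4, 5, 8)
wt <- rep(1, length(y))
n  <- length(y)

# same steps as the continuous branch of the split function
yc        <- y - sum(y * wt) / sum(wt)   # centered y
temp      <- cumsum(yc * wt)[-n]
left.wt   <- cumsum(wt)[-n]
lmean     <- temp / left.wt
direction <- sign(lmean)

# uncentered means on each side of every candidate split
left.mean  <- cumsum(y)[-n] / left.wt
right.mean <- (sum(y) - cumsum(y)[-n]) / (n - left.wt)

# position 10 is left out: the first 10 y's average exactly 4.6, the
# overall mean, so the sign there is decided by floating-point rounding
idx <- setdiff(1:(n - 1), 10)
print(all(direction[idx] == ifelse(left.mean[idx] > right.mean[idx], 1, -1)))
print(direction[n - 1])   # element belonging to the split at index 19
```

The element at position 19, the split that appears in the tree above, is -1.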

What criterion is used?
In my opinion all the values should be equal to -1, given that the
observations have to be sent to the left side of the tree.
Can someone help me?
Thank you

#######################################################
# The split function, where most of the work occurs.
#   Called once per split variable per node.
# If continuous=T (the case considered here)
#   The actual x variable is ordered
#   y is supplied in the sort order of x, with no missings,
#   return two vectors of length (n-1):
#      goodness = goodness of the split, larger numbers are better.
#                 0 = couldn't find any worthwhile split
#        the ith value of goodness evaluates splitting obs 1:i vs (i+1):n
#      direction= -1 = send "y< cutpoint" to the left side of the tree
#                  1 = send "y< cutpoint" to the right
#         this is not a big deal, but making larger "mean y's" move towards
#         the right of the tree, as we do here, seems to make it easier to
#         read
# If continuous=F, x is a set of integers defining the groups for an
#   unordered predictor. In this case:
#      direction = a vector of length m= "# groups". It asserts that the
#         best split can be found by lining the groups up in this order
#         and going from left to right, so that only m-1 splits need to
#         be evaluated rather than 2^(m-1)
#      goodness = m-1 values, as before.
#
# The reason for returning a vector of goodness is that the C routine
#   enforces the "minbucket" constraint. It selects the best return value
#   that is not too close to an edge.
The vector wt of weights in our case is:
> wt
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

temp2 <- function(y, wt, x, parms, continuous) {
    # Center y around its weighted mean
    n <- length(y)
    y <- y - sum(y*wt)/sum(wt)
    if (continuous) {
        # continuous x variable: evaluate the n-1 splits 1:i vs (i+1):n
        temp <- cumsum(y*wt)[-n]
        left.wt <- cumsum(wt)[-n]
        right.wt <- sum(wt) - left.wt
        lmean <- temp/left.wt
        rmean <- -temp/right.wt
        goodness <- (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2)
        list(goodness = goodness, direction = sign(lmean))
    }
}

Paolo Radaelli
Dipartimento di Metodi Quantitativi per le Scienze Economiche ed Aziendali
Facoltà di Economia
Università degli Studi di Milano-Bicocca
P.zza dell'Ateneo Nuovo, 1
20126 Milano
Italy
e-mail paolo.radaelli a unimib.it