[R] rpart - classification and regression trees (CART)

Katie N knishimura at gmail.com
Sat Dec 12 22:14:59 CET 2009


Hi,
I have a question about the rpart() function in R.  I used seven continuous
predictor variables in the model, and the variable "TB122" was chosen
for the first split.  Looking at the output, however, there are four variables
that improve the predicted class membership equally (TB122, TB139, TB144, and
TB118) - output pasted below.

Node number 1: 268 observations,    complexity param=0.6
  predicted class=0  expected loss=0.3
    class counts:   197    71
   probabilities: 0.735 0.265 
  left son=2 (188 obs) right son=3 (80 obs)
  Primary splits:
      TB122 < 80  to the left,  improve=50, (0 missing)
      TB139 < 90  to the left,  improve=50, (0 missing)
      TB144 < 90  to the left,  improve=50, (0 missing)
      TB118 < 90  to the left,  improve=50, (0 missing)
      TB129 < 100 to the left,  improve=40, (0 missing)
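
For reference, the call that produced this looks roughly like the sketch
below ("dat" and "response" are stand-in names for my data frame and 0/1
outcome; the real column names differ):

library(rpart)

## dat: data frame with the 0/1 outcome `response` and the seven
## continuous TB* predictors (names here are placeholders)
fit <- rpart(response ~ ., data = dat, method = "class")
summary(fit)  # prints each node's primary splits with their "improve" values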

I need to know what method R uses to select the best variable for the
node.  Somewhere I read that the best split is the one with the greatest
improvement in predictive accuracy, i.e., the one producing the most
homogeneous yes/no groups, i.e., the greatest reduction in impurity.  I also
read that the Gini index, chi-square, or G-square can be used to evaluate
the level of impurity.
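
If I understand the convention in the rpart documentation correctly, the
Gini-based improvement is the parent's count-weighted impurity minus that of
the two children.  A sketch of that calculation (the parent class counts are
from node 1 above; the child class counts are made up for illustration - the
real ones appear further down in the summary() output):

## Gini index of a node: 1 - sum(p_k^2) over the class proportions
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

parent <- c(197, 71)  # class counts at node 1, from the output above
left   <- c(170, 18)  # hypothetical counts for the 188-obs left son
right  <- c(27, 53)   # hypothetical counts for the 80-obs right son

## improvement = n * I(parent) - n_L * I(left) - n_R * I(right)
sum(parent) * gini(parent) - sum(left) * gini(left) - sum(right) * gini(right)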

For this function in R:
1) Why exactly did R pick TB122 over the other variables when they all had
the same improvement?  Was TB122 chosen for the first split because the
groups "TB122 < 80" and "TB122 >= 80" were the most homogeneous (i.e., had
the least impurity)?
2) If R is using impurity to determine the best splits, which measure (the
Gini index, chi-square, or G-square) is it using?  (I pasted below a parms
option I found in the help page, in case that is the relevant switch.)
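
For what it's worth, ?rpart mentions a parms argument that appears to choose
between the two classification criteria, e.g.:

fit_gini <- rpart(response ~ ., data = dat, method = "class",
                  parms = list(split = "gini"))
fit_info <- rpart(response ~ ., data = dat, method = "class",
                  parms = list(split = "information"))

Is split = "gini" the default, and does "information" correspond to G-square?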

Thanks!
Katie



