[R] In rpart, how is "improve" calculated? (in the "class" case)

Tue Jun 14 21:56:40 CEST 2011

Tal,

For the Gini criterion, the "improve" value can be calculated as the 
sum, over the two child nodes, of the drop in impurity from the root to 
that node, weighted by the node's sample size.  Continuing with your 
original code:

# for "gini"
impurity_root <- gini(prop.table(table(y)))
impurity_l <- gini(prop.table(table(obs_0)))
impurity_R <- gini(prop.table(table(obs_1)))

# 13 and 7 are the sample sizes of the nodes holding obs_0 and obs_1
13*(impurity_root - impurity_l) + 7*(impurity_root - impurity_R)
[1] 5.384615
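
The same calculation can be written as a small self-contained function 
working from class counts. This is only a sketch: the helper names 
gini_counts() and improve_gini() are made up for illustration, and the 
class counts (10 zeros and 3 ones in the x = 0 node, 0 zeros and 7 ones 
in the x = 1 node) are an assumption chosen to be consistent with the 
node sizes and the value above, not taken from rpart output.

```r
# Gini impurity from a vector of class counts
gini_counts <- function(counts) {
  p <- counts / sum(counts)
  sum(p * (1 - p))
}

# Hypothetical helper: sum over the two children of
# (node size) * (root impurity - child impurity)
improve_gini <- function(left, right) {
  root <- left + right  # class counts in the parent node
  sum(left) * (gini_counts(root) - gini_counts(left)) +
    sum(right) * (gini_counts(root) - gini_counts(right))
}

# Assumed class counts (zeros, ones) in each child node
left_counts <- c(10, 3)
right_counts <- c(0, 7)

gini_improve <- improve_gini(left_counts, right_counts)
gini_improve  # about 5.38
```

Because the node sizes add up to n, this is the same as n times the 
usual impurity decrease, n*(root - (n_l/n)*left - (n_R/n)*right).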

This does not appear to extend immediately to the information criterion, 
however.  I'm not sure about the 6.84.
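
One thing that may be worth checking is the base of the logarithm. The 
sketch below repeats the node-size-weighted calculation with entropy in 
place of Gini, once with natural logs and once with base 2 as in the 
entropy() function from the original post. That rpart works in natural 
logs is an assumption here, not something confirmed against the rpart 
source; the helper names and the class counts (10/3 and 0/7) are 
likewise assumptions for illustration.

```r
# Entropy from a vector of class counts; empty classes contribute 0
entropy_counts <- function(counts, base = exp(1)) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p, base))
}

# Hypothetical helper: node-size-weighted drop in entropy for one split
improve_info <- function(left, right, base = exp(1)) {
  root <- left + right  # class counts in the parent node
  sum(left) * (entropy_counts(root, base) - entropy_counts(left, base)) +
    sum(right) * (entropy_counts(root, base) - entropy_counts(right, base))
}

# Assumed class counts (zeros, ones) in each child node
left_counts <- c(10, 3)
right_counts <- c(0, 7)

info_natural <- improve_info(left_counts, right_counts)            # about 6.84
info_base2 <- improve_info(left_counts, right_counts, base = 2)    # about 9.87
c(natural = info_natural, base2 = info_base2)
```

With these counts the natural-log version lands on the reported 6.84, 
while the base-2 version does not, which would explain why the Gini 
recipe did not seem to carry over.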

Ed

On 6/14/11 5:00 AM, r-help-request at r-project.org wrote:
> ------------------------------
>
> Message: 4
> Date: Mon, 13 Jun 2011 15:47:26 +0300
> From: Tal Galili<tal.galili at gmail.com>
> To:r-help at r-project.org
> Subject: [R] In rpart, how is "improve" calculated? (in the "class"
>          case)
> Message-ID:<BANLkTimp1aFQoYrKina7H0Rnk=0zKR_iDw at mail.gmail.com>
> Content-Type: text/plain
>
> Hi all,
>
> I apologize in advance if I am missing something very simple here, but since
> I failed at resolving this myself, I'm sending this question to the list.
>
> I would appreciate any help in understanding how the rpart function is
> (exactly) computing the "improve" (which is given in fit$split), and how it
> differs when using the split='information' vs split='gini' parameters.
>
> According to the help in rpart.object:
> "improve, which is the improvement in deviance given by this split"
> From what I understand, that would mean that the "improve" value should not
> be different when using different "split" switches.  Since it is different,
> then I suspect that it is reflecting  the impurity measure somehow, but I
> can't seem to understand how exactly.
>
> Below is some simple R code showing the result for a simple classification
> tree, with what the function outputs, and what I would have expected to see
> if "improve" were to simply reflect the change in impurity.
>
>
> set.seed(1324)
> y<- sample(c(0,1), 20, T)
> x<- y
> x[1:5]<- 0
> require(rpart)
> fit<- rpart(y~x, method = "class", parms=list(split='information'))
> fit$split[,3] # why is improve here 6.84 ?
> fit<- rpart(y~x, method = "class", parms=list(split='gini'))
> fit$split[,3] # why is improve here 5.38 ?
>
>
> # Here is what I thought it should have been:
> # for "information"
> entropy<- function(p) {
> # works for the case when y has only the categories 0 and 1
> if(any(p==1)) return(0)
>   -sum(p*log(p,2))
> }
> gini<- function(p) {sum(p*(1-p))}
>
> obs_1<- y[x>.5]
> obs_0<- y[x<.5]
> n_l<- sum(x<.5) # size of the node holding obs_0
> n_R<- sum(x>.5) # size of the node holding obs_1
> n<- length(x)
>
> # for entropy (information)
> impurity_root<- entropy(prop.table(table(y)))
> impurity_l<- entropy(prop.table(table(obs_0)))
> impurity_R<-entropy(prop.table(table(obs_1)))
> # shouldn't this have been "improve" ??
> impurity_root - ((n_l/n)*impurity_l + (n_R/n)*impurity_R) # 0.4934
>
> # for "gini"
> impurity_root<- gini(prop.table(table(y)))
> impurity_l<- gini(prop.table(table(obs_0)))
> impurity_R<-gini(prop.table(table(obs_1)))
> impurity_root - ((n_l/n)*impurity_l + (n_R/n)*impurity_R) # 0.2692
>
>
> Thanks upfront,
> Tal

-- 
*** Note new email address ***
Ed Merkle, PhD
Assistant Professor
Department of Psychological Sciences (starting August 2011)
University of Missouri
Columbia, MO, USA 65211