[R] rpart question

Mon Jan 16 03:10:57 CET 2012

Hi Amanda,

Sorry for the slow response (classes and research have been
chaotic).  Below are details on what I looked at, with a few
suggestions at the end for what you can do.

To the general R community: summary.rpart() explicitly sets
drop = TRUE when subsetting with `[`, which makes me think the
dropping may be important.  However, it seems to cause problems when
the tree has only one node: a 1 x k matrix is subset, its dimensions
are dropped, and the result is a vector.  Could this be changed to
drop = FALSE (fixing the one-node case) without causing problems for
other models?
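
For anyone following along, here is a small base-R sketch of the
dropping behavior in question.  The matrix values are made up to
resemble a one-node yval2; they are not pulled from rpart itself.

```r
## A 1 x 5 matrix standing in for a single-node yval2 (illustrative values)
y <- matrix(c(1, 17, 3, 0.85, 0.15), nrow = 1)

dropped <- y[1, , drop = TRUE]   # dimensions are dropped: a plain vector
kept    <- y[1, , drop = FALSE]  # dimensions are kept: still a 1 x 5 matrix

is.null(dim(dropped))
dim(kept)

## Column indexing only works while the matrix structure survives
kept[, 1]
res <- try(dropped[, 1], silent = TRUE)
inherits(res, "try-error")
```

With two or more rows selected, drop = TRUE leaves the matrix alone,
which is why the problem only shows up for a single node.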

Cheers,

Josh

## Read in example data
trial <- structure(list(ENROLL_YN = structure(c(1L, 1L, 1L, 1L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("N",
"Y"), class = "factor"), MINORITY = c(0L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L)), .Names =
c("ENROLL_YN",
"MINORITY"), class = "data.frame", row.names = c(8566L, 7657L,
3155L, 6429L, 8651L, 7973L, 6L, 5865L, 5878L, 5037L, 6950L, 9139L,
960L, 3058L, 7979L, 2465L, 4231L, 1529L, 7500L, 8248L))

library(rpart)

## fit the model
## no errors, suggesting the problem is not here
m <- rpart(ENROLL_YN ~ MINORITY, data = trial, method="class")

## this throws an error
## makes me think that either some summary information
## or the print/show methods are the cause
summary(m)

## look at the class of the model object
class(m)

## look at the methods for summary
methods(summary)

## poke in the source code for summary.rpart
## (note: non-exported function, so using :::)
rpart:::summary.rpart

## we already know from your traceback() output the code to look for
## x$functions$summary
## looking at the summary.rpart source
## x is the model object
## so....

m$functions$summary

## yval, the first argument, evidently needs at least two dimensions
## and at least 2 columns
## back at the summary.rpart code, it looks like what is getting passed is

## else tprint <- x$functions$summary(ff$yval2[rows, , drop = TRUE],
##          ff$dev[rows], ff$wt[rows], ylevel, digits)

## ff is defined earlier as x$frame (where x is the model object), so

m$frame$yval2

## is a 1 x 5 matrix
## look what happens when we select all of it with drop = TRUE
m$frame$yval2[, , drop = TRUE]

## looking now at ?rpart.object where we learn that the frame element contains:
## Extra response information is in 'yval2', which contains the
##           number of events at the node (poisson), or a matrix
##           containing the fitted class, the class counts for each node
##           and the class probabilities (classification).  Also included
##           in the frame are 'complexity', the complexity parameter at
##           which this split will collapse, 'ncompete', the number of
##           competitor splits retained, and 'nsurrogate', the number of
##           surrogate splits retained.

## basically, the issue is, your model (at least in the example data)
## only has 1 node
## so the matrix has 1 row, and when drop = TRUE, this reduces yval2 to a vector
## which causes problems for the summary methods

## I am not familiar enough with rpart to say if this is as it should be
## or if perhaps a modification is in order

## for here and now, you can either just not use summary(),
## find a way to get more nodes,
## or create a copy of rpart:::summary.rpart where you change
## drop = TRUE to drop = FALSE
## around line 57 of the function.  Call it something new (like rpartSummary2)
## then rpartSummary2(m) and it will work
## I did this and got:

## > rpartSummary2(m)
## Call:
## rpart(formula = ENROLL_YN ~ MINORITY, data = trial, method = "class")
##   n= 20

##     CP nsplit rel error
## 1 0.01      0         1

## Node number 1: 20 observations
##   predicted class=N  expected loss=0.15
##     class counts:    17     3
##    probabilities: 0.850 0.150
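
If you would rather not copy and edit the function, the "just not use
summary()" option can be sketched as a small wrapper.  This is only a
rough sketch: summaryOrPrint is a name I made up (it is not part of
rpart), and the toy data are invented to give a root-only tree under
rpart's default control settings.

```r
library(rpart)

## Hypothetical helper (not part of rpart): call summary() only when the
## tree has more than one node, otherwise fall back to print()
summaryOrPrint <- function(fit, ...) {
  if (nrow(fit$frame) > 1L) {
    summary(fit, ...)
  } else {
    message("Tree has only a root node; printing the fit instead")
    print(fit)
    invisible(fit)
  }
}

## Toy data, made up for illustration: 17 "N", 3 "Y", and only four
## cases with MINORITY == 1, so no split is made with the defaults
toy <- data.frame(
  ENROLL_YN = factor(c(rep("N", 17), rep("Y", 3))),
  MINORITY  = c(rep(0L, 14), rep(1L, 3), 0L, 0L, 1L)
)
fit <- rpart(ENROLL_YN ~ MINORITY, data = toy, method = "class")

nrow(fit$frame)   # number of nodes (rows of the frame) in the fitted tree
summaryOrPrint(fit)
```

Each row of the frame element is one node, so nrow(fit$frame) == 1
is a quick way to spot the one-node case before calling summary().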

On Wed, Jan 11, 2012 at 1:31 PM, Amanda Marie Elling <elling at stolaf.edu> wrote:
> Hi Josh,
>    Thanks for getting back to us so fast!!
> We created a subset of 20 cases and still ran into the same issue, I have
> copied the code below along with the dput() and traceback() outputs.
>
>> trial=accept.students.n08[sample(1:5000,20),]
>> dput(trial[, c("ENROLL_YN", "MINORITY")])
> structure(list(ENROLL_YN = structure(c(1L, 1L, 1L, 1L, 2L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("N",
> "Y"), class = "factor"), MINORITY = c(0L, 0L, 1L, 0L, 0L, 0L,
> 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L)), .Names =
> c("ENROLL_YN",
> "MINORITY"), class = "data.frame", row.names = c(8566L, 7657L,
> 3155L, 6429L, 8651L, 7973L, 6L, 5865L, 5878L, 5037L, 6950L, 9139L,
> 960L, 3058L, 7979L, 2465L, 4231L, 1529L, 7500L, 8248L))
>> fit_rpart2=rpart(trial$ENROLL_YN~trial$MINORITY, method="class")
>> summary(fit_rpart2)
> Call:
> rpart(formula = trial$ENROLL_YN ~ trial$MINORITY, method = "class")
>   n= 20
>
>     CP nsplit rel error
> 1 0.01      0         1
>
> Error in yval[, 1] : incorrect number of dimensions
>> traceback()
> 3: x$functions$summary(ff$yval2[rows, , drop = TRUE], ff$dev[rows],
>        ff$wt[rows], ylevel, digits)
> 2: summary.rpart(fit_rpart2)
> 1: summary(fit_rpart2)
>
>> data.frame(trial$MINORITY,trial$ENROLL_YN)
>    trial.MINORITY trial.ENROLL_YN
> 1               0               N
> 2               0               N
> 3               1               N
> 4               0               N
> 5               0               Y
> 6               0               N
> 7               0               N
> 8               0               N
> 9               0               N
> 10              0               N
> 11              1               N
> 12              0               N
> 13              0               N
> 14              0               Y
> 15              0               N
> 16              0               N
> 17              0               N
> 18              1               N
> 19              1               Y
> 20              0               N
>
> We are still unsure what the error is referring to. Thoughts?? Let us know
> if you need anything else. Thanks so much for your help!
>
> Amanda
>
>
> On Sun, Jan 8, 2012 at 7:41 PM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
>>
>> Hi Amanda,
>>
>> Can you reproduce the error with a small subset of the data?  If so,
>> could you send it to us?  For instance if say 20 cases is sufficient,
>> you could send the output of dput() which pastes easily into the
>> console:
>>
>> dput(yourdata[, c("ENROLL_YN", "MINORITY")])
>>
>> You could also try calling traceback() after the error to get a bit
>> more diagnostics (and post those if they do not make any sense or help
>> you).
>>
>> Hope this helps,
>>
>> Josh
>>
>> On Sun, Jan 8, 2012 at 1:48 PM, Amanda Marie Elling <elling at stolaf.edu>
>> wrote:
>> > We are trying to make a decision tree using rpart and we are continually
>> > running into the following error:
>> >
>> >> fit_rpart=rpart(ENROLL_YN~MINORITY,method="class")
>> >> summary(fit_rpart)
>> > Call:
>> > rpart(formula = ENROLL_YN ~ MINORITY, method = "class")
>> >  n= 5725
>> >
>> >  CP nsplit rel error
>> > 1  0      0         1
>> > Error in yval[, 1] : incorrect number of dimensions
>> >
>> > ENROLL_YN is a categorical variable with two options- yes or no.
>> > MINORITY is also a categorical variable with two options- 0 or 1.
>> >
>> > We have confirmed that all variables are the same length and there are
>> > no
>> > NAs.
>> >
>> > Does anyone have any ideas that might help?? All thoughts would be
>> > appreciated, thanks!
>> >
>>
>>
>>
>> --
>> Joshua Wiley
>> Ph.D. Student, Health Psychology
>> Programmer Analyst II, Statistical Consulting Group
>> University of California, Los Angeles
>> https://joshuawiley.com/
>
>

-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/