[R] Why doesn't rpart split further

Kevin Li
Thu May 7 21:07:50 CEST 2020


Hi,

I am using the rpart package to construct regression trees and, for the
purposes of simulation, would like the tree to be split completely: each
leaf should contain exactly one observation.

However, I have observed that even by setting minsplit = 2, i.e.,

```
control <- rpart.control(
    minsplit = 2,
    cp = -1,
    xval = 0,
    maxcompete = 0,
    usesurrogate=0,
    maxdepth=30
)

model <- rpart(...., control = control)
```

the model will still have leaf nodes with more than one observation.
In fact, when I take the subset of my dataset that falls into the same
terminal leaf and run rpart on that subset alone, further splits do occur.
Any advice on why this is occurring? Thanks!


Best regards,
Kevin


P.S. A snippet to showcase the behavior above:

--

library(rpart)
library(data.table)

mu <- function(x, y, z) sin(10 * pi * x + 2 * y) - cos(10 * pi * y) + exp(z)

control <- rpart.control(
    minsplit = 2,
    cp = -1,
    xval = 0,
    maxcompete = 0,
    usesurrogate=0,
    maxdepth=30
)

gen.data <- function(n, sd = 0.5) {
    X <- matrix(runif(3 * n), ncol=3)
    colnames(X) <- c('x', 'y', 'z')
    e <- rnorm(n, sd = sd)

    X <- data.table(X)
    X[, mu := mu(x, y, z)]
    X[, A := mu + e]
    return(X[])
}

# Run rpart on the simulated dataset ...
set.seed(12321)
X <- gen.data(30000, sd = 0.1)
X[, i := .I]
mod <- rpart(A ~ x + y + z, X, control = control)

frame <- as.data.table(mod$frame, keep.rownames=TRUE)
frame[, rn := as.integer(rn)][, i := .I]
setnames(frame, "rn", "id")

splits <- as.data.table(mod$splits, keep.rownames=TRUE)
setnames(splits, "rn", "var")
splits[, var := factor(var)]

where <- data.table(i = seq(1, X[,.N]), where=mod$where)


# m = 7191 is the row of the leaf that contains the most observations,
# in this case 11.
m <- frame[var == "<leaf>"][order(-n)][1, i]
obs <- where[where == m, i]

# Collect those 11 observations into another data.table
X2 <- X[i %in% obs]

# observe that rpart will split on that subset again, why?
mod2 <- rpart(A ~ x + y + z, X2, control=control)
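
A smaller, self-contained way to inspect the leaf sizes is sketched below. It is only a sketch under assumptions: the toy data frame `d`, its size, and the names `ctrl`, `fit` and `leaf.sizes` are hypothetical and not part of the simulation above, and `minbucket = 1` is written out explicitly although it should already follow from `minsplit = 2`. On data this small there may be no leaf with more than one observation, so the printed values are not predicted here.

```r
library(rpart)

# Hypothetical toy data, much smaller than the simulation above.
set.seed(1)
d <- data.frame(x = runif(200), y = runif(200))
d$A <- sin(10 * pi * d$x) + d$y + rnorm(200, sd = 0.1)

# Same control settings as in the post; minbucket is spelled out explicitly.
ctrl <- rpart.control(
    minsplit = 2,
    minbucket = 1,
    cp = -1,
    xval = 0,
    maxcompete = 0,
    usesurrogate = 0,
    maxdepth = 30
)
fit <- rpart(A ~ x + y, data = d, control = ctrl)

# fit$where gives, for each observation, the row of fit$frame
# that holds the leaf the observation falls into.
leaf.sizes <- table(fit$where)

print(max(leaf.sizes))      # size of the largest leaf
print(sum(leaf.sizes > 1))  # number of leaves with more than one observation
```

If `max(leaf.sizes)` is greater than 1, the rows of `d` with `fit$where` equal to that leaf's row can be refitted on their own, as done with `X2` above.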


