[R] Sorting in trees problem

Axel Urbiz axel.urbiz at gmail.com
Wed Feb 24 17:45:24 CET 2016


As decision trees require sorting the variable used for splitting a given
node, I'm trying to avoid having this recurrent sorting by only sorting all
numeric variable first (and only once).

My attempt in doing this is shown in "Solution 2" below, but although I get
the desired result I think the %in% operation may be a costly one (and may
even offset the benefits of pre-sorting).

Any alternative solutions would be highly appreciated.

### Sample data

df <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
w <- rep(1L, nrow(df))
# w == 1L denote observation present in the current node
w[c(1, 8, 10)] <- 0L

### The problem: sort x1 within observations present in the current node

### Solution 1: slow for repeated sorting
nodeObsInd <- which(w == 1L)
sol1 <- df[nodeObsInd, ]
sol1 <- sol1[order(sol1$x1), ]$x1

### Solution 2: sort all variables initially only.

sort_fun <- function(x)
  index <- order(x)
  x <- x[index]
  data.frame(x, index) # the index gives original position of the obs

s_df <- lapply(df, function(x) sort_fun(x))
sol2 <- s_df[[1]][s_df$x1$index %in% nodeObsInd, ]

### check same result

all.equal(sol1, sol2$x)


	[[alternative HTML version deleted]]

More information about the R-help mailing list