[R] Gradient Boosting Trees with correlated predictors in gbm

Max Kuhn mxkuhn at gmail.com
Mon Mar 1 18:01:53 CET 2010


In theory, the choice between two perfectly correlated predictors is
random, so the importance should be "diluted" by roughly half for
each. In practice, however, this is implementation dependent.

For example, run this:

  set.seed(1)
  n <- 100
  p <- 10

  ## nine random predictors (V1-V9) plus "dup", an exact copy of V9
  data <- as.data.frame(matrix(rnorm(n*(p-1)), nrow = n))
  data$dup <- data[, p-1]

  ## the response depends only on the duplicated predictor
  data$y <- 2 + 4 * data$dup - 2 * data$dup^2 + rnorm(n)

  ## shuffle the column order
  data <- data[, sample(1:ncol(data))]

  str(data)

  library(gbm)
  fit <- gbm(y~., data = data,
             distribution = "gaussian",
             interaction.depth = 10,
             n.trees = 100,
             verbose = FALSE)
  summary(fit)

For gbm, the importance of two perfectly correlated predictors is
undiluted for the first one it finds (which is the dup variable under
this seed) and zero for the other. Change the random seed or the
column order and your results may vary.
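
To see that sensitivity directly, here is a quick follow-up sketch using
the objects created above (the new seed is arbitrary): permute the columns
again, refit, and check which of the two duplicated columns gets the credit.

  ## refit after permuting the column order; one of V9/dup still gets
  ## essentially all of the importance, but which one can change
  set.seed(2)
  data2 <- data[, sample(1:ncol(data))]
  fit_b <- gbm(y~., data = data2,
               distribution = "gaussian",
               interaction.depth = 10,
               n.trees = 100,
               verbose = FALSE)
  summary(fit_b)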

For randomForest, the importance is generally split roughly in half
between the two:

   library(randomForest)

   fit2 <- randomForest(y~., data = data, importance = TRUE)
   importance(fit2)
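
To look at just the pair in question, you can subset the importance matrix
by the names of the two duplicated columns from the simulation above:

   imp <- importance(fit2)
   imp[c("V9", "dup"), ]   # the two perfectly correlated predictors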

This is an extreme case (they are perfectly correlated) but
illustrates the theoretical and implementation issues.
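
If you want something closer to your setting (highly but not perfectly
correlated), one purely illustrative tweak is to make the copy noisy
instead of exact and refit; the sd of the added noise below is arbitrary.
You can then check how the relative influence splits between V9 and dup.

  ## replace the exact duplicate with a noisy copy (illustrative only)
  data$dup <- data$V9 + rnorm(n, sd = 0.1)
  fit3 <- gbm(y~., data = data,
              distribution = "gaussian",
              interaction.depth = 10,
              n.trees = 100,
              verbose = FALSE)
  summary(fit3)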

Max



On Sun, Feb 28, 2010 at 9:50 AM, Lars Bishop <lars52r at gmail.com> wrote:
> Dear R users,
>
> I’m trying to understand how correlated predictors impact the Relative
> Importance measure in Stochastic Boosting Trees (J. Friedman).  As Friedman
> described “…with single decision trees (referring to Breiman’s CART
> algorithm), the relative importance measure is augmented by a strategy
> involving surrogate splits intended to uncover the masking of influential
> variables by others highly associated with them. This strategy is most
> helpful with single decision trees where the opportunity for variables to
> participate in splitting is limited by the size of the tree. In the context
> of Boosting, however, the number of splitting opportunities is vastly
> increased, and surrogate unmasking is less essential”.
> Based on the results from the simulated example below, if I have, say two
> variables which are highly correlated, then the relative importance measure
> derived from Boosting will tend to be high for one of the predictors and low
> for the other. I’m trying to reconcile this observation with Friedman’s
> description above, according to which (as I understand it) these two variables
> should have about the same measure of importance. I'll appreciate your
> comments.
> require(gbm)
> require(MASS)
> # Generate multivariate random data such that X1 is moderately correlated
> # with X2, strongly correlated with X3, and not correlated with X4 or X5.
> cov.m <- matrix(c(1.0, 0.5, 0.9, 0.0, 0.0,
>                   0.5, 1.0, 0.2, 0.0, 0.0,
>                   0.9, 0.2, 1.0, 0.0, 0.0,
>                   0.0, 0.0, 0.0, 1.0, 0.0,
>                   0.0, 0.0, 0.0, 0.0, 1.0),
>                 5, 5, byrow = TRUE)
> n <- 2000 # obs
> X <- mvrnorm(n, rep(0, 5), cov.m)
> Y <- apply(X, 1, sum)
> SNR <- 10 # signal-to-noise ratio
> sigma <- sqrt(var(Y)/SNR)
> Y <- Y + rnorm(n,0,sigma)
> mydata <- data.frame(X,Y)
> #Fit Model (should take less than 20 seconds on an average modern computer)
> gbm1 <- gbm(formula = Y ~ X1 + X2 + X3 + X4 + X5,
>             data = mydata,
>             distribution = "gaussian",
>             n.trees = 500,
>             interaction.depth = 2,
>             n.minobsinnode = 10,
>             shrinkage = 0.1,
>             bag.fraction = 0.5,
>             train.fraction = 1,
>             cv.folds = 5,
>             keep.data = TRUE,
>             verbose = TRUE)
> ## Plot variable influence
> best.iter <- gbm.perf(gbm1, plot.it = T, method="cv")
> print(best.iter)
> summary(gbm1, n.trees = best.iter)  # based on the estimated best number of trees





