[R] Problems with randomForest for regression

Wed Oct 13 19:21:22 CEST 2004

Dear list,

I am trying to do a benchmark study for my case study. It is a regression 
problem. Among other models I use randomForest.

Using the following code the result is around 0.628, which makes sense 
compared with other methods. The theil function implements Theil's U 
statistic. I do not show the definitions of some variables because they are 
not needed to understand my problem. I use a sliding window strategy.

library("randomForest")

rf.theil <- vector()
learner='randomForest'

for (i in 1:6)
  {
   eval.sum <- 0
   test.pos=test.pos.ini

   while (test.pos <= n)
     {
      naive.pred <- c(orig.data[test.pos-1,7])
      model <- randomForest(Duracao ~ ., data=orig.data[1:(test.pos-1),], 
		     na.action=na.omit, ntree=5000, mtry=i)
      preds <- predict(model,
                       orig.data[test.pos:min(n,test.pos+relearn.step-1),])
      test.pos <- test.pos+relearn.step

      a <- theil(preds, naive.pred,
                 orig.data[test.pos:min(n,test.pos+relearn.step-1),7])
      if (is.na(a)==FALSE) {eval.sum <- eval.sum + a}
     }
   rf.theil <- c(rf.theil, eval.sum/(trunc((n-test.pos.ini)/relearn.step)+1))
  }

rf.min <- min(rf.theil, na.rm=TRUE)
rf.indices <- seq(along=rf.theil)[rf.theil == rf.min]

But when I run randomForest 5 times for each value of i and choose the best 
result according to the U statistic, I get a value around 0.178, and this 
value does not make sense. I use the same strategy with nnet and it gives 
good results. The code is:

library("randomForest")

rf.theil <- vector()

for (i in 1:6)
  {
   eval <- 100000
   eval.sum <- 0
   test.pos=test.pos.ini

   while (test.pos <= n)
     {
      naive.pred <- c(orig.data[test.pos-1,7])
      for (j in 1:5)
        {
         model <- randomForest(Duracao ~ ., data=orig.data[1:(test.pos-1),], 
		     na.action=na.omit, ntree=5000, mtry=i)
         preds <- predict(model,
			  orig.data[test.pos:min(n,test.pos+relearn.step-1),])
         eval.temp <- theil(preds, naive.pred, 
		       orig.data[test.pos:min(n,test.pos+relearn.step-1),7])
         if (eval.temp < eval)
           eval <- eval.temp
        }
      if (is.na(eval)==FALSE) 
	eval.sum <- eval.sum + eval
      test.pos <- test.pos+relearn.step
     }
   rf.theil <- c(rf.theil, eval.sum/(trunc((n-test.pos.ini)/relearn.step)+1))
  }

rf.min <- min(rf.theil, na.rm=TRUE)
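
The definition of theil() is not given in this post. For anyone trying to
reproduce the runs, here is a minimal sketch, assuming it is the ratio of the
model's root sum of squared errors to that of the naive forecast (one common
form of Theil's U). The argument names and the NA handling are assumptions,
and the actual function may differ.

```r
# Hypothetical sketch of theil(); the real definition is not shown in the post.
# Assumes U = sqrt(sum((pred - actual)^2)) / sqrt(sum((naive - actual)^2)).
theil <- function(preds, naive.pred, actual) {
  # Recycle a scalar naive forecast to the length of the test window
  naive <- rep(naive.pred, length.out = length(actual))
  # Keep only positions where both prediction and observation are present
  ok <- !is.na(preds) & !is.na(actual)
  num <- sqrt(sum((preds[ok] - actual[ok])^2))
  den <- sqrt(sum((naive[ok] - actual[ok])^2))
  # The ratio is undefined when the naive forecast is exact (or no data remain)
  if (den == 0) return(NA_real_)
  num / den
}

# Example call with made-up numbers: two predictions, one naive value
theil(c(2, 4), 1, c(3, 5))
```

Under this assumed definition, values below 1 mean the model's error is
smaller than that of the naive forecast on the same window.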

Thanks for any help

Joao Moreira