[R] memory problems when combining randomForests

Eleni Rapsomaniki e.rapsomaniki at mail.cryst.bbk.ac.uk
Sat Jul 29 15:14:55 CEST 2006


Hello again,

The reason I thought the order in which rows are passed to randomForest
affects the error rate is that I get different results depending on how I
split my positive/negative data.

First, read in the data (attached to this email):

pos.df <- read.table("C:/Program Files/R/rw2011/pos.df", header = TRUE)
neg.df <- read.table("C:/Program Files/R/rw2011/neg.df", header = TRUE)
library(randomForest)
# The first 2 columns are explanatory variables (which incidentally are not
# discriminative at all, judging by their distributions); the 3rd is the
# class (pos or neg)

train2test.ratio <- 8/10
min_len <- min(nrow(pos.df), nrow(neg.df))
class_index <- which(names(pos.df) == "class")  # same for neg.df
train_size <- as.integer(min_len * train2test.ratio)

############   Way 1
train.indicesP <- sample(seq_len(nrow(pos.df)), size = train_size, replace = FALSE)
train.indicesN <- sample(seq_len(nrow(neg.df)), size = train_size, replace = FALSE)

trainP <- pos.df[train.indicesP, ]
trainN <- neg.df[train.indicesN, ]
testP <- pos.df[-train.indicesP, ]
testN <- neg.df[-train.indicesN, ]

mydata.rf <- randomForest(
  x = rbind(trainP, trainN)[, -class_index],
  y = rbind(trainP, trainN)[, class_index],
  xtest = rbind(testP, testN)[, -class_index],
  ytest = rbind(testP, testN)[, class_index],
  importance = TRUE, proximity = FALSE, keep.forest = FALSE
)
mydata.rf$test$confusion
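(As a sanity check, in base R and on a hypothetical 100-row example, negative
indexing as used in Way 1 really does give a proper partition of the rows:)

```r
# Hypothetical example: draw an 80-row training sample from 100 rows and
# take the complement as the test set; the two index sets should be
# disjoint and together cover every row.
n <- 100
train_idx <- sample(seq_len(n), size = 80, replace = FALSE)
test_idx <- setdiff(seq_len(n), train_idx)
length(intersect(train_idx, test_idx)) == 0  # disjoint
length(union(train_idx, test_idx)) == n      # covers all rows
```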

##############   Way 2
ind <- sample(2, min_len, replace = TRUE,
              prob = c(train2test.ratio, 1 - train2test.ratio))
# which() keeps the row indices within the first min_len rows, so the
# logical vector is not recycled when a data frame has more than min_len rows
trainP <- pos.df[which(ind == 1), ]
trainN <- neg.df[which(ind == 1), ]
testP <- pos.df[which(ind == 2), ]
testN <- neg.df[which(ind == 2), ]

mydata.rf <- randomForest(
  x = rbind(trainP, trainN)[, -class_index],
  y = rbind(trainP, trainN)[, class_index],
  xtest = rbind(testP, testN)[, -class_index],
  ytest = rbind(testP, testN)[, class_index],
  importance = TRUE, proximity = FALSE, keep.forest = FALSE
)
mydata.rf$test$confusion

########### Way 3
subset_start <- 1
subset_end <- subset_start + train_size - 1   # inclusive end: exactly train_size rows
train_index <- subset_start:subset_end        # note seq(a:b) gives 1:length(a:b), not the range a..b
trainP <- pos.df[train_index, ]
trainN <- neg.df[train_index, ]
testP <- pos.df[-train_index, ]
testN <- neg.df[-train_index, ]

mydata.rf <- randomForest(
  x = rbind(trainP, trainN)[, -class_index],
  y = rbind(trainP, trainN)[, class_index],
  xtest = rbind(testP, testN)[, -class_index],
  ytest = rbind(testP, testN)[, class_index],
  importance = TRUE, proximity = FALSE, keep.forest = FALSE
)
mydata.rf$test$confusion

########### end

The first two methods give me an abnormally low error rate (compared with what
I get on the same data with a naiveBayes method), while the last one seems more
realistic, but the difference in error rates is very large. I need to use the
last method so that I can cross-validate sequential subsets of my data (the
first two methods draw random rows from throughout the data), unless there is a
better way to do it (?). Something must be fundamentally different between the
first two methods and the last, but which one is correct?
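To make concrete what I mean by sequential subsets, here is a minimal base-R
sketch of splitting row indices into contiguous blocks (the fold count k = 4
and n = 20 rows are just for illustration):

```r
# Minimal sketch: split n row indices into k contiguous, nearly equal
# blocks; on iteration i, block i would be the test set and the
# remaining blocks the training set.
block_folds <- function(n, k) {
  split(seq_len(n), cut(seq_len(n), breaks = k, labels = FALSE))
}

folds <- block_folds(20, k = 4)
# folds[[1]] is 1:5, folds[[2]] is 6:10, and so on
```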

I would greatly appreciate any suggestions on this!

Many Thanks
Eleni Rapsomaniki


