[R] Strategies to deal with unbalanced classification data in randomForest

Sat Mar 3 03:19:09 CET 2012

Hello all,

I have become somewhat confused by the options available for dealing
with a highly unbalanced data set (10000 observations in one class, 50
in the other). In summary, I am unsure:

a) if I am performing the two class weighting methods properly,
b) if the data are too unbalanced for this type of analysis to be
appropriate, and
c) if there is any interaction between the weighting for class
imbalances and the number of trees in a forest.

An example will illustrate this best. Say I have a data set like the following:

df <- rbind(
data.frame(var1=runif(10000, 10, 50),
           var2=runif(10000, -3, 3),
           var3=runif(10000, 0.1, 0.25),
           cls=factor("CLASS-1")
           ),
data.frame(var1=runif(50, 10, 50),
           var2=runif(50, 2, 7),
           var3=runif(50, 0.2, 0.35),
           cls=factor("CLASS-2")
           )
)

## Where the response vector is highly imbalanced like so:
summary(df$cls)

library(randomForest)
set.seed(17)

## Now this is obviously an extreme case, but I am wondering what the
## options are to deal with something like this.
## The problem with this situation manifests itself when I try to
## train a random forest without accounting for this imbalance:

df.rf<-randomForest(cls~var1+var2+var3, data=df,importance=TRUE)
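
One way to see how the imbalance shows up in the unweighted fit is to look at the out-of-bag (OOB) confusion matrix and the per-class error rates rather than the overall error. This is only a minimal sketch on the same simulated data; it assumes the randomForest package is installed, and the numbers will vary with the seed.

```r
library(randomForest)
set.seed(17)

# Same simulated data as above: 10000 majority and 50 minority observations
df <- rbind(
  data.frame(var1 = runif(10000, 10, 50), var2 = runif(10000, -3, 3),
             var3 = runif(10000, 0.1, 0.25), cls = factor("CLASS-1")),
  data.frame(var1 = runif(50, 10, 50), var2 = runif(50, 2, 7),
             var3 = runif(50, 0.2, 0.35), cls = factor("CLASS-2"))
)

df.rf <- randomForest(cls ~ var1 + var2 + var3, data = df, importance = TRUE)

# Rows are the true classes, columns the OOB predictions;
# the last column (class.error) is the per-class OOB error
print(df.rf$confusion)

# OOB error after the final tree: overall ("OOB") and for each class
print(df.rf$err.rate[df.rf$ntree, ])
```

A low overall OOB error can hide a much higher error on CLASS-2, so the per-class columns are the ones worth comparing between models.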

## Now one option is to down sample the majority class. However, I
## can't seem to find exactly how to do this. Does this seem correct?

df.rf.downsamp <- randomForest(cls~var1+var2+var3, data=df,
                               sampsize=c(50,50), importance=TRUE)
## 50 being the number of observations in the minority class
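
A possible variant of the down-sampling call makes the stratification explicit by passing the class factor as `strata` and deriving the per-class sample size from the data instead of hard-coding 50. This is a sketch, not a confirmed recommendation; the names `n.min` and `df.rf.strat` are made up for illustration, and the `sampsize` entries are assumed to follow the order of the factor levels.

```r
library(randomForest)
set.seed(17)

# Same simulated data as above
df <- rbind(
  data.frame(var1 = runif(10000, 10, 50), var2 = runif(10000, -3, 3),
             var3 = runif(10000, 0.1, 0.25), cls = factor("CLASS-1")),
  data.frame(var1 = runif(50, 10, 50), var2 = runif(50, 2, 7),
             var3 = runif(50, 0.2, 0.35), cls = factor("CLASS-2"))
)

# Size of the smallest class (50 here); hypothetical helper name
n.min <- min(table(df$cls))

# Each tree is grown on n.min draws from each class, sampled within strata
# (with replacement, since replace = TRUE is the default)
df.rf.strat <- randomForest(cls ~ var1 + var2 + var3, data = df,
                            strata = df$cls,
                            sampsize = c(n.min, n.min),
                            importance = TRUE)

print(df.rf.strat$confusion)
```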

## The other option, over which there seems to be some confusion, is to
## establish some class weights to balance the error rate. This approach
## I've mostly drawn from here:
## http://stat-www.berkeley.edu/users/breiman/RandomForests/cc_home.htm#balance
## This might not be appropriate, however, since as of September it looks
## like Breiman's method wasn't used in R
df.rf.weights <- randomForest(cls~var1+var2+var3, data=df,
                              classwt=c(1, 600), importance=TRUE)

## Nevertheless, what I am concerned about is the effect an
## unbalanced data set has on my randomForest model.
## For example:

par(mfrow=c(1,3))
plot(df.rf)
plot(df.rf.downsamp)
plot(df.rf.weights)

presents three very different scenarios, and I am having trouble resolving
the issues I mentioned above. I am extremely grateful for all the work
that has been done on randomForest in R up to this point. I was
hoping that someone with more experience might be able to advise
what the best strategy is to deal with this problem. Which of these
approaches is best, and am I using them right?
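
For question (c), one rough way to look at how the number of trees interacts with each strategy is to read the per-class OOB error off `err.rate` at a few forest sizes, which is the same information the plots draw. The sketch below assumes the default of 500 trees; the `fits` list and the `checkpoints` values are arbitrary choices for illustration.

```r
library(randomForest)
set.seed(17)

# Same simulated data as above
df <- rbind(
  data.frame(var1 = runif(10000, 10, 50), var2 = runif(10000, -3, 3),
             var3 = runif(10000, 0.1, 0.25), cls = factor("CLASS-1")),
  data.frame(var1 = runif(50, 10, 50), var2 = runif(50, 2, 7),
             var3 = runif(50, 0.2, 0.35), cls = factor("CLASS-2"))
)

# The three strategies from the post, fitted with the same formula
fits <- list(
  default    = randomForest(cls ~ var1 + var2 + var3, data = df),
  downsample = randomForest(cls ~ var1 + var2 + var3, data = df,
                            sampsize = c(50, 50)),
  classwt    = randomForest(cls ~ var1 + var2 + var3, data = df,
                            classwt = c(1, 600))
)

# err.rate has one row per tree; columns are "OOB" plus one per class
checkpoints <- c(50, 100, 250, 500)
for (nm in names(fits)) {
  cat("\n", nm, "\n")
  print(round(fits[[nm]]$err.rate[checkpoints, ], 4))
}
```

If the CLASS-2 column is still moving between the later checkpoints for one strategy but not for the others, that would suggest that strategy needs more trees before its error estimate settles.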

Thanks so much in advance for any help.

Sam

> sessionInfo()
R version 2.14.2 (2012-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8
 [4] LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] ggplot2_0.8.9 plyr_1.7.1    tools_2.14.2