[R] Fwd: cforest sampling methods

Torsten Hothorn Torsten.Hothorn at uzh.ch
Thu Mar 20 10:10:42 CET 2014


>
> Hi all,
>
> I've been using the randomForest package and I'm trying to make the switch
> over to party. My problem is that I have an extremely unbalanced outcome
> (only 1% of the data has a positive outcome) which makes resampling methods
> necessary.
>
> randomForest has a very useful argument, sampsize, which allows me to
> use a balanced subsample to build each tree in my forest. Let's say the
> number of positive cases is 100; my forest would look something like this:
>
> rf <- randomForest(y ~ ., data = train, ntree = 800, replace = TRUE,
>                    sampsize = c(100, 100))
>
> So I use 100 cases and 100 controls to build each individual tree. Can I do
> the same for cforest? I know I can always upsample, but I'd rather not.
>
> I've tried playing around with the weights argument but I'm either not
> getting it right or it's just the wrong thing to use.

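Weights are your friend here. Suppose you have 100 observations of the
first class and 1000 observations of the second. Using weights 1 / 100 for
the class-one observations and 1 / 1000 for the class-two observations gives
each class the same total weight, so a resample drawn with these weights is
balanced on average: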

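# toy outcome: 100 observations of class 1, 1000 of class 2
y <- gl(2, 1)[c(rep(1, 100), rep(2, 1000))]
# per-observation weight = 1 / (size of that observation's class)
w <- 1 / (table(y))[y]
# draw one resample of size length(y) and count how often each class is drawn
tapply(rmultinom(n = 1, size = length(y), prob = w), y, sum)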
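To check that this holds beyond a single draw, here is a minimal sketch in base R that repeats the resampling and averages the class counts. The seed, the number of replications (200), and the names class_weight, draws and mean_counts are illustrative assumptions, not part of the original example. The commented-out last line shows how such a vector might be handed to cforest through its weights argument; the data frame name train is taken from the question, and the call is not run here.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# Toy outcome: 100 observations of class 1, 1000 of class 2
y <- factor(c(rep(1, 100), rep(2, 1000)))

# Inverse class-frequency weights: 1/100 for class 1, 1/1000 for class 2
w <- as.numeric(1 / table(y)[y])

# Total weight per class; both classes should sum to 1
class_weight <- tapply(w, y, sum)

# Repeat the weighted resampling 200 times (an arbitrary choice) and
# average how many draws land in each class
draws <- replicate(
  200,
  tapply(rmultinom(n = 1, size = length(y), prob = w), y, sum)
)
mean_counts <- rowMeans(draws)
print(class_weight)
print(mean_counts)

# Not run: assumes the party package is installed and that `train`
# contains the outcome y, with w computed from train$y as above.
# fit <- party::cforest(y ~ ., data = train, weights = w)
```

Both classes should end up with roughly half of the 1100 draws each, even though class 2 has ten times as many observations.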

Best,

Torsten


>
> Any advice on how to adapt cforests to datasets with imbalanced outcomes is
> greatly appreciated...
>
>
>
> Thanks!
