[R] Random Forest with highly imbalanced data

Liaw, Andy andy_liaw at merck.com
Wed May 12 21:54:26 CEST 2004


Breiman & Cutler's version 5 of the Fortran code implements a weighting
scheme that is more effective than the old classwt.  Basically:

1. Class weights are used in computing the Gini index.
2. At terminal nodes, weighted votes are taken to determine the prediction
   for the node.
3. Average weights within terminal nodes are computed and used as weights
   for the final weighted vote.
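
A rough sketch of how these three steps might look in R follows. This is
only one reading of the description above, not a port of the Fortran code;
all function names and the example counts and weights are hypothetical.

```r
# Hypothetical helpers illustrating the three steps above.
# They are NOT taken from Breiman & Cutler's Fortran source.

# Step 1: Gini impurity computed from class counts scaled by class weights.
weighted_gini <- function(counts, classwt) {
  w <- counts * classwt
  p <- w / sum(w)
  1 - sum(p^2)
}

# Step 2: a terminal node predicts the class with the largest weighted count.
node_prediction <- function(counts, classwt) {
  names(counts)[which.max(counts * classwt)]
}

# Step 3a: average class weight of the cases that fall in the node.
node_weight <- function(counts, classwt) {
  sum(counts * classwt) / sum(counts)
}

# Step 3b: final vote for one case; each tree contributes the class of its
# terminal node, weighted by that node's average weight.
forest_vote <- function(node_classes, node_weights) {
  totals <- tapply(node_weights, node_classes, sum)
  names(totals)[which.max(totals)]
}

# Made-up node: 90 cases of class "0", 10 cases of class "1".
counts  <- c("0" = 90, "1" = 10)
classwt <- c("0" = 1, "1" = 10)   # assumed weights, would need tuning

weighted_gini(counts, c("0" = 1, "1" = 1))   # unweighted impurity
weighted_gini(counts, classwt)               # impurity with class weights
node_prediction(counts, classwt)             # weighted majority class
node_weight(counts, classwt)                 # weight this node carries
```

With equal weights the node would be assigned to the majority class; with
the assumed weights the minority class wins the node vote, which is the
effect the weighting is meant to produce.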

This scheme has not yet been implemented in the R version of the package
(which is one of the reasons the package version number is still 4.x-y
instead of 5.x-y).  Do note that one usually needs to `tune' the class
weights a bit to get the desired result.

The current version of the R package does offer the sampsize option; e.g.,
randomForest(..., sampsize=c(100, 100), ...) will draw 100 cases from each
class, with replacement, to grow each tree.  (This is the `down-sampling'
approach.)  We have found this to work quite well in general.
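
A minimal, self-contained sketch of the sampsize approach on simulated
data is given below.  The data, the predictor names and the choice of
ntree are made up for illustration; only the randomForest() arguments
(strata, sampsize, replace, ntree) come from the package itself.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(1)

# Simulated, imbalanced data: 1800 cases of class 0 and 200 of class 1.
# The predictors 'age' and 'los' are hypothetical stand-ins.
y <- factor(rep(c(0, 1), times = c(1800, 200)))
x <- data.frame(
  age = rnorm(2000, mean = ifelse(y == 1, 1, 0)),
  los = rnorm(2000)
)

# For every tree, draw 100 cases from each class with replacement,
# so each tree is grown on a balanced sample.
fit <- randomForest(x, y,
                    strata   = y,
                    sampsize = c(100, 100),
                    replace  = TRUE,
                    ntree    = 200)

# Out-of-bag confusion matrix, with the per-class error in the last column.
print(fit$confusion)
```

The per-class sample sizes play much the same role as class weights, so
they may also need some tuning to reach the desired balance between the
two error rates.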

[Advertisement:  I will present both at the Interface in a few weeks.]

Best,
Andy

> From: Kel
> 
> Hi group,
> 
> I am trying to fit a random forest (RF) with approx. 250,000
> cases.  My objective is to determine the risk factors
> for a person being readmitted to hospital (response=1)
> or not (response=0).  Only 10%, or 25,000 cases, were
> readmitted.  I've heard about the down-sampling and class
> weight approaches and am wondering if R can do them.  Even
> some references to articles would help.
> 
> From a statistical point of view, is there any rule
> of thumb for the positive/negative response ratio at
> which such an adjustment has to be applied?
> 
> Thank you so much.  
> 
> Regards,
> Kelvin
> 