[R] outliers using Random Forest

Liaw, Andy andy_liaw at merck.com
Sun Apr 18 22:24:53 CEST 2004


The thing to do is probably:

1. Use a fairly large number of trees (e.g., 1000).
2. Run the procedure a few times and average the results.
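The two steps above might be sketched in R roughly as follows (a minimal sketch, assuming the randomForest package; the data matrix `x` and the cutoff used for flagging are illustrative, not prescriptions):

```r
library(randomForest)

# Illustrative data: a numeric predictor matrix with no class labels.
x <- iris[, 1:4]

# Run unsupervised randomForest several times with many trees, and
# average the per-case outlying measures across the runs.
n.runs <- 5
out <- replicate(n.runs, {
  rf <- randomForest(x, ntree = 1000, proximity = TRUE)
  outlier(rf$proximity)
})

avg.outlier <- rowMeans(out)

# Flag cases whose averaged measure is unusually large
# (the threshold of 10 here is only an example).
which(avg.outlier > 10)
```

Averaging over several independent forests smooths out both sources of variability discussed below.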

The reason for the instability is roughly two-fold:

1. The random forest algorithm itself is based on randomization.  That's why
it's probably a good idea to grow 500-1000 trees, to get more stable
proximity measures (on which the outlying measures are based).

2. If you are running randomForest in unsupervised mode (i.e., not giving it
the class labels), then the program treats the data as "class 1", creates a
synthetic "class 2", and runs the classification algorithm to get the
proximity measures.  You probably need to run the algorithm a few times so
that the result is based on several simulated data sets, instead of just
one.
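The synthetic-class construction can be sketched by hand, which makes the second source of variability visible (a sketch based on the documented behavior, not the package's internal code; the data `x` is again illustrative):

```r
library(randomForest)

x <- iris[, 1:4]

# The synthetic "class 2" is drawn from the product of the marginal
# distributions: each column is resampled independently, which destroys
# the dependence structure of the real data.
x2 <- as.data.frame(lapply(x, function(col) sample(col, replace = TRUE)))

dat <- rbind(x, x2)
y <- factor(rep(1:2, each = nrow(x)))

rf <- randomForest(dat, y, ntree = 1000, proximity = TRUE)

# Only the proximities among the real cases ("class 1") feed into
# the outlying measure.
prox.real <- rf$proximity[seq_len(nrow(x)), seq_len(nrow(x))]
```

Because `x2` is freshly simulated each time, every run sees a different synthetic class, which is why averaging over several runs helps.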

HTH,
Andy

> From: Edgar Acuna
> 
> Hello,
> Does anybody know if the outscale option of randomForest yields the
> standardized version of the outlier measure for each case, or are
> the results
> only the raw values?  Also, I have noticed that this measure shows
> very high variability: if I repeat the experiment I get
> very
> different values for this measure, and it is hard to flag the outliers.
> This does not happen with two other criteria that I am using: LOF and
> Bay's Orca.  I am getting several cases that can be considered
> outliers
> with both approaches.
> I run my experiments using Bupa and Diabetes, available at the
> UCI Machine Learning repository.
> 
> Thanks in advance for any response.
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html
> 
> 

