[R] outliers using Random Forest

Liaw, Andy andy_liaw at merck.com
Mon Apr 19 14:30:42 CEST 2004


> From: Edgar Acuna [mailto:edgar at cs.uprm.edu] 
> 
> Dear Andy,
> Thanks for your quick answer. I increased the number of trees and the
> outlyingness measure became more stable. But I still do not know whether
> I am working with the raw measure or with the normalized measure mentioned
> in Breiman's Wald lecture. The normalized measure nout is
> 
> nout=(nout-med)/mean(abs(nout-med))
> where med is the median of the class containing the case corresponding
> to nout.

Looking at the Fortran subroutine `locateout' in rfsub.f, yes, they are
normalized.  (That part of the code is not changed from Breiman & Cutler's
original.)
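
The normalization quoted above can be sketched as follows. This is only a
minimal illustration in Python, not the Fortran in rfsub.f: the function name
normalize_outlyingness is hypothetical, and the sketch assumes that the mean
in the quoted formula is taken over the same class as the median.

```python
from statistics import mean, median


def normalize_outlyingness(raw, labels):
    """Centre each raw score on its class median and divide by the class's
    mean absolute deviation from that median (the formula quoted above)."""
    # Group the raw scores by class label.
    by_class = {}
    for score, label in zip(raw, labels):
        by_class.setdefault(label, []).append(score)

    # Per-class median and mean absolute deviation from the median.
    stats = {}
    for label, scores in by_class.items():
        med = median(scores)
        scale = mean([abs(s - med) for s in scores])
        stats[label] = (med, scale)

    normalized = []
    for score, label in zip(raw, labels):
        med, scale = stats[label]
        # Assumption of this sketch: a class with zero spread gets 0.0
        # instead of a division by zero.
        normalized.append((score - med) / scale if scale > 0 else 0.0)
    return normalized
```

Cases whose normalized value is large relative to the rest of their class are
the candidates for outliers; any cut-off is a judgment call and is not taken
from the thread.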

Andy

 
> Best regards
> Edgar Acuna
> 
> On Sun, 18 Apr 2004, Liaw, Andy wrote:
> 
> > The thing to do is probably:
> >
> > 1. Use a fairly large number of trees (e.g., 1000).
> > 2. Run a few times and average the results.
> >
> > The reason for the instability is twofold:
> >
> > 1. The random forest algorithm itself is based on randomization.  That's
> > why it's probably a good idea to have 500-1000 trees to get more stable
> > proximity measures (on which the outlying measures are based).
> >
> > 2. If you are running randomForest in unsupervised mode (i.e., not giving
> > it the class labels), then the program treats the data as "class 1",
> > creates a synthetic "class 2", and runs the classification algorithm to
> > get the proximity measures.  You probably need to run the algorithm a few
> > times so that the result will be based on several simulated data sets,
> > instead of just one.
> >
> > HTH,
> > Andy
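
The "run a few times and average" advice above can be sketched like this. It
is a toy illustration in Python, not a call into randomForest: run_once is a
hypothetical callable standing in for one complete fit that returns one
outlyingness score per case, and noisy_scores is an invented stand-in that
only simulates run-to-run variability.

```python
import random
from statistics import mean


def average_outlyingness(run_once, n_runs=10, base_seed=0):
    """Call run_once(seed) n_runs times and average the scores case by case."""
    runs = [run_once(base_seed + i) for i in range(n_runs)]
    # zip(*runs) groups together the scores that every run gave to one case.
    return [mean(case_scores) for case_scores in zip(*runs)]


def noisy_scores(seed):
    """Toy stand-in for one fit: fixed scores plus seed-dependent noise."""
    rng = random.Random(seed)
    underlying = [0.1, -0.3, 8.0, 0.2]  # invented values; case 2 is the outlier
    return [s + rng.gauss(0, 1.0) for s in underlying]


averaged = average_outlyingness(noisy_scores, n_runs=200)
```

Using a different seed for each run matters in unsupervised mode, since each
run should then see a different synthetic "class 2".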
> >
> > > From: Edgar Acuna
> > >
> > > Hello,
> > > Does anybody know if the outscale option of randomForest yields the
> > > standardized version of the outlier measure for each case, or are the
> > > results only the raw values? Also, I have noticed that this measure
> > > presents very high variability. I mean that if I repeat the experiment
> > > I get very different values for this measure, and it is hard to flag
> > > the outliers. This does not happen with the two other criteria that I
> > > am using: LOF and Bay's Orca. I am getting several cases that can be
> > > considered as outliers with both approaches.
> > > I ran my experiments using Bupa and Diabetes, available at the UCI
> > > Machine Learning Repository.
> > >
> > > Thanks in advance for any response.
> > >