[R] Outlier removal techniques

Nordlund, Dan (DSHS/RDA) NordlDJ at dshs.wa.gov
Thu Feb 9 18:45:20 CET 2012


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Frank Harrell
> Sent: Thursday, February 09, 2012 9:19 AM
> To: r-help at r-project.org
> Subject: Re: [R] Outlier removal techniques
> 
> I wonder why it is still standard practice in some circles to search for
> "outliers" as opposed to using robust/resistant methods.
> 
> Here is a great paper with a scientific approach to "outliers":
> 
> @Article{fin06cal,
>   author =  {Finney, David J.},
>   title =   {Calibration guidelines challenge outlier practices},
>   journal = {The American Statistician},
>   year =    2006,
>   volume =  60,
>   pages =   {309--313},
>   annote =  {anticoagulant therapy; bias; causation; ethics; objectivity;
>              outliers; guidelines for treatment of outliers; overview of
>              types of outliers; letter to the editor and reply 61:187
>              May 2007}
> }
> 
> Frank
> 
> Rich Shepard wrote
> >
> > On Thu, 9 Feb 2012, mails wrote:
> >
> >> I need to analyse a data matrix with dimensions of 30x100. Before
> >> analysing the data there is, however, a need to remove outliers from the
> >> data. I have read quite a lot about outlier removal already, and the
> >> most common technique for that seems to be Principal Component Analysis
> >> (PCA). However, I think that this technique is quite subjective. When is
> >> an outlier an outlier? I uploaded an example PCA plot here:
> >
> >    Those more expert than I will certainly provide answers. What I do with
> > new data is create box-and-whisker plots (I use the lattice package), which
> > define outliers as those data lying more than 1.5 times the interquartile
> > range beyond the first or third quartile.
> >
> >    No one but you can answer your question on when an outlier is an outlier.
> > It depends on your data set and the context of the data. For example, a
> > water chemistry value that far exceeds a regulatory threshold might be
> > meaningful in the context of a one-off excursion (in which case it's not an
> > outlier but a real data point), or it might result from a handling,
> > instrumentation, or analytical error (in which case toss it as an outlier).
> >
> > Rich
> >
> 
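
As an illustration of the box-and-whisker rule mentioned in the quoted message, here is a minimal base R sketch. The data vector is made up for the example (it is not from the thread), and the fences use the usual convention of 1.5 times the interquartile range beyond the quartiles. Note that it only flags candidate points; whether a flagged value is an error or a real observation still depends on context.

```r
# Hypothetical measurements: twelve ordinary values and one extreme value.
x <- c(8.1, 8.7, 9.0, 9.2, 9.5, 9.8, 10.0, 10.1, 10.3, 10.6, 10.9, 11.2, 25.0)

# First and third quartiles and the interquartile range.
q   <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values outside the fences are flagged for inspection, not deleted.
flagged <- x[x < lower | x > upper]
print(flagged)

# boxplot.stats() applies the same kind of rule (based on hinges, coef = 1.5)
# and returns the flagged points in $out.
print(boxplot.stats(x)$out)
```

For this particular vector the quartiles and the hinges coincide, so both approaches flag the same single value; in general they can differ slightly because hinges and quantile() use different definitions.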

I would echo what Frank says.  I would also add that, in the absence of demonstrated measurement or recording errors, there is good reason to "explain" the extreme values as well as the typical values.  If a model can't deal with extreme values, it may be good enough for some purposes, but it is not a "complete" explanation and may fail at the worst time.  I would highly recommend the book "The Black Swan" by Nassim Nicholas Taleb (NOT the ballet story).
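
To make the point about robust/resistant methods concrete, here is a small base R sketch comparing classical and resistant summaries with and without one extreme value. The numbers are invented for illustration, and median() and mad() are just two simple examples of resistant estimators, not necessarily the methods Frank has in mind.

```r
# Hypothetical data: twelve ordinary values, then the same values plus one
# extreme observation that is kept in the data rather than removed.
x_clean <- c(8.1, 8.7, 9.0, 9.2, 9.5, 9.8, 10.0, 10.1, 10.3, 10.6, 10.9, 11.2)
x_all   <- c(x_clean, 25.0)

# Location: the mean is pulled toward the extreme value, the median barely moves.
location <- rbind(
  clean = c(mean = mean(x_clean), median = median(x_clean)),
  all   = c(mean = mean(x_all),   median = median(x_all))
)
print(location)

# Scale: the standard deviation inflates, the median absolute deviation
# (mad, scaled by its default constant) changes comparatively little.
scale_est <- rbind(
  clean = c(sd = sd(x_clean), mad = mad(x_clean)),
  all   = c(sd = sd(x_all),   mad = mad(x_all))
)
print(scale_est)
```

The extreme value stays in the analysis in both cases; the resistant summaries simply give it less leverage, which avoids having to decide up front whether it "is" an outlier.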


Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204



