[R] Outlier removal techniques

Frank Harrell f.harrell at vanderbilt.edu
Thu Feb 9 18:19:04 CET 2012


I wonder why it is still standard practice in some circles to search for
"outliers" as opposed to using robust/resistant methods.
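
To make the contrast concrete, here is a minimal sketch in base R. The data
vector is hypothetical (nine ordinary readings plus one gross error), and the
choice of median and MAD is just one simple example of resistant estimators,
not a recommendation for any particular analysis.

```r
# Hypothetical data: nine ordinary readings plus one gross error.
x <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 100)

# Classical summaries are dragged toward the single wild value.
classical <- c(location = mean(x), scale = sd(x))

# Resistant summaries barely move. By default mad() is rescaled so that
# it estimates the standard deviation when the data are normal.
resistant <- c(location = median(x), scale = mad(x))

print(rbind(classical, resistant))
```

Nothing is deleted here: the wild value stays in the data, and the estimator
simply gives it little influence, so no decision about what counts as an
"outlier" is needed.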

Here is a great paper with a scientific approach to "outliers":

@Article{fin06cal,
  author = 		 {Finney, David J.},
  title = 		 {Calibration guidelines challenge outlier practices},
  journal = 	 {The American Statistician},
  year = 		 2006,
  volume =		 60,
  pages =		 {309-313},
  annote =		 {anticoagulant
therapy;bias;causation;ethics;objectivity;outliers;guidelines for
treatment of outliers;overview of types of outliers;letter to the editor and
reply 61:187 May 2007}
}

Frank

Rich Shepard wrote
> 
> On Thu, 9 Feb 2012, mails wrote:
> 
>> I need to analyse a data matrix with dimensions of 30x100. Before
>> analysing the data there is, however, a need to remove outliers from the
>> data. I read quite a lot about outlier removal already and I think the
>> most common technique for that seems to be Principal Component Analysis
>> (PCA). However, I think that this technique is quite subjective. When is
>> an outlier an outlier? I uploaded an example PCA plot here:
> 
>    Those more expert than I will certainly provide answers. What I do with
> new data is create box-and-whisker plots (I use the lattice package), which
> flag as outliers those data lying more than 1.5 times the interquartile
> range beyond the first or third quartile.
> 
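
The box-and-whisker rule mentioned above can be sketched in base R as follows.
The data are hypothetical, and the fences are computed with quantile() only for
illustration; this is a sketch of the convention, not of what any particular
plotting function does internally.

```r
# Hypothetical data, for illustration only.
x <- c(2.1, 2.4, 2.2, 2.8, 2.5, 2.3, 2.6, 9.5)

q     <- quantile(x, c(0.25, 0.75))   # first and third quartiles
iqr   <- q[[2]] - q[[1]]              # interquartile range
lower <- q[[1]] - 1.5 * iqr           # lower fence
upper <- q[[2]] + 1.5 * iqr           # upper fence

# Values outside the fences are the ones a boxplot draws as points.
flagged <- x[x < lower | x > upper]
print(flagged)

# boxplot.stats() applies the same 1.5 * IQR idea, but it uses Tukey's
# hinges rather than quantile(), so its fences can differ slightly.
print(boxplot.stats(x)$out)
```

The rule only flags points for inspection; whether a flagged value is an error
or a real observation still depends on the context of the data.
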
>    No one but you can answer your question on when an outlier is an
> outlier. It depends on your data set and the context of the data. For
> example, a water chemistry value that far exceeds a regulatory threshold
> might be meaningful in the context of a one-off excursion (in which case
> it's not an outlier but a real data point) or it might result from a
> handling, instrumentation, or analytical error (in which case toss it as
> an outlier).
> 
> Rich


-----
Frank Harrell
Department of Biostatistics, Vanderbilt University


