[R] removing outlier --> use robust regression !

Martin Maechler maechler at stat.math.ethz.ch
Tue Sep 15 10:37:18 CEST 2015

>>>>> Juli  <Julianeleuschner at web.de>
>>>>>     on Sat, 12 Sep 2015 02:32:39 -0700 writes:

> Hi Jim, thank you for your help. :)

> My point is that there are outliers and I don't really
> know how to deal with them.

> I need the data frame for a regression and have often read that
> only a few outliers can change your results very much. In
> addition, regression diagnostics didn't give me the
> best results.  Yes, and I know it's not the core of
> statistics to work in a way that gets you the results you would
> like to have ;).

> So what is your suggestion?

Use robust regression, e.g.
MASS::rlm()  {part of every R installation},
or a somewhat better and more sophisticated version,
lmrob()  from package 'robustbase' {yes, shameless promotion}.
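
A minimal sketch of the difference, on simulated data (the data, the seed and all variable names are made up for illustration and are not from the original question). It fits the same line by least squares and by MASS::rlm(), then prints the weights rlm() ended up giving each observation; the lmrob() part is only attempted if 'robustbase' happens to be installed.

```r
library(MASS)  # rlm(); MASS ships with standard R distributions

set.seed(1)                              # arbitrary seed, illustration only
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.3)   # true slope is 0.5
y[18] <- y[18] + 18                      # plant one gross outlier

fit_ls  <- lm(y ~ x)                     # ordinary least squares
fit_rob <- rlm(y ~ x)                    # Huber M-estimator (the rlm() default)

print(rbind(lm = coef(fit_ls), rlm = coef(fit_rob)))

# IWLS weights at convergence: 1 = full weight, near 0 = heavily downweighted
print(round(fit_rob$w, 2))

if (requireNamespace("robustbase", quietly = TRUE)) {
  fit_mm <- robustbase::lmrob(y ~ x)     # MM-estimator
  print(coef(fit_mm))
  print(round(weights(fit_mm, type = "robustness"), 2))
}
```

The planted observation should receive by far the smallest weight, while the remaining points keep a weight at or near 1, so no observation has to be deleted by hand.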

Further:

1) Removing outliers is not at all the best way to deal with such
problems (intuitively, because it is a *dis*continuous method:
an observation is either kept with full weight or dropped entirely).
Rather, they should be downweighted (continuously, as
happens with the methods used in  rlm() or lmrob(), see above).

2) Removing outliers in a *multivariate* setting, if you want to do
it in spite of 1), by using univariate treatment {each column
separately, as you do here} is often completely insufficient.  E.g.
the bivariate outlier  in
xy <- cbind(x= c(2,1:9), y=c(8,1:9));  plot(xy)
cannot be found by looking at 'x' and 'y' separately.

3) If, in spite of 1) and 2), you are considering univariate
treatment: using mean() and sd() for detecting univariate outliers
was shown to be insufficient more than 50 years ago (*1), and
if one looks closer into the literature (say "L_1") even
considerably longer ago.
[...] what you should do.  Hampel's rule (*3)
proposes declaring outliers for the observations outside [...]
*1 Tukey, J. W. (1960) A survey of sampling from contaminated distributions.
In Contributions to Probability and Statistics,
eds I. Olkin, S. Ghurye, W. Hoeffding, W. Madow and H. Mann,
pp. 448–485. Stanford: Stanford University Press.

*2 Another (less robust, but still infinitely better than mean/sd) approach
uses  median() and IQR() which is
basically/approximately what boxplots do to identify outliers.

*3 Hampel, F. R. (1985) The breakdown points of the mean combined with
some rejection rules.  Technometrics 27(2), 95–107.
[ http://dx.doi.org/10.1080/00401706.1985.10488027 ]
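
Returning to the bivariate example in point 2): a small sketch of why column-wise checks miss the first row of xy while a joint check finds it. The use of classical Mahalanobis distances with a chi-squared cutoff is an illustration chosen here, not a method proposed in the message, and the cutoffs are assumed conventional values. On real data a robust location/scatter estimate (for instance MASS::cov.rob() or robustbase::covMcd()) would usually be preferable, since the classical mean and covariance can be distorted by outliers just as mean() and sd() are.

```r
xy <- cbind(x = c(2, 1:9), y = c(8, 1:9))   # example from point 2)

# Column-wise check (median/MAD, assumed cutoff of 3): no value stands out
univ <- apply(xy, 2, function(v) abs(v - median(v)) > 3 * mad(v))
colSums(univ)                               # x: 0, y: 0

# Joint check: squared Mahalanobis distance of each row from the centre
d2 <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))
cutoff <- qchisq(0.975, df = ncol(xy))      # assumed, conventional cutoff
which(d2 > cutoff)                          # row 1, the point (2, 8)
```

Both coordinates of the first row are perfectly ordinary on their own; only their combination breaks the pattern y = x followed by the other nine points.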