[R] outlier identification: is there a redundancy-invariant substitution for mahalanobis distances?

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Jan 21 18:35:22 CET 2004


Your extra column is not redundant: it carries additional information,
and outliers in that column after removing the effects of the other
columns are still multivariate outliers.
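
A minimal R sketch of this point, using simulated data that is not from
the original post (the variables x1, x2, x3 and the shifted first row are
assumptions made for illustration). With the classical mean and covariance,
the squared distance on all three columns equals the squared distance on the
first two plus the squared standardized residual of the new column regressed
on the others:

```r
set.seed(2004)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
# Hypothetical "redundant" column: almost a linear function of x1 and x2
x3 <- x1 + x2 + rnorm(n, sd = 0.01)
# One row breaks the near-collinear relation by a modest absolute amount
x3[1] <- x3[1] + 0.5

X2 <- cbind(x1, x2)
X3 <- cbind(x1, x2, x3)

# Squared Mahalanobis distances without and with the extra column
d2_two   <- mahalanobis(X2, colMeans(X2), cov(X2))
d2_three <- mahalanobis(X3, colMeans(X3), cov(X3))

# Residuals of the new column after removing the linear effect of the others,
# scaled by the residual variance with denominator n - 1 (to match cov())
res   <- residuals(lm(x3 ~ x1 + x2))
extra <- res^2 / (sum(res^2) / (n - 1))

# The two sides agree up to numerical tolerance
all.equal(unname(d2_three), unname(d2_two + extra))

# Row with the largest distance once the extra column is included
which.max(d2_three)
```

The row that breaks the collinear relation is unremarkable in x1 and x2
alone, but it is far out in the one direction that the extra column adds.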

Effectively you have added one more dimension to the sphered point cloud, 
and Mahalanobis distance is Euclidean distance after sphering.
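
As a rough check of the sphering statement, assuming simulated data and the
classical (non-robust) mean and covariance, one can sphere with the Cholesky
factor and compare squared Euclidean norms with mahalanobis(). The data and
variable names below are illustrative only:

```r
set.seed(1)
# Simulated correlated data (hypothetical, not from the original post)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
X  <- cbind(x1, x2)

mu <- colMeans(X)
S  <- cov(X)

# mahalanobis() returns *squared* distances
d2 <- mahalanobis(X, center = mu, cov = S)

# Sphere: centre, then multiply by the inverse of the Cholesky factor.
# chol(S) returns upper-triangular R with t(R) %*% R equal to S.
R <- chol(S)
Z <- sweep(X, 2, mu) %*% solve(R)

# Squared Euclidean norms of the sphered points
e2 <- rowSums(Z^2)

all.equal(unname(d2), unname(e2))  # same distances
round(cov(Z), 10)                  # sphered data has identity covariance
```

Any other square root of the covariance matrix would do equally well for
sphering; the Cholesky factor is just a convenient choice here.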

On Wed, 21 Jan 2004, "Jens Oehlschlägel" wrote:

> 
> 
> Dear R-experts,
> 
> Searching the help archives I found a recommendation to do multivariate
> outlier identification by mahalanobis distances based on a robustly estimated
> covariance matrix and compare the resulting distances to a chi^2-distribution
> with p (number of your variables) degrees of freedom. I understand that
> compared to euclidean distances this has the advantage of being scale-invariant.
> However, it seems that such mahalanobis distances are not invariant to
> redundancies: adding a highly collinear variable changes the mahalanobis distances
> (see code below). Isn't also the comparison to chi^2 assuming that all
> variables are independent?

No.  It assumes that *after sphering* all variables are independent.
Sphering makes them uncorrelated, and for a joint normal distribution
uncorrelated components are independent.
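
A small simulation sketch of this, under the assumption of exactly normal
data with a known mean and covariance (the matrix Sigma and the 97.5% cutoff
are arbitrary choices for illustration). Even with strongly correlated
variables, the squared distances follow chi^2 with p degrees of freedom:

```r
set.seed(42)
p <- 3
# Hypothetical covariance matrix with strong correlations
Sigma <- matrix(c(1.0, 0.9, 0.5,
                  0.9, 1.0, 0.6,
                  0.5, 0.6, 1.0), nrow = p)
n <- 10000

# Correlated normal data: independent normals times the Cholesky factor
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)

# Squared distances using the true centre and covariance
d2 <- mahalanobis(X, center = rep(0, p), cov = Sigma)

# Fraction beyond the chi^2_p cutoff; should be close to 0.025
cutoff       <- qchisq(0.975, df = p)
frac_flagged <- mean(d2 > cutoff)
frac_flagged
```

With estimated (and especially robustly estimated) centre and covariance the
chi^2 reference is only approximate, so the cutoff is best treated as a guide.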

> Can anyone recommend a procedure to calculate distances and identify
> multivariate outliers which is invariant to the degree of collinearity?

I don't think that makes any sense, given what is usually meant by
`multivariate outliers', outliers in any direction in the point cloud.

[...]


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
