[R] puzzling classical Mahalanobis distances from covMcd() {robustbase}

David L Carlson dcarlson at tamu.edu
Sat Jul 28 23:44:01 CEST 2012


The values would be better labeled "initial" than "raw", which is how they
are labeled in the source. The Details section of the manual page indicates
that the first step is to find a subset of h observations (a fraction of the
original data between .5 and 1, set by the alpha argument) whose covariance
matrix has the lowest possible determinant. The next paragraph reads:

"The raw MCD estimate of location is then the average of these h points,
whereas the raw MCD estimate of scatter is their covariance matrix,
multiplied by a consistency factor and a finite sample correction factor (to
make it consistent at the normal model and unbiased at small samples)."
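
To make that definition concrete, here is a minimal base-R sketch. The subset
h_idx and the factor cfac are placeholders chosen for illustration (my
assumptions); they are not the subset or the correction factors that covMcd()
would actually compute.

```r
set.seed(42)
x <- matrix(rnorm(10 * 3), ncol = 3)

# Hypothetical subset of h = 7 observations (an assumption for illustration;
# covMcd() searches for the subset with the smallest covariance determinant).
h_idx <- 1:7

# "Raw" location: the plain average of the h points.
center_h <- colMeans(x[h_idx, ])

# "Raw" scatter: covariance of the h points times a scalar correction.
# cfac is a made-up placeholder, not the factor robustbase uses.
cfac  <- 1.5
cov_h <- cfac * cov(x[h_idx, ])

# Squared Mahalanobis distances of all 10 observations from the subset center.
d2_unscaled <- mahalanobis(x, center_h, cov(x[h_idx, ]))
d2_scaled   <- mahalanobis(x, center_h, cov_h)

# A scalar factor on the scatter matrix only rescales the squared distances.
all.equal(d2_scaled, d2_unscaled / cfac)
```

So the correction factors change the size of the distances but not their
ordering; which observations end up in the subset matters far more.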

Following your example:
> set.seed(42)
> x <- matrix(rnorm(10*3), ncol = 3)
> xmeans <- colMeans(x)
> Sx <- cov(x)
> D2rb <- covMcd(x)
> D2rb$raw.weights
 [1] 0 1 1 1 1 1 0 1 0 1  <== Note that the raw weights eliminate obs 1, 7, and 9
> xmeans; D2rb$raw.center 
[1]  0.5472968 -0.1634567 -0.1780795        <== Compare original means 
[1]  0.08172336 -0.03067387 -0.23956925         and "raw" means
> colMeans(x[as.logical(D2rb$raw.weights),]) <== means with 1, 7, and 9 eliminated
[1]  0.08172336 -0.03067387 -0.23956925      <== This matches D2rb$raw.center

So the "raw" values are computed from a subset of h = 7 observations: 2, 3,
4, 5, 6, 8, and 10. Because raw.center and raw.cov are based on that subset
rather than on all 10 observations, the Mahalanobis distances in raw.mah
will not match the classical ones from mahalanobis(x, colMeans(x), cov(x)).
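
One way to check this is to recompute both sets of distances directly. The
sketch below assumes the robustbase package is installed and that raw.mah is
computed from raw.center and raw.cov, as the help file describes. The
component names (raw.center, raw.cov, raw.mah, best) are those listed on the
covMcd() help page; the object names are mine.

```r
library(robustbase)

set.seed(42)
x <- matrix(rnorm(10 * 3), ncol = 3)
mcd <- covMcd(x)

# Classical squared distances: mean and covariance of all 10 observations.
d2_classical <- mahalanobis(x, colMeans(x), cov(x))

# Squared distances from the raw MCD location and scatter.
d2_raw <- mahalanobis(x, mcd$raw.center, mcd$raw.cov)

# Expected to agree with the stored raw distances ...
all.equal(d2_raw, mcd$raw.mah, check.attributes = FALSE)

# ... and to differ from the classical ones.
cbind(classical = d2_classical, raw = d2_raw)

# The raw center should be the mean of the best subset that was found.
mcd$best
all.equal(colMeans(x[mcd$best, ]), mcd$raw.center, check.attributes = FALSE)
```

If the first comparison holds, the mismatch with mahalanobis(x, colMeans(x),
cov(x)) comes entirely from the different center and scatter, not from a
different distance formula.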

----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> project.org] On Behalf Of Fraser D. Neiman
> Sent: Friday, July 27, 2012 7:16 AM
> To: r-help at r-project.org
> Subject: [R] puzzling classical Mahalanobis distances from covMcd()
> {robustbase}
> 
> Greetings,
> 
> I am puzzled about why the _classical_ Mahalanobis distances that I get using
> the {stats} mahalanobis() function do not match the distances I get from the
> {robustbase} covMcd() function. Here is an example:
> 
> x <- matrix(rnorm(10*3), ncol = 3)
> 
> #here is the {stats} result:
> Sx <- cov(x)
> D2 <- mahalanobis(x, colMeans(x), Sx)
> D2
> 
>  [1] 1.5135795 1.3761046 1.0367444 1.8111585 4.3038621 5.3195918 3.2798665 5.7559301
>  [9] 2.2172150 0.3859475
> 
> 
> #here is the {robustbase} result
> library(robustbase)
> D2rb <- covMcd(x)
> D2rb$raw.mah
> 
>  [1] 0.7737193 1.1177445 0.7290794 0.6275703 3.5517622 6.0334350 1.0582663 5.7169250
>  [9] 0.9420184 0.4210470
> 
> According to the help file for covMcd{robustbase}
> 
> raw.mah   mahalanobis distances of the observations based on the raw estimate
> of the location and scatter.
> 
> So I think the second set of numbers should match the first. But they do not.
> What am I missing here?
> 
> Thanks, Fraser
> 


