[R] Oja median

Roger Koenker roger at ysidro.econ.uiuc.edu
Fri Aug 15 17:29:21 CEST 2003


I discovered recently that the phrase "Oja median" produces no hits in
Jonathan Baron's very valuable R search engine. I found this surprising
since I've long regarded this idea as one of the more interesting notions
in the multivariate robustness literature. To begin to remedy this oversight
I wrote a bivariate version and then decided that writing a general p-variate
version might make a nice weekend programming puzzle for R-help.

Here are a few more details:

The Oja median of n  p-variate observations minimizes over theta
in R^p the sum of the volumes of the simplices formed by theta,
and p of the observed points, the sum being taken over all n choose p
groups of p observations.  Thus, in the bivariate case we are minimizing
the sum of the areas of all triangles formed by the the point theta
and pairs of observations.  Here is a simple bivariate implementation:

oja.median <-function(x) {
	#bivariate version -- x is assumed to be an n by 2 matrix
	require(quantreg)
        n <- dim(x)[1]
        A <- matrix(rep(1:n, n), n)
        i <- A[col(A) < row(A)]
        j <- A[n + 1. - col(A) > row(A)]
        xx <- cbind(x[i,  ], x[j,  ])
        y <- xx[, 1] * xx[, 4] - xx[, 2] * xx[, 3]
        z1 <- (xx[, 4] - xx[, 2])
        z2 <-  - (xx[, 3] - xx[, 1])
        return(rq(y~cbind(z1, z2)-1)$coef)
        }

To understand the strategy, note that the area of the triangle formed
by the points x_i = (x_i1,x_i2), x_j = (x_j1,x_j2),
and theta = (theta_1,theta_2) is given by the determinant,

                                    | 1    1     1   |
	Delta(x_i, x_j, theta) = .5 |y_i1 yj1 theta_1|.
	                            |y_i2 yj2 theta_2|

Expanding the determinant in the unknown parameters theta gives
the l1 regression formulation.  Remarkably, a result of Wilks says
that if the call to rq() is replaced with a call to lm() you get
the sample mean -- this gives an impressively inefficient least
squares regression based alternative to apply(x,2,mean)!
It also provides a useful debugging check for proposed algorithms.

Obviously, the expansion of the determinant gives the same formulation
for p>2, the challenge is to find a clean way to generate the
design matrix and response vector for the general setting.

Bon weekend!

url:	www.econ.uiuc.edu/~roger/my.html	Roger Koenker
email	rkoenker at uiuc.edu			Department of Economics
vox: 	217-333-4558				University of Illinois
fax:   	217-244-6678				Champaign, IL 61820




More information about the R-help mailing list