[R] Re: Oja median

Mon Aug 18 17:17:10 CEST 2003

I am shocked and dismayed (and the term hasn't even started yet ;-) )
that none of you have turned in the weekend homework problem that
I assigned last Friday.  At the risk of further embarrassment I am
posting an answer in the hope that this will inspire someone to suggest
some improvements.  In particular you will see that the loop in the
function "cofactors" when subjected to the apply is painfully slow.
Using Rprof on an example with n=50 and p=3 shows that 160 seconds of the
167 needed were spent in this apply.  Yes, I'm aware that this could
be recoded in [C, fortran, ...], but that would be considered cheating.

"mean.wilks" <- function(x){
	# Computes the column means of the matrix x -- very sloooowly.
	n <- dim(x)[1]
        p <- dim(x)[2]
        A <- t(combn(1:n,p))
	X <- NULL
	for(i in 1:p)
		X <- cbind(X,x[A[,i],])
	oja.ize <- function(v)cofactors(matrix(v,sqrt(length(v))))
	A <- t(apply(X,1,oja.ize))
	coef(lm(-A[,1]~A[,-1]-1))
	}
"cofactors" <- function(A){
	B <- rbind(1,cbind(A,1))
	p <- ncol(B)
	x <- rep(0,p)
	for(i in 1:p)
		x[i] <- ((-1)^(i+p)) *det(B[-i,-p])
	return(x)
}

url:	www.econ.uiuc.edu/~roger/my.html	Roger Koenker
email	rkoenker at uiuc.edu			Department of Economics
vox: 	217-333-4558				University of Illinois
fax:   	217-244-6678				Champaign, IL 61820

On Fri, 15 Aug 2003, Roger Koenker wrote:

> I discovered recently that the phrase "Oja median" produces no hits in
> Jonathan Baron's very valuable R search engine. I found this surprising
> since I've long regarded this idea as one of the more interesting notions
> in the multivariate robustness literature. To begin to remedy this oversight
> I wrote a bivariate version and then decided that writing a general p-variate
> version might make a nice weekend programming puzzle for R-help.
>
> Here are a few more details:
>
> The Oja median of n  p-variate observations minimizes over theta
> in R^p the sum of the volumes of the simplices formed by theta,
> and p of the observed points, the sum being taken over all n choose p
> groups of p observations.  Thus, in the bivariate case we are minimizing
> the sum of the areas of all triangles formed by the the point theta
> and pairs of observations.  Here is a simple bivariate implementation:
>
> oja.median <-function(x) {
> 	#bivariate version -- x is assumed to be an n by 2 matrix
> 	require(quantreg)
>         n <- dim(x)[1]
>         A <- matrix(rep(1:n, n), n)
>         i <- A[col(A) < row(A)]
>         j <- A[n + 1. - col(A) > row(A)]
>         xx <- cbind(x[i,  ], x[j,  ])
>         y <- xx[, 1] * xx[, 4] - xx[, 2] * xx[, 3]
>         z1 <- (xx[, 4] - xx[, 2])
>         z2 <-  - (xx[, 3] - xx[, 1])
>         return(rq(y~cbind(z1, z2)-1)$coef)
>         }
>
> To understand the strategy, note that the area of the triangle formed
> by the points x_i = (x_i1,x_i2), x_j = (x_j1,x_j2),
> and theta = (theta_1,theta_2) is given by the determinant,
>
>                                     | 1    1     1   |
> 	Delta(x_i, x_j, theta) = .5 |y_i1 yj1 theta_1|.
> 	                            |y_i2 yj2 theta_2|
>
> Expanding the determinant in the unknown parameters theta gives
> the l1 regression formulation.  Remarkably, a result of Wilks says
> that if the call to rq() is replaced with a call to lm() you get
> the sample mean -- this gives an impressively inefficient least
> squares regression based alternative to apply(x,2,mean)!
> It also provides a useful debugging check for proposed algorithms.
>
> Obviously, the expansion of the determinant gives the same formulation
> for p>2, the challenge is to find a clean way to generate the
> design matrix and response vector for the general setting.
>
> Bon weekend!
>
> url:	www.econ.uiuc.edu/~roger/my.html	Roger Koenker
> email	rkoenker at uiuc.edu			Department of Economics
> vox: 	217-333-4558				University of Illinois
> fax:   	217-244-6678				Champaign, IL 61820
>
>
>