[R] Summary: Faster way for weighted matching
Frank E Harrell Jr
fharrell at virginia.edu
Thu Jan 16 16:08:06 CET 2003
I received some great ideas (see below) from a number of people to whom I am grateful. Here is the code I put together from many of their suggestions:
lx <- length(x)
lw <- length(w)
# z[i, j] = |x[j] - w[i]|, built without outer() by unrolling the recycling
z <- matrix(abs(rep(x, lw) - rep(w, each = lx)),
            nrow = lw, ncol = lx, byrow = TRUE)
# row-wise maximum of |x - w[i]|, attained at either min(x) or max(x)
s <- pmax(abs(w - min(x)), abs(w - max(x)))
# tricube weights; rep(s, ...) recycles s down the columns, scaling each row by s[i]
z <- (1 - (z/rep(s, length = lx*lw))^3)^3
# normalize each row to sum to 1
sums <- rowSums(z)
z <- z/rep(sums, length = lx*lw)
This improved the speed by roughly 25% over the original code. Unfortunately I realized (once again) that any solution of this type based on outer() or anything else that generates an lx*lw matrix does not scale well to large problems because of memory usage. So I'm probably going to resort to a Ratfor/Fortran solution to this problem.
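One way to stay in R while keeping memory bounded would be to process w in blocks, computing the tricube probabilities for a slice of rows and drawing the matched subscripts immediately, so that the full lw x lx matrix never exists at once. A rough, untested sketch (pmm.match.blocked and block.size are illustrative names only, not code from this thread):

pmm.match.blocked <- function(x, w, block.size = 1000) {
  # scaling factor for each w[i]: the larger of its distances to min(x) and max(x)
  s <- pmax(abs(w - min(x)), abs(w - max(x)))
  matches <- integer(length(w))
  for (b in seq(1, length(w), by = block.size)) {
    idx <- b:min(b + block.size - 1, length(w))
    zb  <- abs(outer(w[idx], x, "-"))      # only a block.size-by-lx slice at a time
    zb  <- (1 - (zb / s[idx])^3)^3         # tricube weights; s[idx] recycles by row
    zb  <- zb / rowSums(zb)                # each row sums to 1
    # weighted multinomial draw of one donor subscript per target w[i]
    matches[idx] <- apply(zb, 1, function(p) sample(length(x), 1, prob = p))
  }
  matches
}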
Again thanks to the people mentioned below.
------------------------
Tim Hesterberg <timh at insightful.com>:
>z <- abs(outer(w, x, "-"))
>s <- apply(z, 1, max)
>z <- (1 - sweep(z, 1, s, FUN='/')^3)^3
Instead of sweep, use
z / rep(s, ncol(z))
>sums <- apply(z, 1, sum)
rowSums(z)
>z <- sweep(z, 1, sums, FUN='/')
z / rep(sums, ncol(z))
Nick.Ellis at csiro.au:
The sweeps could be replaced by simple division using the recycling rule,
but I haven't tested whether it is significantly faster.
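Spelled out, the recycling-rule version of the two sweeps would be something like this (untested, not from Nick's message):

z <- abs(outer(w, x, "-"))
s <- apply(z, 1, max)     # one scale per row, i.e. per element of w
z <- (1 - (z / s)^3)^3    # s (length nrow(z)) recycles down the columns, dividing each row by its s
z <- z / rowSums(z)       # the same rule normalizes each row to sum to 1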
Charles Berry <cberry at tajo.ucsd.edu>:
1) calculate the transpose of z. If z really is big, that should
reduce caching in subsequent calcs.
2) unroll the code in 'outer' and 'sweep', e.g. for 'outer':
l.w <- length(w)
l.x <- length(x)
z <- abs( rep( x , l.w ) - rep( w, each = l.x ) )
dim( z ) <- c( l.x, l.w )
3) find s as follows:
s <- pmax( abs( w - min(x) ), abs( w - max(x) ) )
4) use colSums() instead of apply(z, 2, sum) (with z transposed as in 1, the row sums become column sums)
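Put together, points 1-4 give roughly the following (assembled here for illustration, untested):

l.w <- length(w)
l.x <- length(x)
# 1) + 2): transposed z (rows index x, columns index w), built without outer()
z <- abs(rep(x, l.w) - rep(w, each = l.x))
dim(z) <- c(l.x, l.w)
# 3) per-column scale: the farther of min(x), max(x) from each w[j]
s <- pmax(abs(w - min(x)), abs(w - max(x)))
z <- (1 - (z / rep(s, each = l.x))^3)^3   # divides column j by s[j]
# 4) column sums replace apply(, , sum) on the transposed matrix
sums <- colSums(z)
z <- z / rep(sums, each = l.x)            # each column now sums to 1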
Thomas W Blackwell <tblackw at umich.edu>:
Matrix z is potentially enormous. On large matrices, it's
a familiar truism that as.vector(z %*% rep(1, dim(z)[2]))
is MUCH faster than apply(z, 1, sum).
The equivalent "inline" code for "max" is not so nice.
So here's an alternative version (untested):
z <- abs(outer(w, x, "-"))
wid <- seq(along=w)
top <- 1 + round(max(z), 0)   # we know z is nonnegative by construction, so z/top < 1
# row maxima without apply(): adding the row index to z/top lets sort.list() order
# the elements by row, then by size within each row; the last entry of each row's
# block (position wid * ncol) indexes that row's maximum
s <- z[ sort.list((z / top) + matrix(wid, nrow=dim(z)[1], ncol=dim(z)[2], byrow=F))[wid * dim(z)[2]] ]
z <- (1 - (z / matrix(s, nrow=dim(z)[1], ncol=dim(z)[2], byrow=F))^3)^3
sums <- as.vector(z %*% rep(1, dim(z)[2]))   # row sums via matrix multiplication
z <- z / matrix(sums, nrow=dim(z)[1], ncol=dim(z)[2], byrow=F)
Excruciatingly ugly, but it works. Depends how much you're
willing to sacrifice transparent code in favor of speed.
(And, clearly I could save eight calls to dim(z) by defining
a couple of temporary variables.)
But surely, you know these hacks from way back. [Not necessary Thomas! -Frank]
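As a sanity check (not part of Thomas's message), the alternative can be compared with the original sweep-based code on the small example from the post below; the two probability matrices should agree:

w <- c(1, 2, 3, 7)
x <- c(0, 1.5, 3)
# reference result from the original sweep-based code
z.ref <- abs(outer(w, x, "-"))
z.ref <- (1 - sweep(z.ref, 1, apply(z.ref, 1, max), FUN = '/')^3)^3
z.ref <- sweep(z.ref, 1, apply(z.ref, 1, sum), FUN = '/')
# after running Thomas's block above on the same w and x, z should match:
all.equal(z, z.ref)   # expected TRUE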
Original post:
> For each element in w I want to find a good match (subscript number) of an element in x. x and w can be long. Instead of just finding the closest match I want to use weighted multinomial sampling (which I've already figured out once I have the probabilities) where the probabilities come from the tricube function of absolute differences between donor and target values, but normalized to sum to one, and using the maximum absolute difference as the scaling factor. This is similar to the loess weighting function with f=1. Here's code that works, to get the probability matrix to use for sampling:
>
> z <- abs(outer(w, x, "-"))
> s <- apply(z, 1, max)
> z <- (1 - sweep(z, 1, s, FUN='/')^3)^3
> sums <- apply(z, 1, sum)
> z <- sweep(z, 1, sums, FUN='/')
>
> Example:
> w <- c(1,2,3,7)
> x <- c(0,1.5,3)
> z <- abs(outer(w,x,"-"))
> > z
> [,1] [,2] [,3]
> [1,] 1 0.5 2
> [2,] 2 0.5 1
> [3,] 3 1.5 0
> [4,] 7 5.5 4
> s <- apply(z, 1, max)
> z <- (1 - sweep(z, 1, s, FUN='/')^3)^3
> z
> [1,] 0.6699219 0.9538536 0.0000000
> [2,] 0.0000000 0.9538536 0.6699219
> [3,] 0.0000000 0.6699219 1.0000000
> [4,] 0.0000000 0.1365445 0.5381833
> sums <- apply(z, 1, sum)
> z <- sweep(z, 1, sums, FUN='/')
> z # each row represents multinomial probabilities summing to 1
> [1,] 0.4125705 0.5874295 0.0000000
> [2,] 0.0000000 0.5874295 0.4125705
> [3,] 0.0000000 0.4011696 0.5988304
> [4,] 0.0000000 0.2023697 0.7976303
>
>
> The code is moderately fast. Does anyone know of a significantly faster method or have any comments on the choice of weighting function for such sampling? This will be used in the context of predictive mean matching for multiple imputation. Thanks - Frank
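For completeness, the sampling step mentioned above as already worked out could look like the following for the small example (an illustration only, not code from the thread):

# one weighted multinomial draw per row of the probability matrix z above
set.seed(1)   # arbitrary seed, only to make the illustration reproducible
matches <- apply(z, 1, function(p) sample(length(x), 1, prob = p))
matches       # one donor subscript of x (1, 2, or 3) for each of the 4 elements of w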
--
Frank E Harrell Jr Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine http://hesweb1.med.virginia.edu/biostat