[R] Comparing spatial distributions - permutation test implementation
JiHO
jo.lists at gmail.com
Wed May 20 17:34:24 CEST 2009
Hello everyone,
I am looking at the joint spatial distribution of 2 kinds of organisms
(estimated on a grid of points) and want to test for significant
association or dissociation.
My first question is: do you know a nice technique to do that,
considering that I have a limited number of points (36) but that they
are repeated (4 times)? I did GLMs to test for correlations between
the two (hence forgetting about the spatial aspect of it) and was
previously pointed to the SADIE software. Would there be anything
explicitly spatial and available in R please?
Then, Syrjala's test[1] seems appropriate and tests for differences in
distribution. It computes a Cramér-von Mises-type statistic and tests
its significance with a permutation test.
I implemented the test in R and posted the code on these mailing
lists[2]. Some people checked it and confirmed that the statistic
gives correct results but my estimation of the p-value does not match
the one predicted with the orignal software from Syrjala. I don't know
what I am doing wrong. The permutation test is described by Syrjala as:
(...) Under the null hypothesis,
at a given sampling location (x_k, y_k), either density ob-
servation y_i(x_k, y_k), i = 1, 2, is equally likely for each
population. Thus, for a given data set, the distribution
of the test statistic can be constructed by calculating
the value of the test statistic for all 2^k pairwise per-
mutations of the data set. (...) The level of signif-
icance of a specific realization of the test statistic T is
determined from its position in the ordered set of test
statistic values from all 2^k permutations. (...)
My understanding is that, for each permutation I should choose a
random number of points (between 1 and k), swap the values for species
1 and species 2 at those points, and recompute the test on the new
data. But this does not work :/ . Here is my code and associated data
from Syrjala (for which I have reference values). Any advice would be
very welcome (in particular if there is a way to leverage boot() for
this).
NB: computing the 1000 permutations can be a bit lengthy, but
fortunately, by using plyr, you get a nice progress bar to look at!
syrjala.stat <- function(x, y=NULL, var1=NULL, var2=NULL)
#
# Compute Syrjala statistic
# x, y coordinates
# var1, var2 value of 2 parameters both measured at (x,y) points
# NB: x can also be a data.frame/matrix containing x,y,var1,var2 as
columns
#
{
# Input checks
if (!is.null(ncol(x))) {
if (ncol(x) == 4) {
names(x) = c("x","y","var1","var2")
dat = x
} else {
stop("Wrong number of columns in argument x")
}
} else {
dat = data.frame(x, y, var1, var2)
}
# Normalize abundances
dat$var1 = dat$var1/sum(dat$var1)
dat$var2 = dat$var2/sum(dat$var2)
# For each point (each line of dat)
# compute the squared difference in gammas from each origin
meanSqDiff = apply(dat, 1, function(d, coord, variab) {
north = (coord$x>=d[1])
east = (coord$y>=d[2])
south = (coord$x<=d[1])
west = (coord$y<=d[2])
return( mean( c(
(diff(sapply(variab[(north & east),], sum)))^2,
(diff(sapply(variab[(south & east),], sum)))^2,
(diff(sapply(variab[(south & west),], sum)))^2,
(diff(sapply(variab[(north & west),], sum)))^2
) )
)
}, dat[,c("x","y")], dat[,c("var1","var2")])
# Compute the statistic (i.e. sum of mean squared differences)
return(sum(meanSqDiff))
}
# Get data online : http://dl.getdropbox.com/u/1047321/syrjala_data_cod.csv
system("curl http://dl.getdropbox.com/u/1047321/syrjala_data_cod.csv >
syrjala_data_cod.csv")
dataCod = read.csv(file = "syrjala_data_cod.csv", header = TRUE)
# Normalize abundances
dataCod$var1 = dataCod$var1/sum(dataCod$var1)
dataCod$var2 = dataCod$var2/sum(dataCod$var2)
# Number of permutations
nperm = 1000
# Create nperm-1 replicates of the data (one is the original
observation)
d = rep(list(dataCod), nperm-1)
# Compute number of observations before to avoid doing that for every
replicate
n = nrow(dataCod)
require(plyr)
# Permute some observations and compute the syrjala stat for each
permutation
psis = ldply(d, .fun=function(x, n){
# choose indices of observations to swap
idx = sample(1:n, runif(1, min=1, max=n))
# swap observations
x[idx, 3:4] = x[idx, 4:3]
# compute syrjala stat
return(syrjala.stat(x))
}, n, .progress="text")
}
# Compute the syrjala stat for the observations
psi = syrjala.stat(dataCod)
# Estimate the pvalue
pvalue = (sum(psis>=psi)+1)/nperm
psi
pvalue
# Should be:
# statistic = 0.224
# p-value = 0.1900
Thank you very much in advance. Sincerely,
JiHO
---
http://jo.irisson.free.fr/
[1] A statistical test for a difference between the spatial
distributions of two populations. Syrjala SE. Ecology. 1996;77(1):75–80.
http://dl.getdropbox.com/u/1047321/Syrjala1996.pdf
[2] https://stat.ethz.ch/pipermail/r-sig-geo/2008-February/
thread.html#3137
More information about the R-help
mailing list