[R] OT: a weighted rank-based, non-paired test statistic ?

Sat Jun 6 01:44:55 CEST 2009

On Friday 05 June 2009, Thomas Lumley wrote:
> On Fri, 5 Jun 2009, Dylan Beaudette wrote:
> > Is anyone aware of a rank-based, non-paired test such as the
> > Krustal-Wallis, that can accommodate weights?
>
> You don't say what sort of weights, but basically, no.
>
> Whether you have precision weights or sampling weights, the test will no
> longer be distribution-free.
>
> > Alternatively, would it make sense to simulate a dataset by duplicating
> > observations in proportion to their weight, and then using the
> > Krustal-Wallis test?
>
> No.
>
>  	-thomas
>
> Thomas Lumley			Assoc. Professor, Biostatistics
> tlumley at u.washington.edu	University of Washington, Seattle

Thanks,

Perhaps I am thinking about the problem incorrectly. Here are some more 
details. 

I have a series of values, each associated with a polygon that represents an 
area on the ground. A sample is made by collecting the polygons (i.e. the 
values associated with each polygon) and the area fraction of each polygon 
(i.e. the observation's "weight"). In this past I have used this method to 
compute a weighted average, or weighted standard deviation.

However, I now have a case where a subset of individuals from the region are 
selected, and re-assigned an area fraction. I would like to evaluate if the 
subsets are "representative" of the original data-- so far I have been 
comparing weighted means. 

Unfortunately, the data are prone to having highly skewed distributions-- 
possibly making my comparisons of weighted-means invalid. 

I posted a question about this a while back, but just in case here are some 
data that approximately resemble the problem at hand:

# generate some data
# note lack of balance
x.1 <- rnorm(n=100, mean=0, sd=1)
x.2 <- rnorm(n=10, mean=4, sd=3)
x.3 <- rnorm(n=10, mean=0, sd=1.5)

# generate some weights that sum to 1 for each treatment
# c/o this post: 
# https://stat.ethz.ch/pipermail/r-help/2008-July/167044.html
#
w.1 <- diff(c(0,sort(sample(seq(1,99,1),99,replace=T)),100)) / 100
w.2 <- diff(c(0,sort(sample(seq(1,99,1),9,replace=T)),100)) / 100
w.3 <- diff(c(0,sort(sample(seq(1,99,1),9,replace=T)),100)) / 100

# generate treatment labels:
trt <- factor(c(rep('baseline', 100), rep(c('data1','data2'), each=10)))

# combine into dataframe
d <- data.frame(values=c(x.1,x.2,x.3), wts=c(w.1,w.2,w.3), treatment=trt)

# compute means and wt.means
d.means <- sapply(by(d, d$treatment, function(i) mean(i$value)), '[')
d.wt.means <- sapply(by(d, d$treatment, function(i) weighted.mean(i$value, 
w=i$wts)), '[')

# quick check:
boxplot(values~ treatment, data=d, varwidth=TRUE, border=grey(0.5), 
main='symbol size is proportional to observation weight')
points(1:3, d.means, col='green', pch=15, cex=1.2)
points(1:3, d.wt.means, col='blue', pch=15, cex=1.2)
points(jitter(as.numeric(d$treatment)), d$values, col='red', 
cex=sqrt(d$wts*100))
legend('topleft', legend=c('original data','mean','wt.mean'), pch=c(1,15,15), 
col=c('red','green','blue'))

# another way of looking at the data:
library(lattice)
densityplot(~ values, groups=treatment, data=d, auto.key=list(columns=3), 
pch='|')

# comparison of means (without weights):
# effects are equal to treatment means
summary(lm(values ~ treatment, data=d))

# comparison with means
# effects are equal to treatment weighted-means
summary(lm(values ~ treatment, data=d, weights=wts))

Cheers,
Dylan

-- 
Dylan Beaudette
Soil Resource Laboratory
http://casoilresource.lawr.ucdavis.edu/
University of California at Davis
530.754.7341