[R] confidence intervals for differences in proportions from complex survey design?

Mon May 11 06:40:19 CEST 2015

All:

I need to generate confidence intervals for differences in proportions using data from a complex survey design. An example follows where I attempt to estimate the difference in depression prevalence by sex.

# Data might look something like this:
Dfr<-data.frame(depression=sample(c("yes","no"), size=30, replace=TRUE),
	sex=sample(c("M","F"), size=30, replace=TRUE),
	cluster=rep(1:10, times=3),
	stratum=rep(1:5, each=2, times=3),
	pweight=runif(n=30, min=1, max=3))
Dfr
library(survey)
msdesign<-svydesign(id=~cluster, strata=~stratum, weights=~pweight, nest=TRUE,
	data=Dfr)
# When searching online, one recommendation was to use svyglm() to generate an
# approximation as follows:
confint(with(Dfr, svyglm(I(depression=="yes")~sex, family=gaussian(link=identity), 
	msdesign)), level=0.95, method="Wald")

This question has been asked before on the listserv (circa 2007) and I contacted the original poster, who indicated that they never received a reply.

Here is the question as described by the original poster:

"I'm trying to get confidence intervals of proportions (sometimes for 
subgroups) estimated from complex survey data. Because a function like 
prop.test() does not exist for the "survey" package I tried the following:

1) Define a survey object (PSU of clustered sample, population weights);
2) Use svyglm() of the package "survey" to estimate a binary logistic 
regression (family='binomial'): For the confidence interval of a single 
proportion regress the binary dependent variable on a constant (1), for 
confidence intervals of that variable for subgroups regress this 
variable on the groups (factor) variable;
3) Use predict() to obtain estimated logits and the respective standard 
errors (mod.dat specifying either the constant or the subgroups):

    pred=predict(model,mod.dat,type='link',se.fit=T)

and apply the following to obtain the proportion with its confidence 
intervals (for example, for conf.level=.95):

    lo.e = pred[1:length(pred)]-qnorm((1+conf.level)/2)*SE(pred)
    hi.e = pred[1:length(pred)]+qnorm((1+conf.level)/2)*SE(pred)
    prop = 1/(1+exp(-pred[1:length(pred)]))
    lo = 1/(1+exp(-lo.e))
    hi = 1/(1+exp(-hi.e))

I think that in that way I get CI's based on asymptotic normality - 
either for a single proportion or split up into subgroups.

Question: Is this a correct or a defensible procedure? Or should I use a 
different approach? Note that this approach should also allow to 
estimate CI's for proportions of subgroups taking into account the 
complex survey design."

Thanks in advance for any help that you can provide.

Tony

------------------------------------------------------------------------------
Tony N. Brown, Ph.D.
Associate Chair and Associate Professor of Sociology
Google Scholar Profile: http://tinyurl.com/lozlht8
LinkedIn Profile: https://www.linkedin.com/pub/tony-nicholas-brown/a6/64/31a