[R] svyglm in the survey package: large regression models with replicate weights give infinite standard errors
Adam C Sales
asales at utexas.edu
Wed Nov 2 21:00:54 CET 2016
I'm running some regressions on data from the American Community
Survey, with replicate weights for variance estimation, using the
svyglm function from the survey package.
For my larger regressions, all of the standard errors were reported as Inf.
I did a bit of digging through the code, and figure out why that was:
the svrepglm function (see here:
https://github.com/cran/survey/blob/9aea3247ea1966667aba3efeeaf4a574e819f924/R/surveyrep.R)
calculates standard errors using the formula I would expect, but then
adjusts the "degrees of freedom" for the fit with the following line:
full$df.residual<-degf(design)+1-length(coef(full)[!is.na(coef(full))])
(where 'full' is the name of the model and 'design' is the specified
survey design)
It turns out that degf(design) returns the rank of the weight matrix,
which, in the case of the ACS at least, is the number of replicate
weights, 80. If there are more than 80 columns in the regression
design matrix, the df.residual entry in the model is negative, which
causes the summary.glm() function to report infinite standard errors.
My question is this:
Is this a bug in the survey package, or are the standard errors really
unidentified for large regression models?
Or am I using the wrong settings?
If it's just a bug fixing it is easy enough, and that's what I've been
doing, but I'd like to make sure I'm not glossing over a serious
statistical limitation of the data.
Here's a small working example. It's artificial in that I chose an
arbitrary regression that had enough regressors. It takes a long time
to run.
### step 1: download from
http://www2.census.gov/programs-surveys/acs/data/pums/2014/1-Year/csv_pus.zip
### step 2: unzip, put file ss14pusa.csv in working directory
library(survey)
pums <- read.csv('ss14pusa.csv')
## some of the replication weights are negative. you might want to set
them to zero:
weightcols <- grep('pwgtp[1-9]',names(pums))
pums[,weightcols] <- apply(pums[,weightcols],2,function(x) ifelse(x<0,0,x))
## there's some lack of clarity online on how to set these things up.
Either way you get infinite standard errors
## Anthony Damico recommends (see:
https://stat.ethz.ch/pipermail/r-help/2014-September/422133.html):
options( "survey.replicates.mse" = TRUE )
des <- svrepdesign(data=pums,weights=~PWGTP,repweights="pwgtp[1-9]",scale=4/80,rscales=rep(1,80))
## However, I think the following is correct:
des <- svrepdesign(data=pums,weights=~PWGTP,repweights="pwgtp[1-9]",scale=4/80,rscales=rep(1,80),type='Fay',rho=0.5)
## Anyway, here's the problem:
summary(mod <- svyglm(AGEP~as.factor(ST)+as.factor(CIT)+as.factor(MAR)+as.factor(RELP)+as.factor(HISP)+as.factor(RAC2P),design=des))
--
UT College of Education
SMARTER Consulting
More information about the R-help
mailing list