[R] svy / weighted regression

Mon Oct 12 15:12:13 CEST 2009

Dear Peter,

Thanks for the input. The zero rates in some strata occurs because
sampling depended on case status: In Finland only 50% of the non-cases
were sampled, while all others were sampled with 100% probability.

Best
Laust

On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard
<p.dalgaard at biostat.ku.dk> wrote:
> Sorry, forgot to "reply all"...
>
> Laust wrote:
>>
>> Dear list,
>>
>> I am trying to set up a propensity-weighted regression using the
>> survey package. Most of my population is sampled with a sampling
>> probability of one (that is, I have the full population). However, for
>> a subset of the data I have only a 50% sample of the full population.
>> In previous work on the data, I analyzed these data using SAS and
>> STATA. In those packages I used a propensity weight of 1/[sampling
>> probability] in various generalized linear regression-procedures, but
>> I am having trouble setting this up. I bet the solution is simple, but
>> I’m a R newbie. Code to illustrate my problem below.
>
> Hi Laust,
>
> You probably need the package author to explain fully, but as far as I
> can see, the crux is that a dispersion parameter is being used, based on
> Pearson residuals, even in the Poisson case (i.e. you effectively get
> the same result as with quasipoisson()).
>
> I don't know what the rationale is for this, but it is clear that with
> your data, an estimated dispersion parameter is going to be large. E.g.
> the data has both 0 cases in 750000 person-years and 1000 cases in 5000
> person-years for Denmark, and in your model they are supposed to have
> the same Poisson rate.
>
> summary.svyglm starts off with
>
>    est.disp <- TRUE
>
> and AFAICS there is no way it can get set to FALSE.  Knowing Thomas,
> there is probably a perfectly good reason not to just set the dispersion
> to 1, but I don't get it either...
>
>>
>> Thanks
>> Laust
>>
>> # loading survey
>> library(survey)
>>
>> # creating data
>> listc <-
>> c("Denmark","Finland","Norway","Sweden","Denmark","Finland","Norway","Sweden")
>> listw <- c(1,2,1,1,1,1,1,1)
>> listd <- c(0,0,0,0,1000,1000,1000,2000)
>> listt <- c(750000,500000,900000,1900000,5000,5000,5000,10000)
>> list.cwdt <- c(listc, listw, listd, listt)
>> country <-
>> data.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt)
>>
>> # running a frequency weighted regression to get the correct point
>> estimates for comparison
>> glm <- glm(deaths ~ country + offset(log(yrs_at_risk)),
>> weights=weight, data=country, family=poisson())
>> summary(glm)
>> regTermTest(glm, ~ country)
>>
>> # running survey weighted regression
>> svy <- svydesign(~0,,data=country, weight=~weight)
>> svyglm <- svyglm(deaths ~ country + offset(log(yrs_at_risk)),
>> design=svy, data=country, family=poisson())
>> summary(svyglm)
>> # point estimates are correct, but standard error is way too large
>> regTermTest(svyglm, ~ country)
>> # test indicates no country differences
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
> --
>   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
>
>