[R] svy / weighted regression

Thomas Lumley tlumley at u.washington.edu
Tue Oct 13 15:52:51 CEST 2009


I think there is a much simpler explanation.

The survey design object has eight observations, two per country.   With a sample size of two per 
country it is hardly surprising that country-specific estimates are not very precise. The actual data has 
hundreds of thousands  of observations per country, so it will have more precise estimates.

Grouping the data doesn't make a difference for model-based glm estimation, where it is simply a 
computational convenience. It *does* make a difference for design-based estimation, because it 
changes the design.

          -thomas


On Tue, 13 Oct 2009, Laust wrote:

> Dear David,
>
> Thanks again for your input! I realize that I did a bad job of
> explaining this in my first email, but the setup is that in Finland
> persons who die are sampled with a different probability (1) from
> those who live (.5). This was done by the Finnish data protection
> authorities to protect individuals against identification. In the rest
> of the countries everyone is sampled with a probability of 1. The data
> that I am supplying to R is summarized data for each country
> stratified by case status. Another way of organizing the data would
> be:
>
> # creating data
> listc <- c("Denmark","Finland","Norway","Sweden")
> listw <- c(1,2,1,1)
> listd <- c(1000,1000,1000,2000)
> listt <- c(755000,505000,905000,1910000)
> list.cwdt <- c(listc, listw, listd, listt)
> country2 <- data.frame(country=listc,weight=listw,deaths=listd,time=listt)
>
> I hope that it is clearer now that for no value of the independent
> variable 'country' is the rate going to be zero. I think this was also
> not the case in my original example, but this was obscured by my poor
> communication- & R-skills. But if data is organized this way then
> sampling weight of 2 for Finland should only be applied to the
> time-variable that contains person years at risk and *not* to the
> number of deaths, which would complicate matters further. I would know
> how to get this to work in R or in any other statistical package.
> Perhaps it is - as Peter Dalgaard suggested - the estimation of the
> dispersion parameter by the survey package that is causing trouble,
> not the data example eo ipso. Or perhaps I am just using survey in a
> wrong way.
>
> Best
> Laust
>
> ****
> Post doc. Laust Mortensen, PhD
> Epidemiology Unit
> University of Southern Denmark
>
> On Mon, Oct 12, 2009 at 3:32 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>> I think you are missing the point. You have 4 zero death counts associated
>> with much higher person years of exposure followed by 4 death counts in the
>> thousands associated with lower degrees of exposures. It seems unlikely that
>> these are real data as there are not cohorts that would exhibit such lower
>> death-rates. So it appears that in setting up your test case, you have
>> created an impossibly unrealistic test problem.
>>
>> --
>> David
>>
>>
>> On Oct 12, 2009, at 9:12 AM, Laust wrote:
>>
>>> Dear Peter,
>>>
>>> Thanks for the input. The zero rates in some strata occurs because
>>> sampling depended on case status: In Finland only 50% of the non-cases
>>> were sampled, while all others were sampled with 100% probability.
>>>
>>> Best
>>> Laust
>>>
>>> On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard
>>> <p.dalgaard at biostat.ku.dk> wrote:
>>>>
>>>> Sorry, forgot to "reply all"...
>>>>
>>>> Laust wrote:
>>>>>
>>>>> Dear list,
>>>>>
>>>>> I am trying to set up a propensity-weighted regression using the
>>>>> survey package. Most of my population is sampled with a sampling
>>>>> probability of one (that is, I have the full population). However, for
>>>>> a subset of the data I have only a 50% sample of the full population.
>>>>> In previous work on the data, I analyzed these data using SAS and
>>>>> STATA. In those packages I used a propensity weight of 1/[sampling
>>>>> probability] in various generalized linear regression-procedures, but
>>>>> I am having trouble setting this up. I bet the solution is simple, but
>>>>> I’m a R newbie. Code to illustrate my problem below.
>>>>
>>>> Hi Laust,
>>>>
>>>> You probably need the package author to explain fully, but as far as I
>>>> can see, the crux is that a dispersion parameter is being used, based on
>>>> Pearson residuals, even in the Poisson case (i.e. you effectively get
>>>> the same result as with quasipoisson()).
>>>>
>>>> I don't know what the rationale is for this, but it is clear that with
>>>> your data, an estimated dispersion parameter is going to be large. E.g.
>>>> the data has both 0 cases in 750000 person-years and 1000 cases in 5000
>>>> person-years for Denmark, and in your model they are supposed to have
>>>> the same Poisson rate.
>>>>
>>>> summary.svyglm starts off with
>>>>
>>>>   est.disp <- TRUE
>>>>
>>>> and AFAICS there is no way it can get set to FALSE.  Knowing Thomas,
>>>> there is probably a perfectly good reason not to just set the dispersion
>>>> to 1, but I don't get it either...
>>>>
>>>>>
>>>>> Thanks
>>>>> Laust
>>>>>
>>>>> # loading survey
>>>>> library(survey)
>>>>>
>>>>> # creating data
>>>>> listc <-
>>>>>
>>>>> c("Denmark","Finland","Norway","Sweden","Denmark","Finland","Norway","Sweden")
>>>>> listw <- c(1,2,1,1,1,1,1,1)
>>>>> listd <- c(0,0,0,0,1000,1000,1000,2000)
>>>>> listt <- c(750000,500000,900000,1900000,5000,5000,5000,10000)
>>>>> list.cwdt <- c(listc, listw, listd, listt)
>>>>> country <-
>>>>> data.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt)
>>>>>
>>>>> # running a frequency weighted regression to get the correct point
>>>>> estimates for comparison
>>>>> glm <- glm(deaths ~ country + offset(log(yrs_at_risk)),
>>>>> weights=weight, data=country, family=poisson())
>>>>> summary(glm)
>>>>> regTermTest(glm, ~ country)
>>>>>
>>>>> # running survey weighted regression
>>>>> svy <- svydesign(~0,,data=country, weight=~weight)
>>>>> svyglm <- svyglm(deaths ~ country + offset(log(yrs_at_risk)),
>>>>> design=svy, data=country, family=poisson())
>>>>> summary(svyglm)
>>>>> # point estimates are correct, but standard error is way too large
>>>>> regTermTest(svyglm, ~ country)
>>>>> # test indicates no country differences
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>> --
>>>>  O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>>>>  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>>>>  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
>>>> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
>>>>
>>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> David Winsemius, MD
>> Heritage Laboratories
>> West Hartford, CT
>>
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle




More information about the R-help mailing list