[R] glm(weights) and standard errors

Steve Taylor steve.taylor at aut.ac.nz
Mon May 28 23:55:00 CEST 2012


Thanks Peter for your clarifications.
 
Yes, the definition I'm looking for is: 
 -  I have 0.1 observations identical to this one,
i.e. this row and nine others similar (but not identical) to it together represent a single observation.
 
> in lm/glm ... the weights are really only relative
This is the problem I would like to get around.
 
> do we get the extra variability of the variance right?
The Wood et al paper suggests modifications to the weights to adjust for the varying amount of missingness in covariates.
 
I know Thomas (we're both in Auckland) so I'll ask him about the survey package.

-----Original Message-----
From: peter dalgaard [mailto:pdalgd at gmail.com] 
Sent: Friday, 25 May 2012 9:37p
To: ilai
Cc: Steve Taylor; r-help at r-project.org
Subject: Re: [R] glm(weights) and standard errors

Weighting can be confusing: there are three standard forms of weighting that you need to be careful not to mix up, and I suspect that the imputation weights are really a fourth.

First, there is case (replication) vs. precision weighting. A weight of 10 means one of

- I have 10 observations identical to this one
- This observation has a variance of sigma^2/10 as if it were the average of 10 observations.

There are also sampling weights:

- For each observation like this, I have 10 similar observations in the population (and I want to estimate a population parameter like the national average income or the percentage of votes at a hypothetical general election). 

What R uses in lm/glm is precision weights. Notice that when the variance is estimated from the data, the weights are really only relative: if all observations are weighted equally (say, all with weight 10), you get a tenfold increase in the estimated sigma^2 and a tenfold decrease in the unscaled variance-covariance matrix. So the net result is that the standard errors are unchanged (but they won't be if the weights are unequal).
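A quick sketch of this point in R, using the sleep data from later in the thread (my code, not from the original messages):

```r
# Sketch: in a gaussian glm, scaling all weights by a common constant
# leaves the standard errors unchanged, because the estimated dispersion
# and the unscaled variance-covariance matrix move in opposite directions.
data(sleep)
m1 <- glm(extra ~ group, data = sleep)
m2 <- glm(extra ~ group, data = sleep, weights = rep(10, nrow(sleep)))
se1 <- coef(summary(m1))[, "Std. Error"]
se2 <- coef(summary(m2))[, "Std. Error"]
all.equal(se1, se2)  # TRUE: equal weights change nothing
```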

The three weighting schemes share the same formula for the estimates, but differ both in the estimated variance and df, and in the formula for the standard errors. 

Sampling weights are the domain of the survey package, but I don't think it does replication weights (someone called Thomas may chime in and educate me otherwise). I'm not quite sure, but I think you can get from a precision-weighted analysis to a case-weighted one just by adjusting the DF for error (changing the residual df to df+sum(w)-n, and sigma^2 proportionally).
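One way to sketch that df adjustment in R (my reading of the suggestion above, so treat it as an assumption rather than an established recipe): fit with precision weights, then move the residual df from n - p to sum(w) - p and rescale sigma^2, hence the standard errors, accordingly.

```r
# Sketch of the precision-to-case-weight adjustment: with case weights,
# a weight-10 row stands for 10 real observations, so the residual df
# should be sum(w) - p rather than n - p.
data(sleep)
w   <- rep(10, nrow(sleep))
fit <- glm(extra ~ group, data = sleep, weights = w)
df_prec <- fit$df.residual                 # n - p = 18
df_case <- df_prec + sum(w) - nrow(sleep)  # 18 + 200 - 20 = 198
# sigma^2 scales by df_prec/df_case, so the SEs scale by its square root
se_case <- coef(summary(fit))[, "Std. Error"] * sqrt(df_prec / df_case)
# agrees with literally replicating each row 10 times:
fit_rep <- glm(extra ~ group, data = sleep[rep(1:nrow(sleep), 10), ])
all.equal(se_case, coef(summary(fit_rep))[, "Std. Error"])  # TRUE
```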

Imputation weights look like the opposite of case weights: You give 10 observations when in fact you have only one. An educated guess would be that you could do something similar as for case weights -- in this case sum(w) will be much less than n, so you will decrease the residual rather than increase it. I get this nagging feeling that it might still not be quite right, though -- in the cases where the imputations actually differ, do we get the extra variability of the variance right? Or maybe we don't need to care. There is a literature on the subject....
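The same adjustment run in the opposite direction, for the stacked-imputation setup from the original question. Here the m copies are identical, so the adjusted SEs recover the single-copy fit exactly; with real (differing) imputations this would at best be approximate, per the caveat above:

```r
# Sketch: stacked "imputation" data with weight 1/m per row.
# sum(w) is now much less than n, so the df adjustment shrinks the
# residual df instead of inflating it.
data(sleep)
m <- 10
sleep10 <- sleep[rep(1:nrow(sleep), m), ]
fit <- glm(extra ~ group, data = sleep10,
           weights = rep(1/m, nrow(sleep10)))
# sum(w) = nrow(sleep10)/m, so: 198 + 20 - 200 = 18
df_adj <- fit$df.residual + nrow(sleep10)/m - nrow(sleep10)
se_adj <- coef(summary(fit))[, "Std. Error"] * sqrt(fit$df.residual / df_adj)
se_naive <- coef(summary(glm(extra ~ group, data = sleep)))[, "Std. Error"]
all.equal(se_adj, se_naive)  # TRUE here, since the copies are identical
```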

On May 25, 2012, at 09:21 , ilai wrote:

> I'm confused (I bet David is too). First and last models are "the
> same", what do SEs have to do with anything?
> 
> naive <- glm(extra ~ group, data=sleep)
> imputWrong <- glm(extra ~ group, data=sleep10)
> imput <- glm(extra ~ group, data=sleep10,weights=rep(0.1,nrow(sleep10)))
> lapply(list(naive,imputWrong,imput),anova)
> sapply(list(naive,imputWrong,imput),function(x) vcov(x)[1,1]/vcov(x)[2,2])
> # or another way to see it  (adjust for the DF)
> coef(summary(naive))[2,2] - sqrt(198)/sqrt(18) * coef(summary(imput))[2,2]
> coef(summary(naive))[2,2] - sqrt(198)/sqrt(18) * coef(summary(imputWrong))[2,2]
> 
> Are you sure you are interpreting Wood et al. correctly? (I haven't
> read it, this is not rhetorical)
> 
> On Wed, May 23, 2012 at 7:49 PM, Steve Taylor <steve.taylor at aut.ac.nz> wrote:
>> Re:
>> coef(summary(glm(extra ~ group, data=sleep[ rep(1:nrow(sleep), 10L), ] )))
>> 
>> Your (corrected) suggestion is the same as one of mine, and doesn't do what I'm looking for.
>> 
>> 
>> -----Original Message-----
>> From: David Winsemius [mailto:dwinsemius at comcast.net]
>> Sent: Tuesday, 22 May 2012 3:37p
>> To: Steve Taylor
>> Cc: r-help at r-project.org
>> Subject: Re: [R] glm(weights) and standard errors
>> 
>> 
>> On May 21, 2012, at 10:58 PM, Steve Taylor wrote:
>> 
>>> Is there a way to tell glm() that rows in the data represent a certain
>>> number of observations other than one?  Perhaps even fractional
>>> values?
>>> 
>>> Using the weights argument has no effect on the standard errors.
>>> Compare the following; is there a way to get the first and last models
>>> to produce the same results?
>>> 
>>> data(sleep)
>>> coef(summary(glm(extra ~ group, data=sleep)))
>>> coef(summary(glm(extra ~ group, data=sleep,
>>>                  weights=rep(10L,nrow(sleep)))))
>> 
>> Here's a reasonably simple way to do it:
>> 
>> coef(summary(glm(extra ~ group, data=sleep[ rep(10L,nrow(sleep)), ] )))
>> 
>> 
>> --
>> David.
>> 
>>> sleep10 = sleep[rep(1:nrow(sleep),10),]
>>> coef(summary(glm(extra ~ group, data=sleep10)))
>>> coef(summary(glm(extra ~ group, data=sleep10,
>>>                  weights=rep(0.1,nrow(sleep10)))))
>>> 
>>> My reason for asking is so that I can fit a model to a stacked
>>> multiple imputation data set, as suggested by:
>>> 
>>> Wood, A. M., White, I. R. and Royston, P. (2008), How should variable
>>> selection be performed with multiply imputed data?.
>>> Statist. Med., 27: 3227-3246. doi: 10.1002/sim.3177
>>> 
>>> Other suggestions would be most welcome.
>>> 
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 

-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


