[R] large survey data set

Thomas Lumley tlumley at u.washington.edu
Thu Jun 27 22:11:47 CEST 2002

On Thu, 27 Jun 2002, Andrew Perrin wrote:

> The lm function (for linear modelling aka linear regression) includes
> case weights with a simple syntax:
> foo<-lm(dependent ~ indep + indep + ... ,
> 	data = <data object>,
> 	weights = <weight variable>)

Yes, but that isn't what he means by weights...

The standard regression weights are variance weights: an observation with
a weight of 2 has half the variance of an observation with a weight of 1.
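
One way to see the variance-weight interpretation: a row that is the mean
of n raw observations has 1/n of the variance of a single observation, so
it should get weight n. The sketch below is only an illustration; the
group sizes, covariate values and simulated data are all made up. It
checks that lm() on the group means with weights = n gives the same
coefficients as an unweighted lm() on the raw rows.

```r
# Variance weights: a row that is the mean of n raw observations has
# variance sigma^2 / n, so it gets weight n.
set.seed(1)                          # arbitrary seed, for reproducibility only
n_per_group <- c(1, 2, 4, 8, 3, 5)   # hypothetical group sizes
x_group     <- c(0, 1, 2, 3, 4, 5)   # one covariate value per group

# Raw data: each group's x repeated n times, y with constant variance
raw <- data.frame(g = rep(seq_along(n_per_group), n_per_group),
                  x = rep(x_group, n_per_group))
raw$y <- 1 + 0.5 * raw$x + rnorm(nrow(raw))

# Collapsed data: one row per group holding the group mean of y
means <- data.frame(x = x_group,
                    y = as.vector(tapply(raw$y, raw$g, mean)),
                    n = n_per_group)

fit_raw   <- lm(y ~ x, data = raw)                  # unweighted, raw rows
fit_means <- lm(y ~ x, data = means, weights = n)   # variance weights

# The two fits give the same coefficients (up to rounding error)
print(all.equal(coef(fit_raw), coef(fit_means)))
```

The standard errors of the two fits are not the same, since the collapsed
data have fewer residual degrees of freedom.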

In survey sampling (and in related missing data and causal inference
models) you need probability weights: an observation with a weight of 2
had half the chance of being sampled that an observation with a weight
of 1 had.  You get the same regression coefficients (more or less) but
quite different standard errors.

The `model-robust' sandwich variance estimators give about the right
standard errors (as long as the sampling fraction is small). These are
built in to the survival models, but not in most other software. They are
pretty easy to calculate, but a 20% sample is not a small sampling
fraction, so they probably aren't going to work well here.
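
As a rough sketch of "pretty easy to calculate", the following computes a
basic sandwich variance by hand for a weighted lm() fit, using only base
R. The data, the sampling probabilities p and all variable names are
invented for illustration. This is the plain (X'WX)^-1 [sum of w_i^2 e_i^2
x_i x_i'] (X'WX)^-1 form with no small-sample or finite-population
correction, so it assumes a small sampling fraction and ignores any
stratification or clustering in the design.

```r
set.seed(2)                    # arbitrary seed, for reproducibility only
N <- 200
x <- rnorm(N)
p <- runif(N, 0.05, 0.5)       # hypothetical sampling probabilities
w <- 1 / p                     # probability weight = 1 / sampling probability
y <- 1 + 2 * x + rnorm(N)

# lm() treats w as variance weights; the point estimates are still the
# ones we want, but its reported standard errors are model-based.
fit <- lm(y ~ x, weights = w)

X <- model.matrix(fit)
e <- residuals(fit)            # unweighted residuals, y - X %*% beta

# Same point estimates from the weighted normal equations
beta <- solve(crossprod(X, w * X), crossprod(X, w * y))

# Sandwich: bread %*% meat %*% bread
bread  <- solve(crossprod(X, w * X))   # (X' W X)^{-1}
scores <- X * (w * e)                  # row i is w_i * e_i * x_i
meat   <- crossprod(scores)            # sum of outer products of the scores
V_sandwich <- bread %*% meat %*% bread

se_model    <- sqrt(diag(vcov(fit)))   # what summary(fit) reports
se_sandwich <- sqrt(diag(V_sandwich))  # model-robust version
print(cbind(estimate = coef(fit), se_model, se_sandwich))
```

The estimate column is identical under either interpretation of the
weights; only the two standard error columns differ.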

