[R] Design-consistent variance estimate

Fri Aug 15 22:55:24 CEST 2008

On 8/15/08, Doran, Harold <HDoran at air.org> wrote:
>  1) In this linearization, I do treat N (population) size as a known
> constant. I thought that is what svymean() and SAS proc surveymeans did as
> well. So, this is a simple univariate expansion since I only take the
> derivative w.r.t to Y, the population total.

The sample size is the issue, and the sample size is the random
variable. In many surveys, the population size may not be known at
all. All that you will have is an estimator of that population size,
which is total[1] in my quasi-notation. Procedurally, that estimator
is usually the sum of the sampling weights, and each weight is usually
the inverse of the probability of selection.
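
To make the "sum of weights" point concrete, here is a minimal sketch
in Python. The function name and the numbers are made up for
illustration: 3 of 6 schools are sampled with equal probability, every
student in a sampled school is taken, and 6 students end up in the
sample.

```python
def estimate_population_size(selection_probs):
    """Estimate N as the sum of inverse-probability weights.

    selection_probs holds one selection probability per sampled unit.
    """
    weights = [1.0 / p for p in selection_probs]
    return sum(weights)


# Hypothetical sample: 6 students, each selected with probability 3/6
# (their school was one of 3 drawn out of 6, and all students were taken).
probs = [0.5] * 6
n_hat = estimate_population_size(probs)
print(n_hat)
```

The estimate changes whenever a different set of schools with
different enrolments is drawn, which is why it has to be treated as a
random variable and not as a known constant.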

>  2) Yes, the cluster sizes do vary. I meant to mention this. But, I wasn't
> sure if this was an issue or not. You can see in my first example I add in
> the comment that the data are balanced. That is because I created a second
> example (but didn't include it in this email) where I created an unbalanced
> data set where the cluster sizes vary. But, my code and svymeans() gave the
> exact same output when I ran it on the unbalanced cases as well.

OK, I did not check the details, but that is strange. With unbalanced
panels, off the top of my head, the linearization estimator looks like
[STUFF] sum_j n_j (\bar Y_{\cdot j} - \bar Y_{\cdot \cdot})^2 where
[STUFF] will do the proper scaling, something like 1/N^2. That's not
your formula. It might coincide with what you've been using in some
pretty special case (like constant within-cluster variance, which is
probably what you assumed in your simulated data). Again, look up Korn
and Graubard's book; they have a good discussion of this estimator.

>  3) There are no weights with these data. The data I am working with are
> test scores from a state. Students are clustered within schools. Entire
> schools were chosen to participate in the assessment.

So let me restate that: you have complete schools that were sampled
randomly, right? That's a pretty rare form of design. That's actually
just a one-stage cluster design, which I thought only existed in
textbooks! You do have to take that into account, and that also
addresses the next issue:

>  4) I was thinking the finite population correction would not be needed in
> this case, but maybe I am wrong. But if I did add in the finite population
> correction, that would affect the variance of the total and I would get a
> different estimate than what svymeans or SAS proc means gives and that
> doesn't occur. As it stands now, my code, and the built in functions return
> the same variance of the total.

Your finite population correction, at least at the second level
(students within schools), is 100%: every student in a sampled school
is observed, so there is no variance at all within schools (at least
no design variance, see the comment below). So what's left is the SRS
(or whatever your sampling scheme for schools was... I would use a
probability proportional to size sampling design there) of schools.
You need to compute the school averages, and treat them as i.i.d.
data. You can do it as is, or you can take your original data and
specify the design with 100% fpc. The total[1] is still a random
variable (= sample size * # schools in the population / # schools in
the sample, all of which are available to you), and the variance
estimator is still the first-order term of the Taylor series
linearization. Korn & Graubard give a very telling exercise/problem
where they show that with heavily unbalanced panels, even if you
sample all of them, you can get results that are quite notably biased.
The first-level fpc (1 - # schools in sample / # schools in
population) is still due, as I am sure that's not a negligible number.
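
The recipe in the paragraph above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: schools
are drawn by SRS without replacement, every student in a sampled
school is observed, and all weights are equal. The function name and
the toy data are hypothetical, and this is one common textbook form of
the linearized variance of a ratio mean, not necessarily the exact
computation any particular package performs.

```python
def cluster_mean_and_variance(clusters, n_clusters_in_population):
    """Ratio-type mean and its linearized variance for one-stage cluster sampling.

    clusters: list of lists, one inner list of scores per sampled school.
    n_clusters_in_population: number of schools in the population.
    """
    m = len(clusters)
    if m < 2:
        raise ValueError("need at least two sampled clusters")

    totals = [sum(c) for c in clusters]   # school totals of y
    sizes = [len(c) for c in clusters]    # school sizes n_j (random across samples)
    n = sum(sizes)
    mean = sum(totals) / n                # ratio of two estimated totals

    # Linearized residual per school: t_j - mean * n_j.
    # These sum to zero, so no further centering is needed.
    residuals = [t - mean * s for t, s in zip(totals, sizes)]

    fpc = 1 - m / n_clusters_in_population  # first-level fpc
    variance = fpc * m / (m - 1) * sum(z * z for z in residuals) / n ** 2
    return mean, variance


# Toy data: 3 sampled schools of unequal size, out of 6 in the population.
schools = [[2, 4], [6], [1, 2, 3]]
mean, variance = cluster_mean_and_variance(schools, 6)
print(mean, variance)
```

Only between-school variation enters the variance, and the school
sizes show up in the residuals because the denominator of the mean is
itself estimated.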

That's what the design paradigm prescribes you to do. You probably
won't like the idea of zero variance, and that naturally is
suspicious. What your intuition is telling you is that there is
measurement error, etc. Then what goes on in your head is that you
think about your results in terms of model, or superpopulation,
inference, which in this case amounts to ANOVA. On model vs.
design-based inference, read Binder and Roberts 2003
(http://www.citeulike.org/user/ctacmo/article/1036932).

Note that svymean() is by no means built-in though. SAS PROC
SURVEYMEANS is; Stata's -svy: mean- is, but in R, most of the stuff is
user-contributed :)). See if Tom Lumley has any comments about whether
his package supports 100% fpc :))

-- 
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.