[R] Design-consistent variance estimate

Doran, Harold HDoran at air.org
Mon Aug 18 16:46:43 CEST 2008


It also turns out that in educational testing it is rare to consider
the sampling design and to estimate design-consistent standard errors. I
appreciate your thoughts on this, Stas. They helped me get a much clearer
picture of what R's survey package and SAS proc surveymeans are doing.
I've copied some minimal LaTeX code below. My R code implementing this
LaTeX replicates svymean() and the SAS procedure exactly under all
conditions that I have tested so far for a one-stage cluster sample.

It clearly reduces to a simpler expression when cluster sizes are
equal.
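
To spell out that reduction, here is a sketch in the notation of the
LaTeX further down. It assumes the estimated cluster sizes $\hat{N}_j$
are identical across clusters (for instance equal cluster sizes with
equal weights; the equal-weights part is an assumption added here, not
something stated above). The variance and covariance terms involving $N$
then vanish and only the $var(Y)$ term of the quadratic form survives:

```latex
% Assumes \hat{N}_j = \hat{N}_{..} for every cluster j
% (equal cluster sizes and equal weights).
\begin{equation}
var(N) = 0, \qquad cov(Y,N) = 0
\end{equation}

\begin{equation}
var(f(Y)) = \frac{var(Y)}{N^2}
          = \frac{1}{N^2} \, \frac{k}{k-1} \sum_{j=1}^k(\hat{Y}_j-\hat{Y}_{..})^2
\end{equation}
```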

My hat is off to sampling statisticians; this has got to be a lot of fun
for you :)

### LaTeX

\documentclass[12pt]{article}
\usepackage{bm,geometry}
\begin{document}

In this scenario, the appropriate procedure is to estimate
design-consistent standard errors. This is accomplished by first
defining the ratio estimator of the mean as:

\begin{equation}
f(Y) = \frac{Y}{N}
\end{equation}

\noindent where $Y$ is the estimated total of the variable and $N$ is the
estimated population size. Because both are estimated from the sample,
both are treated as random variables, and a first-order Taylor series
expansion of the ratio estimator $f(Y)$ can be used to derive the
design-consistent variance estimator as:

\begin{equation}
var(f(Y))  = \left[\frac{\partial f(Y)}{\partial Y}, \frac{\partial
f(Y)}{\partial N}\right] 
\left [
\begin{array}{cc}
var(Y)     & cov(Y,N)\\
cov(Y,N)   & var(N)\\
\end{array}
\right]
\left[\frac{\partial f(Y)}{\partial Y}, \frac{\partial f(Y)}{\partial
N}\right]^T
\end{equation}

\noindent where

\begin{equation}
\left[\frac{\partial f(Y)}{\partial Y}\right] = \frac{1}{N}
\end{equation}

\begin{equation}
\left[\frac{\partial f(Y)}{\partial N}\right] = - \frac{Y}{N^2}
\end{equation}

\begin{equation}
var(Y) = \frac{k}{k-1} \sum_{j=1}^k(\hat{Y}_j-\hat{Y}_{..})^2
\end{equation}

\begin{equation}
\hat{Y}_j = \sum_{i=1}^{n_j}\hat{Y}_{j(i)}
\end{equation}

\begin{equation}
\hat{Y}_{..} = k^{-1} \sum_{j=1}^k \hat{Y}_j
\end{equation}

\begin{equation}
var(N) = \frac{k}{k-1} \sum_{j=1}^k(\hat{N}_j-\hat{N}_{..})^2
\end{equation}

\begin{equation}
\hat{N}_j = \sum_{i=1}^{n_j}\hat{N}_{j(i)}
\end{equation}

\begin{equation}
\hat{N}_{..} = k^{-1} \sum_{j=1}^k \hat{N}_j
\end{equation}

\begin{equation}
cov(Y,N) = \frac{k}{k-1} \sum_{j=1}^k(\hat{Y}_j- \hat{Y}_{..})
(\hat{N}_j- \hat{N}_{..})
\end{equation}

\noindent where $j$ indexes cluster $(1, 2, \ldots, k)$, $j(i)$ indexes
the $i$th member of cluster $j$, and $n_j$ is the total number of
members in cluster $j$. 

The estimate of the variance of $f(Y)$ is then obtained by multiplying
out the quadratic form above, which puts $N^4$ in the denominator:

\begin{equation}
var(f(Y)) = \frac{N^2var(Y) - 2cov(Y,N)NY + var(N)Y^2 }{N^4}
\end{equation}

The standard error is then taken as:

\begin{equation}
se = \sqrt{var(f(Y))}
\end{equation}

\end{document}
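
For readers who want to check the formulas numerically, here is a
minimal sketch in Python (not the R code mentioned above). It assumes a
single stratum and a one-stage cluster sample, and it reads
$\hat{Y}_{j(i)}$ as weight times observed value and $\hat{N}_{j(i)}$ as
the weight itself; that reading, the function name cluster_ratio_mean,
and its arguments are assumptions made for illustration. The variance is
the quadratic form with gradient $(1/N, -Y/N^2)$, which gives $N^4$ in
the denominator once multiplied out.

```python
from math import sqrt


def cluster_ratio_mean(values, weights, clusters):
    """Ratio mean and delta-method SE for a one-stage cluster sample.

    values, weights and clusters are parallel sequences holding each
    sampled unit's observed value, sampling weight and cluster label.
    Hypothetical helper: a single stratum is assumed.
    """
    # Weighted cluster totals: Y_j = sum_i w_i * y_i and N_j = sum_i w_i.
    y_tot, n_tot = {}, {}
    for y, w, c in zip(values, weights, clusters):
        y_tot[c] = y_tot.get(c, 0.0) + w * y
        n_tot[c] = n_tot.get(c, 0.0) + w

    k = len(y_tot)
    if k < 2:
        raise ValueError("need at least two clusters to estimate a variance")

    labels = list(y_tot)
    yj = [y_tot[c] for c in labels]
    nj = [n_tot[c] for c in labels]

    Y, N = sum(yj), sum(nj)      # estimated total and population size
    ybar, nbar = Y / k, N / k    # means of the cluster totals
    scale = k / (k - 1)

    var_y = scale * sum((a - ybar) ** 2 for a in yj)
    var_n = scale * sum((b - nbar) ** 2 for b in nj)
    cov_yn = scale * sum((a - ybar) * (b - nbar) for a, b in zip(yj, nj))

    # Quadratic form g' V g with g = (1/N, -Y/N^2), multiplied out.
    var_mean = (N**2 * var_y - 2 * N * Y * cov_yn + Y**2 * var_n) / N**4

    # Guard against a tiny negative value caused by floating-point rounding.
    return Y / N, sqrt(max(var_mean, 0.0))


# Made-up data: three clusters of unequal size, all weights equal to 1.
mean, se = cluster_ratio_mean(
    values=[1, 2, 3, 4, 5, 6],
    weights=[1, 1, 1, 1, 1, 1],
    clusters=["a", "a", "b", "b", "b", "c"],
)
print(mean, se)
```

The same variance can be written with linearized residuals
$e_j = \hat{Y}_j - f(Y)\hat{N}_j$ as $\frac{1}{N^2}\frac{k}{k-1}\sum_j e_j^2$,
which is a convenient independent check on the quadratic form.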

> -----Original Message-----
> From: Stas Kolenikov [mailto:skolenik at gmail.com] 
> Sent: Monday, August 18, 2008 10:40 AM
> To: Doran, Harold
> Cc: r-help at r-project.org
> Subject: Re: [R] Design-consistent variance estimate
> 
> On 8/16/08, Doran, Harold <HDoran at air.org> wrote:
> > In terms of the "design" (which is a term used loosely) the schools 
> > were not randomly selected. They volunteered to participate in a
> > pilot study.
> 
> Oh, that's a next level of disaster, then! You may have to 
> work with treatment effect models, of which there are many: 
> propensity score matching, nearest neighbor matching, 
> instrumental variables, etc.
> Those methods require asymptotics in terms of number of 
> treatment units, which would be schools -- and I would 
> imagine those are numbered in dozens rather than thousands in 
> your study, so straightforward application of those methods 
> might be problematic...
> At least I would augment my analysis with propensity score weights:
> somehow estimate the (school level) probability of 
> participating in the study (I imagine you have the school 
> characteristics at hand for your complete universe of schools 
> -- principal's education level, # of computers per student, 
> fraction free/reduced price lunch, whatever...
> you probably know those better than I do :) ), and use 
> inverse of that probability as the probability weight. If the 
> selection was informative, you might see quite different 
> results in weighted and unweighted analysis.
> 
> > In Wolter (1985) he shows the variance of a cluster sample with a 
> > single stratum and then extends that to the more general example. It 
> > turns out though in many educational assessment studies, the single 
> > stage cluster sample is the norm and not so rare.
> 
> I can see why. Thanks, I'll keep educational statistics 
> examples in mind for those kinds of designs!
> 
> --
> Stas Kolenikov, also found at http://stas.kolenikov.name 
> Small print: I use this email account for mailing lists only.
> 


