[R] Design-consistent variance estimate

Mon Aug 18 16:53:14 CEST 2008

Whoops, the final variance estimator var(f(Y)) should have N^4 in the
denominator, not N^2.
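
For reference, here is a sketch of where the N^4 comes from, using the same symbols as the LaTeX quoted below. Carrying the two partial derivatives, 1/N and -Y/N^2, through the quadratic form should give:

```latex
\begin{equation}
var(f(Y)) = \frac{1}{N^2}var(Y) - \frac{2Y}{N^3}cov(Y,N) + \frac{Y^2}{N^4}var(N)
          = \frac{N^2var(Y) - 2cov(Y,N)NY + var(N)Y^2}{N^4}
\end{equation}
```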

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Doran, Harold
> Sent: Monday, August 18, 2008 10:47 AM
> To: Stas Kolenikov
> Cc: r-help at r-project.org
> Subject: Re: [R] Design-consistent variance estimate
> 
> It also turns out that in educational testing, it is rare to
> consider the sampling design and to estimate design-consistent
> standard errors. I appreciate your thoughts on this, Stas. As a
> result, I now have a much clearer picture of what R's survey
> package and SAS proc surveymeans are doing. I've copied some
> minimal LaTeX code below. My R code reflecting this LaTeX
> replicates svymean() and the SAS procedures exactly under all
> conditions that I have tested so far for a one-stage cluster sample.
> 
> It clearly reduces to a simpler expression when cluster
> sizes are equal.
> 
> My hat is off to sampling statisticians; this has got to be a
> lot of fun for you :)
> 
> ### LaTeX
> 
> \documentclass[12pt]{article}
> \usepackage{bm,geometry}
> \begin{document}
> 
> In this scenario, the appropriate procedure is to estimate 
> design-consistent standard errors. This is accomplished by 
> first defining the ratio estimator of the mean as:
> 
> \begin{equation}
> f(Y) = \frac{Y}{N}
> \end{equation}
> 
> \noindent where $Y$ is the total of the variable and $N$ is
> the population size. Treating both $Y$ and $N$ as random
> variables, a first-order Taylor series expansion of the ratio
> estimator $f(Y)$ can be used to derive the design-consistent
> variance estimator as:
> 
> \begin{equation}
> var(f(Y)) =
> \left[\frac{\partial f(Y)}{\partial Y}, \frac{\partial f(Y)}{\partial N}\right]
> \left[ \begin{array}{cc}
> var(Y)   & cov(Y,N)\\
> cov(Y,N) & var(N)\\
> \end{array} \right]
> \left[\frac{\partial f(Y)}{\partial Y}, \frac{\partial f(Y)}{\partial N}\right]^T
> \end{equation}
> 
> \noindent where
> 
> \begin{equation}
> \left[\frac{\partial f(Y)}{\partial Y}\right] = \frac{1}{N} 
> \end{equation}
> 
> \begin{equation}
> \left[\frac{\partial f(Y)}{\partial N}\right] = -\frac{Y}{N^2}
> \end{equation}
> 
> \begin{equation}
> var(Y) = \frac{k}{k-1} \sum_{j=1}^k(\hat{Y}_j-\hat{Y}_{..})^2
> \end{equation}
> 
> \begin{equation}
> \hat{Y}_j = \sum_{i=1}^{n_j}\hat{Y}_{j(i)} \end{equation}
> 
> \begin{equation}
> \hat{Y}_{..} = k^{-1} \sum_{j=1}^k \hat{Y}_j \end{equation}
> 
> \begin{equation}
> var(N) = \frac{k}{k-1} \sum_{j=1}^k(\hat{N}_j-\hat{N}_{..})^2
> \end{equation}
> 
> \begin{equation}
> \hat{N}_j = \sum_{i=1}^{n_j}\hat{N}_{j(i)} \end{equation}
> 
> \begin{equation}
> \hat{N}_{..} = k^{-1} \sum_{j=1}^k \hat{N}_j \end{equation}
> 
> \begin{equation}
> cov(Y,N) = \frac{k}{k-1} \sum_{j=1}^k(\hat{Y}_j-\hat{Y}_{..})(\hat{N}_j-\hat{N}_{..})
> \end{equation}
> 
> \noindent where $j$ indexes cluster $(1, 2, \ldots, k)$, 
> $j(i)$ indexes the $i$th member of cluster $j$, and $n_j$ is 
> the total number of members in cluster $j$. 
> 
> The estimate of the variance of $f(Y)$ is then taken as:
> 
> \begin{equation}
> var(f(Y)) = \frac{N^2var(Y) - 2cov(Y,N)NY + var(N)Y^2 }{N^2} 
> \end{equation}
> 
> The standard error is then taken as:
> 
> \begin{equation}
> se = \sqrt{var(f(Y))}
> \end{equation}
> 
> \end{document}
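
The derivation quoted above can be sketched numerically as follows. This is a minimal Python illustration, not the R code mentioned in the message, and it has not been checked against svymean() or SAS. It assumes that each \hat{Y}_{j(i)} is a weighted observation w*y and each \hat{N}_{j(i)} is the corresponding weight w; the function name `ratio_mean_se` and its arguments are hypothetical. It uses N^4 in the denominator of the final step, per the correction at the top of this message.

```python
from collections import defaultdict
from math import sqrt


def ratio_mean_se(y, w, cluster):
    """Ratio mean Y/N and its linearized SE for a one-stage cluster sample.

    Assumption: Y_hat_j(i) = w_i * y_i and N_hat_j(i) = w_i.
    """
    y_tot = defaultdict(float)  # Y_hat_j: weighted total of y in cluster j
    n_tot = defaultdict(float)  # N_hat_j: sum of weights in cluster j
    for yi, wi, cj in zip(y, w, cluster):
        y_tot[cj] += wi * yi
        n_tot[cj] += wi

    k = len(y_tot)  # number of clusters
    if k < 2:
        raise ValueError("need at least two clusters")

    Y = sum(y_tot.values())
    N = sum(n_tot.values())
    y_bar = Y / k  # Y_hat_.. : mean of the cluster totals
    n_bar = N / k  # N_hat_.. : mean of the cluster sizes
    scale = k / (k - 1)

    var_y = scale * sum((y_tot[j] - y_bar) ** 2 for j in y_tot)
    var_n = scale * sum((n_tot[j] - n_bar) ** 2 for j in y_tot)
    cov_yn = scale * sum(
        (y_tot[j] - y_bar) * (n_tot[j] - n_bar) for j in y_tot
    )

    # Delta-method variance of f = Y/N, with N^4 in the denominator.
    var_f = (N**2 * var_y - 2 * cov_yn * N * Y + var_n * Y**2) / N**4
    # Guard against tiny negative values caused by floating-point rounding.
    return Y / N, sqrt(max(var_f, 0.0))


if __name__ == "__main__":
    # Three equal-sized clusters with unit weights.
    mean, se = ratio_mean_se([1, 2, 3, 4, 5, 6], [1] * 6, list("aabbcc"))
    print(mean, se)
```

With equal cluster sizes and unit weights, var(N) and cov(Y,N) vanish, so the result reduces to the variance of the cluster means divided by k, which is the simpler expression the message alludes to.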
> 
> > -----Original Message-----
> > From: Stas Kolenikov [mailto:skolenik at gmail.com]
> > Sent: Monday, August 18, 2008 10:40 AM
> > To: Doran, Harold
> > Cc: r-help at r-project.org
> > Subject: Re: [R] Design-consistent variance estimate
> > 
> > On 8/16/08, Doran, Harold <HDoran at air.org> wrote:
> > > In terms of the "design" (which is a term used loosely) the schools
> > > were not randomly selected. They volunteered to participate in a
> > > pilot study.
> > 
> > Oh, that's a next level of disaster, then! You may have to work with
> > treatment effect models, of which there are many: propensity score
> > matching, nearest neighbor matching, instrumental variables, etc.
> > Those methods require asymptotics in terms of number of treatment
> > units, which would be schools -- and I would imagine those are
> > numbered in dozens rather than thousands in your study, so
> > straightforward application of those methods might be problematic...
> > At least I would augment my analysis with propensity score weights:
> > somehow estimate the (school level) probability of participating in
> > the study (I imagine you have the school characteristics at hand for
> > your complete universe of schools -- principal's education level,
> > # of computers per student, fraction free/reduced price lunch,
> > whatever... you probably know those better than I do :) ), and use
> > inverse of that probability as the probability weight. If the
> > selection was informative, you might see quite different results in
> > weighted and unweighted analysis.
> > 
> > > Wolter (1985) shows the variance of a cluster sample with a
> > > single stratum and then extends that to the more general example. It
> > > turns out, though, that in many educational assessment studies the
> > > single-stage cluster sample is the norm and not so rare.
> > 
> > I can see why. Thanks, I'll keep educational statistics examples in 
> > mind for those kinds of designs!
> > 
> > --
> > Stas Kolenikov, also found at http://stas.kolenikov.name
> > Small print: I use this email account for mailing lists only.
> > 