[R] Exactly Replicating Stata's Survey Data Confidence Intervals in R

Thomas Lumley tlumley at uw.edu
Tue Sep 25 03:26:31 CEST 2012


 In the first and third examples it looks as though confint(svymean())
matches the totals and svyciprop(method="logit") matches the
proportions, which is what Stata says
(http://www.stata.com/statalist/archive/2006-10/msg01127.html).  The
agreement isn't perfect for the counts in the first example, but since
Stata's default numeric type is single-precision, it's accurate
enough.

In the second example the issue is that svyciprop(method="logit")
doesn't take the design degrees of freedom into account, and Stata
does.

m<-svyglm(I(val==1)~1,family=quasibinomial,design=y)
expit<-function(eta) exp(eta)/(1+exp(eta))
expit(coef(m)+qt(0.975,99)*SE(m))

 (Intercept)
0.2971778536

expit(coef(m)-qt(0.975,99)*SE(m))

 (Intercept)
0.1287770109

I'll add the df= argument to svyciprop(method="logit"), and then it
should all match.

   -thomas


On Tue, Sep 25, 2012 at 5:56 AM, Anthony Damico <ajdamico at gmail.com> wrote:
> Hi Dr. Lumley, you're obviously correct about all of that.  Thank you for
> cluing me into it!  And sorry for overlooking that part of the
> documentation.
>
> I'm unfortunately still struggling with matching numbers exactly, and I
> foolishly provided a dataset without a weight variable - thinking there was
> only one issue preventing R from matching Stata precisely.  Adding the df=
> to the confint() call seems to have only solved half of the problem.
>
> If you (or anyone else) still has any energy to look at this issue, I've
> pasted three analysis examples with both Stata and R code that aim to
> calculate a survey-adjusted confidence interval.
>
> All three examples match coefficient and standard error values precisely,
> for both weighted counts and percents.  In the first example, the confidence
> intervals don't match for either the counts or the percents.  In the second,
> the CIs match for counts but not percents.  In the third, CIs match for
> both.  Because of this erratic behavior on relatively straightforward
> datasets, I'm worried that Stata is making some weird (non-reproducible)
> calculations that would obviously be outside of the scope of this list.
>
> Thanks!!
>
>
> # # # # # # # #
> # example 1 #
> # # # # # # # #
>
> # simple example of confidence intervals not matching between stata and R
> # this example doesn't match CIs for counts or percents
>
> options( digits = 10 )
> library(foreign)
> library(survey)
> x <- read.dta(
> "http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop.dta" )
> x$pw <- 6194/310
> y <- svydesign( ~1 , data = x , weights = ~pw )
>
> svytotal( ~factor( stype ) , y )
> confint( svytotal( ~factor( stype ) , y ) , df = degf( y ) )
>
> svymean( ~factor( stype ) , y )
> confint( svymean( ~factor( stype ) , y ) , df = degf( y ) )
>
>
>
>
>
> use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop,
> clear
> gen pw = 6194/310
> svyset [pweight=pw]
> svy: tabulate stype, count se ci format (%12.3f)
> svy: tabulate stype, percent se ci format (%12.3f)
>
>
> ----------------------------------------------------------
>     stype |      count          se          lb          ub
> ----------+-----------------------------------------------
>         E |  88334.428     710.843   86940.929   89727.927
>         H |  15085.386     514.508   14076.773   16094.000
>         M |  20340.296     582.814   19197.778   21482.814
>
>
> --------------------------------------------------------------
>     stype | percentages           se           lb           ub
> ----------+---------------------------------------------------
>         E |      71.376        0.574       70.236       72.488
>         H |      12.189        0.416       11.397       13.028
>         M |      16.435        0.471       15.533       17.379
>
>
>
>
> # # # # # # # #
> # example 2 #
> # # # # # # # #
>
>
>
> # simple example of confidence intervals not matching between stata and R
> # this example does match CIs for counts but does not match for percents
>
> options(digits=10)
> library(foreign)
> library(survey)
>
> # create a sample data frame with 100 records, containing values 1 - 5 and
> weights 1 or 2
> x <- data.frame( case = 1:100 , val = 1:5 , weight = 1:2 )
> # immediately save it externally to be read into stata for comparison
> write.dta( x , "c:/my directory/hundred record example.dta" )
>
> # create a simple survey design, with weights only
> y <- svydesign( ~1 , data = x , weights = ~weight )
>
> # the confidence interval for the weighted counts do match stata
> svytotal( ~I( val == 1 ) , y )
> confint( svytotal( ~I( val == 1 ) , y ) , df = degf( y ) )
>
> # generate the mean & standard error -- these two match precisely..
> svymean( ~I( val == 1 ) , y )
>
> # # # # # # # # # # #
>
> # ..but none of the confidence interval attempts match
> confint( svymean( ~I( val == 1 ) , y ) )
> confint( svymean( ~I( val == 1 ) , y ) , df = degf( y ) )
> svyciprop(~I( val == 1 ) , y, method="li" , df = degf( y ) )
> svyciprop(~I( val == 1 ) , y, method="lo", df = degf( y ) )
> svyciprop(~I( val == 1 ) , y, method="as", df = degf( y ) )
> svyciprop(~I( val == 1 ) , y, method="be", df = degf( y ) )
>
> # # # # # # # # # # #
>
>
> **************************************
>
> * run the R code first to create this dta file
> use c:\my directory\hundred record example.dta , clear
> svyset [pweight=weight]
> svy: tabulate val, count se ci
> svy: tabulate val, percent se ci
>
> Number of strata   =         1                  Number of obs      =
> 100
> Number of PSUs     =       100                  Population size    =
> 150
>                                                 Design df          =
> 99
>
> ----------------------------------------------------------
>       val |      count          se          lb          ub
> ----------+-----------------------------------------------
>         1 |         30       6.435       17.23       42.77
>         2 |         30       6.435       17.23       42.77
>         3 |         30       6.435       17.23       42.77
>         4 |         30       6.435       17.23       42.77
>         5 |         30       6.435       17.23       42.77
>
>
> Number of strata   =         1                  Number of obs      =
> 100
> Number of PSUs     =       100                  Population size    =
> 150
>                                                 Design df          =
> 99
>
> --------------------------------------------------------------
>       val | percentages           se           lb           ub
> ----------+---------------------------------------------------
>         1 |          20        4.238        12.88        29.72
>         2 |          20        4.238        12.88        29.72
>         3 |          20        4.238        12.88        29.72
>         4 |          20        4.238        12.88        29.72
>         5 |          20        4.238        12.88        29.72
>
>
>
>
>
> # # # # # # # #
> # example 3 #
> # # # # # # # #
>
>
> # this example matches CIs for both
>
> library(foreign)
> library(survey)
>
> x <- read.dta( "http://www.stata-press.com/data/r11/nhanes2f.dta" )
> y <- svydesign( ~1 , data = x , weights = ~finalwgt )
> svymean( ~sex , y )
> confint( svymean( ~sex , y ) , df = degf( y ) )
>
>              mean     SE
> sexMale   0.47958 0.0059
> sexFemale 0.52042 0.0059
>               2.5 %    97.5 %
> sexMale   0.4681058 0.4910512
> sexFemale 0.5089488 0.5318942
>
> **************************************
>
> use http://www.stata-press.com/data/r11/nhanes2f, clear
> svyset [pweight=finalwgt]
> svy: tabulate sex, percent se ci
> (running tabulate on estimation sample)
>
> Number of strata   =         1                 Number of obs      =
> 10337
> Number of PSUs     =     10337                 Population size    =
> 117023659
>                                                Design df          =
> 10336
>
> --------------------------------------------------------------
> 1=male,   |
> 2=female  | percentages           se           lb           ub
> ----------+---------------------------------------------------
>      Male |       47.96        .5853        46.81        49.11
>    Female |       52.04        .5853        50.89        53.19
>
>
>
>
>
>
> On Sun, Sep 23, 2012 at 4:30 PM, Thomas Lumley <tlumley at uw.edu> wrote:
>>
>> On Sat, Sep 22, 2012 at 2:51 AM, Anthony Damico <ajdamico at gmail.com>
>> wrote:
>>
>> > Survey: Mean estimation
>> >
>> > Number of strata =       1          Number of obs    =     183
>> > Number of PSUs   =      15          Population size  =  9235.4
>> >                                     Design df        =      14
>> >
>> > --------------------------------------------------------------
>> >              |             Linearized
>> >              |       Mean   Std. Err.     [95% Conf. Interval]
>> > -------------+------------------------------------------------
>> >         ell0 |   .0218579   .0226225     -.0266624    .0703783
>> > --------------------------------------------------------------
>>
>> This matches
>>
>> > svyciprop(~I(ell==0),dclus1,df=14,method="mean")
>>                              2.5%   97.5%
>> I(ell == 0)  0.0218579 -0.0266624 0.07038
>>
>> as does this
>>
>> > confint(svymean(~I(ell==0),dclus1),df=14)
>>                           2.5 %        97.5 %
>> I(ell == 0)FALSE  0.92962174505 1.02666240796
>> I(ell == 0)TRUE  -0.02666240796 0.07037825495
>>
>> The df= argument is not explicitly documented in ?svyciprop, but it is
>> in ?confint.svystat and ?svymean
>>
>> [I was slowed down a bit by the claim that the Stata intervals were
>> asymmetric, but in fact they aren't]
>>
>>    -thomas
>>
>> --
>> Thomas Lumley
>> Professor of Biostatistics
>> University of Auckland
>
>



-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland




More information about the R-help mailing list