[R] Binomial multi-level (hierarchical) modelling [partly stats question, not completely R related]

Mark Wardle mark at wardle.org
Sat Jul 21 22:43:06 CEST 2007


Dear all,

This question is partly statistics and partly R and I apologise in
advance for my (usual) verbosity! My data is a little more complicated
that this suggests, but essentially I have proportion data from
different studies (id), each from a specific country and region of the
World. I would like to examine the variables that affect the
proportion, but these factors are hierarchical. In case I am not
explaining the problem correctly, I have created a simple analysis
with simulated data below!

My question is: How does one perform logistic regression using
interdependent (hieraerchical) variables and decide how much each
factor contributes to the variability?

Originally I wanted to look at whether there was only a regional
difference between studies, and simply aggregated the data and
performed a standard contingency analysis. However, I wanted to make
sure I wasn't aggregating and losing valuable covariates (?Simpson's
paradox...). Hence I thought I would be better off using binomial
regression instead, just to check other factors are not possible
confounders:

# At present I have a data frame with nested groups with proportion
data (counts and totals)
# make up some equivalent simulated data:
d1 <- data.frame(
	id    = factor(seq(1,20)),
	count = round(c(rnorm(10, mean=10, sd=4),rnorm(10, mean=20, sd=2)),0),
	total = round(rnorm(20, mean=50, sd=5),0),
	country = c('UK','UK','Fr','Fr','Sp','Sp',rep('Other',4),
				rep('Japan',8),rep('China',2)),
	region = c(rep('Europe',10), rep('Asia',10))
)
d1$remain <- with(d1, total-count)
summary(d1)	# show what we've done

# current analysis - simple aggregate and perform assessment of contigency table
d2 <- aggregate(d1[,2:3], by=list(d1$region), FUN=sum )
d2$remain <- with(d2, total-count)
d2.f <- fisher.test(matrix(c(d2$count, d2$remain), ncol=2))
print(d2.f)		# highly significant difference between europe and asia

# I liked the result of this - highly significant for both simulated
and real data, and easy to interpret for a stats-idiot like me!

# a better (?) analysis?
m1 <- glm(cbind(count, remain) ~ region, "binomial", data=d1)
summary(m1)

# with real data:
#    Call:
#    glm(formula = cbind(count, remain) ~ region, family = "binomial",
#        data = d1)
#
#    Deviance Residuals:
#           Min          1Q      Median          3Q         Max
#    -5.2427766  -2.4016616  -1.0373422  -0.0004486   4.9658610
#
#    Coefficients:
#                         Estimate Std. Error z value Pr(>|z|)
#    (Intercept)           -1.7402     0.1007 -17.287  < 2e-16 ***
#    regionAsia, Americas  -0.6781     0.2128  -3.187  0.00144 **
#    regionEurope          -0.8692     0.1665  -5.219 1.80e-07 ***
#    regionIndia          -16.6492   956.1380  -0.017  0.98611
#    regionOther            0.3107     0.2882   1.078  0.28096
#    ---
#    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#    (Dispersion parameter for binomial family taken to be 1)
#
#        Null deviance: 187.02  on 18  degrees of freedom
#    Residual deviance: 141.37  on 14  degrees of freedom
#      (2 observations deleted due to missingness)
#    AIC: 209.28
#
#    Number of Fisher Scoring iterations: 14




# *** but how does one include nested parameters (ie. decide how much
of the variability is dependent on individual study, country and
region? especially given that these factors are inter-dependent -
hierarchical.)

m2 <- glm(cbind(count, remain) ~ region + country + id, "binomial", data=d1)

# with the fake data, nothing is significant - how does one decide
which effect to remove?

# and with REAL data:
#         Null deviance: 1.8702e+02  on 18  degrees of freedom
#     Residual deviance: 5.4476e-10  on  0  degrees of freedom
#       (2 observations deleted due to missingness)
#     AIC: 95.914
#
#     Number of Fisher Scoring iterations: 22

# Therefore, take out "id" as this clearly isn't the correct approach!

m3 <- glm(cbind(count, remain) ~ region + country, "binomial", data=d1)
summary(m3)

# and with REAL data:
#      Call:
#      glm(formula = cbind(count, remain) ~ region + country, family =
"binomial",
#          data = d1)
#
#      Deviance Residuals:
#                Basu:2000          Brusco:2004           Filla:2000
Hadjivassiliou:2003         Juvonen:2005           Leggo:1997
Maruyama:2002         Nagaoka:1999         Onodera:2000
Pujana:1999          Saleem:2000          Schols:1997
#               -6.322e-08           -2.978e-01            3.468e-01
         7.383e-02            0.000e+00           -9.928e-02
 3.328e+00           -1.225e+00           -2.354e+00
-1.718e-07           -2.002e-04            4.215e-08
#            Silveira:2002           Soong:2001          Storey:2000
       Takano:1998            Tang:2000      Warrenburg:2002
Watanabe:1998
#                0.000e+00            0.000e+00           -1.421e-07
         0.000e+00           -2.019e-04            0.000e+00
-3.590e+00
#
#      Coefficients: (3 not defined because of singularities)
#                                Estimate Std. Error z value Pr(>|z|)
#      (Intercept)                -2.1102     0.3744  -5.637 1.73e-08 ***
#      regionAsia, Americas       -0.3080     0.4187  -0.736  0.46193
#      regionEurope               -0.8342     0.7007  -1.191  0.23385
#      regionIndia               -18.1705  4285.1131  -0.004  0.99662
#      regionOther                 0.6807     0.4616   1.475  0.14026
#      countryChina              -20.0407  4247.6466  -0.005  0.99624
#      countryFinland             -0.9268     1.1712  -0.791  0.42877
#      countryGermany              1.6833     0.6530   2.578  0.00994 **
#      countryIndia               -1.1087     1.0863  -1.021  0.30747
#      countryItaly               -1.3562     0.7773  -1.745  0.08104 .
#      countryJapan                0.5989     0.3893   1.538  0.12396
#      countryJapan, USA               NA         NA      NA       NA
#      countryNetherlands          1.2081     0.6209   1.946  0.05167 .
#      countryPortugal / Brazil   -1.7095     1.1664  -1.466  0.14273
#      countrySpain               -1.3182     1.1683  -1.128  0.25918
#      countryTaiwan                   NA         NA      NA       NA
#      countryUK                       NA         NA      NA       NA
#      ---
#      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#      (Dispersion parameter for binomial family taken to be 1)
#
#          Null deviance: 187.019  on 18  degrees of freedom
#      Residual deviance:  31.233  on  5  degrees of freedom
#        (2 observations deleted due to missingness)
#      AIC: 117.15
#
#      Number of Fisher Scoring iterations: 17

So I have evidence of marked overdispersion (if I read this
correctly!) with my real data.

It seems to me that this analysis needs a mixed model with
hierarchical terms and a binomial link function (?using nlme). Is this
possible? I tried using lme4's lmer() function like this:
library(lme4)
d1$prop <- with(d1, count/total)
me1 <- lmer(prop ~ country + (1|region) + (0+country|region),
family='binomial', weights=total, data=d1)

But I don't think I am using this properly and can't seem to determine
the variance due to the different hierarchical terms.

Can one use the interaction of country*region instead of using a mixed
model (where presumably country would be a fixed effect and region a
random effct?)
Any pointers much appreciated!

Many thanks,

Mark

P.S. while experimenting while creating the line:
> me1 <- lmer(prop ~ country + (1|region) + (0+country|region), family='binomial', weights=total, data=d1)

I accidentally pressed <RETURN> and issued the following command:

me1 <- lmer(prop ~ country + (0|region), family='binomial',
weights=total, data=d1)

which causes R to crash (and the application closes completely).

This is with R version

> sessionInfo()
R version 2.5.1 (2007-06-27)
i386-apple-darwin8.9.1

locale:
en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] "stats"     "graphics"  "grDevices" "datasets"  "utils"
"methods"   "base"

other attached packages:
        lme4       Matrix      lattice        RODBC
 "0.99875-4" "0.999375-0"    "0.15-11"      "1.2-1"


-- 
Dr. Mark Wardle
Clinical research fellow and specialist registrar, Neurology
Cardiff, UK



More information about the R-help mailing list