[R] fixed effects following lmer and mcmcsamp - which to present?
Douglas Bates
bates at stat.wisc.edu
Tue Aug 8 17:45:56 CEST 2006
On 8/8/06, Henrik Parn <henrik.parn at bio.ntnu.no> wrote:
> Dear all,
>
> I am running a mixed model using lmer. In order to obtain CI of
> individual coefficients I use mcmcsamp. However, I need advice which
> values that are most appropriate to present in result section of a
> paper. I have not used mixed models and lmer so much before so my
> question is probably very naive. However, to avoid to much problems with
> journal editors and referees addicted to p-values, I would appreciate
> advice of which values of the output for the fixed factor that would be
> most appropriate to present in a result section, in order to convince
> them of the p-value free 'lmer-mcmcsamp'-approach!
>
> Using the example from the help page on lmer:
>
> fm1 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
>
> ...I obtain the following for 'Days':
>
>
> summary(fm1)
> ...
> Estimate Std. Error t value
>
> Days 10.4673 1.5458 6.771
>
>
> ...and from mcmcsamp:
>
> summary(mcmcsamp(fm1 , n = 10000))
>
> 1. Empirical mean and standard deviation for each variable,
> plus standard error of the mean:
>
> Mean SD Naive SE Time-series SE
> Days 10.4695 1.7354 0.017354 0.015921
>
> 2. Quantiles for each variable:
> 2.5% 25% 50% 75% 97.5%
> Days 7.0227 9.3395 10.4712 11.5719 13.957
>
>
>
> The standard way of presenting coefficients following a 'non-lmer'
> output is often (beta=..., SE=..., statistic=..., P=...). What would be
> the best equivalent in a 'lmer-mcmcsamp-context'? (beta=..., CI=...) is
> a good start I believe. But which beta? And what else?
>
> I assume that the a 95% CI in this case would be 7.0227-13.957 (please,
> do correct me I have completely misunderstood!). But which would be the
> corresponding beta? 10.4673?, 10.4695? 10.4712? Is the t-value worth
> presenting or is it 'useless' without corresponding degrees of freedom
> and P-value? If I present the mcmcsamp-CI, does it make sense to present
> any of the three SE obtained in the output above? BTW, I have no idea
> what Naive SE, Time-series SE means. Could not find much in help and
> pdfs to coda or Matrix, or in Google.
I would definitely report the REML value 10.4673 as the estimate.
This is a reproducible estimate - the other two are stochastic in that
you will get different values from different mcmcsamp runs. However,
the reproducibility is high. Notice that they all round to 10.47 and
two decimal places is quite enough precision for reporting this value
when you consider that a 95% confidence interval is [7.02,13.96].
Another way of evaluating a confidence interval using the coda package
is the HPDinterval function. (HPD stands for "Highest Posterior
Density") It returns the shortest interval with a 95% probability
content in the empirical distribution.
Here are values from a sample that I evaluated
> data(sleepstudy)
> fm1 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
> fs1 <- mcmcsamp(fm1, 10000)
> library(coda)
> summary(fs1)
...
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
Days 10.4810 1.6910 0.016910 0.015762
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
Days 7.0975 9.3971 10.4851 11.5751 13.810
> fixef(fm1)
(Intercept) Days
251.40510 10.46729
> HPDinterval(fs1)
lower upper
Days 7.038076 13.734493
attr(,"Probability")
[1] 0.95
If you would like to use a t-distribution to calculate a confidence
interval, I would argue that the degrees of freedom for the t should
be somewhere between n - p = 178 in this case (n = number of
observations, p = number of coefficients in the fixed effects or
rank(X) where X is the model matrix for the fixed effects) and n -
rank([Z:X]) = 144. (there are 36 random effects and 2 fixed effects
but the rank of [Z:X] = 36)
If we use the lower value of 144 we produce a confidence interval of
> 10.4673 + c(-1,1) * qt(0.975, df = 144) * 1.5458
[1] 7.41191 13.52269
Notice that this interval is contained in the HPD interval and in the
interval obtained from the 2.5% and 97.5% quantiles of the empirical
distribution. I attribute this to the fact that the standard errors
are calculated conditional on the variance of the random effects.
Thus the t-based confidence interval takes into account imprecision of
the estimate of \sigma^2 but it assumes that the variance of the
random effects is fixed at the estimated values. (That's not quite
true - it is the relative variance that is assumed fixed - but the
effect is the same.) The MCMC sample allows all the parameters to vary
and I would claim is therefore a better measure of the marginal
variability of this parameter.
However, as stated above, the HPD interval is stochastic. I would
create more than one sample to check the reproducibility of the
intervals. In this case intervals based on a chain of length 10000
are not wonderfully consistent
Days 6.9389037 13.851661
Days 6.9968306 13.768851
and I might go to chains of length 50000 to check further
Days 7.0880422 13.903068
Days 7.0758948 13.965146
For the benefit of editors of a biological journal I think I would
report both the interval derived from the t-distribution and a
representative interval from the MCMC samples to show that the MCMC
samples show variation in excess of that reported by the t interval.
I hope this helps.
--Doug
More information about the R-help
mailing list