[R] Why do two different calls to mgcv::gamm give different p-values?

Mon Oct 1 08:15:06 CEST 2018

I have longitudinal data where a response y_{ij} has been measured
repeatedly for i=1,...,n individuals, j=1,...,m times for each individual.
Let x_{ij} be the age of individual i at her/his jth measurement. I wish to
see how the response varies with age, and have reason to believe that a
nonlinear fit is needed. I hence wish to model the relationship using an
additive mixed model of the form y_{ij} = f(x_{ij}) + b_{i} +
\epsilon_{ij}, where f() is some smooth function to be estimated, b_{i}
\sim N(0, \sigma_{b}^{2}) is the random effect for individual i,
\epsilon_{ij} \sim N(0, \sigma^{2}) is the residual.

Reading the documentation to the mgcv package, I see that such models can
be set up by calling the mgcv::gamm two different ways. One way shown to
set up such a model is with the call
b1 <- gamm(y ~ s(x), random = list(id =~ 1)),
where id is an indicator of the specific individual. In this way, the
random effect is specified in a list. The other way is to set up the random
effect with a smooth of the re:
b2 <- gamm(y ~ s(x) + s(id, bs = "re")).

As far as I can understand, these two setups should create the same
underlying model matrices. However, when running them on my data, the
p-values for the smooth terms, as well as their estimated degrees of
freedom, are different.

Below is a reproducible example on simulated data, which show that these
two types of specifying the models give different p-values and estimated
degrees of freedom. On my real data, these differences are sometimes even
more exaggerated.

My question is: Are not these two calls to gamm suppose to estimate the
same model, or have I misunderstood?

Here is a reproducible example:

library(mgcv)
set.seed(100)
n_sim <- 100
df <- data.frame(
  model1_pval = numeric(n_sim),
  model1_edf = numeric(n_sim),
  model2_pval = numeric(n_sim),
  model2_edf = numeric(n_sim)
)

# Number of observations
n <- 500
# Number of groups
ng <- 100

for(i in 1:n_sim){
  # Draw the fixed effect covariate
  x <- rnorm(n)
  # Set up the group
  id <- rep(1:ng, n/ng)
  # Draw the random effect
  z <- rnorm(ng)
  # Draw the response
  y <- 0.1 * x + 0.2 * x^3 + z[id] + rnorm(n)

  # Fit the two different models
  b1 <- gamm(y ~ s(x, k = 5), random = list(id =~ 1))
  b2 <- gamm(y ~ s(x, k = 5) + s(id, bs = "re"))

  df[i, "model1_pval"] <- anova(b1$gam)$s.pv
  df[i, "model1_edf"] <- anova(b1$gam)$edf
  df[i, "model2_pval"] <- anova(b2$gam)$s.pv[[1]]
  df[i, "model2_edf"] <- anova(b2$gam)$edf[[1]]
}

# Differences between model 1 and model 2 p values
mean(df$model1_pval)
#> [1] 6.790546e-21
mean(df$model2_pval)
#> [1] 3.090694e-14
max(abs(df$model1_pval - df$model2_pval))
#> [1] 2.770545e-12

# Differences between model1 and model 2 estimated degrees of freedom
mean(df$model1_edf)
#> [1] 3.829992
mean(df$model2_edf)
#> [1] 3.731438
max(abs(df$model1_edf - df$model2_edf))
#> [1] 0.6320281

Thanks in advance,
Øystein Sørensen

	[[alternative HTML version deleted]]