[R] {Spam?} Re: mgcv, testing gamm vs lme, which degrees of freedom?

Mon Jun 21 22:01:50 CEST 2010

Bert Gunter
Genentech Nonclinical Statistics

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Joris Meys
> Sent: Monday, June 21, 2010 12:09 PM
> To: Carlo Fezzi
> Cc: r-help at r-project.org
> Subject: Re: [R] {Spam?} Re: mgcv, testing gamm vs lme,which degrees of
> freedom?
> 
> Hi Carlo,
> 
> You should get the book of Simon Wood and read it thoroughly. It's all
> explained in there, but it will lead me too far to copy it all in a
> mail. In short : random effects are part of the error structure of the
> model, not of the model itself. They're added to correct the error on
> the parameters of the fixed model, and are inherent to the data
> structure and not to the hypotheses. Hence, you rarely test their
> significance.

1. This discussion probably belongs on the sig-mixed-models list.

2. Your claim is incorrect, I think. The structure of the random errors =
model covariance can be parameterized in various ways, and one can try to
test significance of nested parameterizations (for a particular fixed
effects parameterizaton). Whether you can do so meaningfully especially in
the gamm context --  is another issue, but if you check e.g. Bates and
Pinheiro, anova for different random effects parameterizations is advocated,
and is implemented in the anova.lme() nlme function. 

3. But I strongly endorse your suggestion to consult an authoritative
resource; I believe it inherently unreasonable (but alas not unusual) that
posters somehow expect brief explanations on this list to illuminate and
resolve complex statistical issues. 

-- Bert

> 
> Cheers
> Joris
> 
> 
> On Mon, Jun 21, 2010 at 6:54 PM, Carlo Fezzi <c.fezzi at uea.ac.uk> wrote:
> > Hi Joris (CC Simon),
> >
> > Thanks for your kind replies and for being so responsive.
> >
> > I think this post boils down to two main questions (which I feel are
> very
> > important for gams modelling):
> >
> > 1- Is it appropriate to use LR tests in "gamm" to test model reduction?
> > 2- If yes, which degrees of freedom should be used?
> >
> > I do not think we should always use the df from "model$lme". For
> example,
> > compare the two models (again my first example for the data generation):
> >
> > f1 <- gamm(y ~ s(x, k=2, bs="cr"), random = list(id=~1), method="ML" )
> > f2 <- gamm(y ~ s(x, k=10, bs="cr"), random = list(id=~1), method="ML" )
> >
> > The difference between the two models is in the random effects. Model
> "f2"
> > has, if I interpreet correctly the output, 7 random effects more than
> the
> > model "f1", but the fixed effects are the same. So the H0 = "the 7
> random
> > effect are not significant". In this case the (app.) likelihood ratio
> test
> > should have 7 df... is my interpretation correct?
> >
> > On the other hand, to compare the following models:
> >
> > f3 <- gamm(y ~ x + I(x^2), random = list(id=~1), method="ML" )
> > f2 <- gamm(y ~ s(x, k=10, bs="cr"), random = list(id=~1), method="ML" )
> >
> > Model "f3" has 1 more fixed effect than model "f2", but model "f2" has 7
> > more random effects... again, if I understand correctly the output. In
> > this case I don't know if we can do a LR test, the model are not
> strictly
> > nested I think...
> >
> > What do you think?
> >
> > Again many thanks,
> >
> > Carlo
> >
> >
> >
> >> I don't use an LR test for non-nested models, as I fail to formulate a
> >> sensible null hypothesis for such tests. Again, everything I write is
> >> a personal opinion, and inference in the case of these models is still
> >> subject of discussion to date. If you find a plausible way for
> >> explaining the result, by all means use the LR test.
> >>
> >> Personally, I'd go for the AIC / BIC, but these are based on the
> >> likelihood themselves. So in the case where the effective complexity
> >> of the model appears the same, they're completely equivalent to the
> >> likelihood. It's just the inference (i.e. the p-value) I don't trust.
> >> But then again, I'm a cautious statistician. If I'm not sure about a
> >> method, I'd rather don't use it and go with what I know. In my view,
> >> there is not one correct method for a particular problem and/or
> >> dataset. Every method makes assumptions and has shortcomings. Only if
> >> I know which ones, I can take them into account when interpreting the
> >> results.
> >>
> >> It also depends on the focus as well. If the focus is prediction, you
> >> might even want to consider testing whether the variance of the
> >> residuals differs significantly with a simple F-test. This indicates
> >> whether the predictive power differs significantly between the models.
> >> But these tests tend to get very sensitive when you have many
> >> datapoints, rendering them practically useless again.
> >>
> >> So in the end, it always boils down to interpretation.
> >>
> >> Cheers
> >> Joris
> >>
> >> On Fri, Jun 18, 2010 at 10:29 PM, Carlo Fezzi <c.fezzi at uea.ac.uk>
> wrote:
> >>> Thanks Joris,
> >>>
> >>> I understand your point regarding the need for the two models to be
> >>> nested. So, according to your in the example case the LR test is not
> >>> appropriate and the two model should be compared with other criteria
> >>> such
> >>> as AIC or BIC for example.
> >>>
> >>> On the other hand, Simon Wood indicated that such a LR test is
> >>> (approximately) correct in his previous reply... a am bit confused,
> >>> which
> >>> is the correct approach to test the two models? Is the LR test correct
> >>> only if the parametric model is linear in the x variables maybe? In
> this
> >>> case, which is the best appraoch to compare a "gamm" vs a "lme" with
> >>> quadratic specification?
> >>>
> >>> Best wishes,
> >>>
> >>> Carlo
> >>>
> >>>> Just realized something: You should take into account that the LR
> test
> >>>> is actually only valid for _nested_ models. Your models are not
> >>>> nested. Hence, you shouldn't use the anova function to compare them,
> >>>> and you shouldn't compare the df. In fact, if you're interested in
> the
> >>>> contribution of a term, then using anova to compare the model with
> >>>> that term and without that term gives you an answer on the hypothesis
> >>>> whether that term with spline contributes significantly to the model.
> >>>>
> >>>>> f2 <- gamm(y ~ s(x), random = list(id=~1), method="ML")
> >>>>
> >>>>> f3 <- gamm(y ~ x, random = list(id=~1), method="ML" )
> >>>>
> >>>>> f4 <- gamm(y ~ 1, random = list(id=~1), method="ML" )
> >>>>
> >>>>> anova(f3$lme,f2$lme)
> >>>>        Model df AIC BIC logLik   Test L.Ratio p-value
> >>>> f3$lme     1  4 760 770   -376
> >>>> f2$lme     2  5 381 394   -186 1 vs 2     380  <.0001
> >>>>
> >>>>> anova(f4$lme,f2$lme)
> >>>>        Model df AIC BIC logLik   Test L.Ratio p-value
> >>>> f4$lme     1  3 945 953   -470
> >>>> f2$lme     2  5 381 394   -186 1 vs 2     568  <.0001
> >>>>
> >>>>> anova(f3$lme,f4$lme)
> >>>>        Model df AIC BIC logLik   Test L.Ratio p-value
> >>>> f3$lme     1  4 760 770   -376
> >>>> f4$lme     2  3 945 953   -470 1 vs 2     188  <.0001
> >>>>
> >>>> This is the correct application of a likelihood ratio test. You see
> >>>> that adding the spline increases the df with 1 compared to the linear
> >>>> model, as part of the spline gets into the random component. Notice
> as
> >>>> well that the interpretation of a test in case of a random component
> >>>> is not the same as in case of a fixed component. If I understood
> >>>> correctly, this LR test specifically says something over the effect
> of
> >>>> X, without being interested in the shape of the spline. The
> >>>> "significance of a spline" is a difficult concept anyway, as a spline
> >>>> can be seen as a form of local regression. It's exactly the use of
> the
> >>>> randomization that allows for a general hypothesis about the added
> >>>> value of the spline, without focusing on its actual shape. Hence the
> >>>> "freedom" connected to that actual shape should not be used in the df
> >>>> used to test the general hypothesis.
> >>>>
> >>>> Hope this makes sense someway...
> >>>>
> >>>> Cheers
> >>>> Joris
> >>>>
> >>>>
> >>>> On Fri, Jun 18, 2010 at 6:27 PM, Carlo Fezzi <c.fezzi at uea.ac.uk>
> wrote:
> >>>>> Dear Simon,
> >>>>>
> >>>>> thanks a lot for your prompt reply.
> >>>>>
> >>>>> Unfortunately I am still confused about which is the correct way to
> >>>>> test
> >>>>> the two models... as you point out: why in my example the two models
> >>>>> have
> >>>>> the same degrees of freedom?
> >>>>>
> >>>>> Intuitively it seems to me the gamm model is more flexible since, as
> I
> >>>>> understand also from you response, it should contain more random
> >>>>> effects
> >>>>> than the other model because some of the smooth function parameters
> >>>>> are
> >>>>> represented as such. This should not be taken into account when
> >>>>> testing
> >>>>> one model vs the other?
> >>>>>
> >>>>> Continuing with my example, the two models:
> >>>>>
> >>>>> f2 <- gamm(y ~ s(x), random = list(id=~1), method="ML")
> >>>>> f3 <- gamm(y ~ x + I(x^2), random = list(id=~1), method="ML" )
> >>>>>
> >>>>> Can be tested with:
> >>>>>
> >>>>> anova(f3$lme,f2$lme)
> >>>>>
> >>>>> But why are the df the same? Model f2 appears to be more flexible
> and,
> >>>>> as
> >>>>> such, should have more (random) parameters. Should not a test of one
> >>>>> model
> >>>>> vs the other take this into account?
> >>>>>
> >>>>> Sorry if this may sound dull, many thanks for your help,
> >>>>>
> >>>>> Carlo
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Wednesday 16 June 2010 20:33, Carlo Fezzi wrote:
> >>>>>>> Dear all,
> >>>>>>>
> >>>>>>> I am using the "mgcv" package by Simon Wood to estimate an
> additive
> >>>>>>> mixed
> >>>>>>> model in which I assume normal distribution for the residuals. I
> >>>>>>> would
> >>>>>>> like to test this model vs a standard parametric mixed model, such
> >>>>>>> as
> >>>>>>> the
> >>>>>>> ones which are possible to estimate with "lme".
> >>>>>>>
> >>>>>>> Since the smoothing splines can be written as random effects, is
> it
> >>>>>>> correct to use an (approximate) likelihood ratio test for this?
> >>>>>> -- yes this is ok (subject to the usual caveats about testing on
> the
> >>>>>> boundary
> >>>>>> of the parameter space) but your 2 example models below will have
> >>>>>>  the
> >>>>>> same
> >>>>>> number of degrees of freedom!
> >>>>>>
> >>>>>>> If so,
> >>>>>>> which is the correct number of degrees of freedom?
> >>>>>> --- The edf from the lme object, if you are testing using the log
> >>>>>> likelihood
> >>>>>> returned by the  lme representation of the model.
> >>>>>>
> >>>>>>> Sometime the function
> >>>>>>> LogLik() seems to provide strange results regarding the number of
> >>>>>>> degrees
> >>>>>>> of freedom (df) for the gam, for instance in the example I copied
> >>>>>>> below
> >>>>>>> the df for the "gamm" are equal to the ones for the "lme", but the
> >>>>>>> summary(model.gam) seems to indicate a much higher edf for the
> gamm.
> >>>>>> --- the edf for the lme representation of the model counts only the
> >>>>>> fixed
> >>>>>> effects + the variance parameters (which includes smoothing
> >>>>>> parameters).
> >>>>>> Each
> >>>>>> smooth typically contributes only one or two fixed effect
> parameters,
> >>>>>> with
> >>>>>> the rest of the coefficients for the smooth treated as random
> >>>>>> effects.
> >>>>>>
> >>>>>> --- the edf for the gam representation of the same model differs in
> >>>>>> that
> >>>>>> it
> >>>>>> also counts the *effective* number of parameters used to represent
> >>>>>> each
> >>>>>> smooth: this includes contributions from all those coefficients
> that
> >>>>>> the
> >>>>>> lme
> >>>>>> representation treated as strictly random.
> >>>>>>
> >>>>>> best,
> >>>>>> Simon
> >>>>>>
> >>>>>>
> >>>>>>> I would be very grateful to anybody who could point out a
> solution,
> >>>>>>>
> >>>>>>> Best wishes,
> >>>>>>>
> >>>>>>> Carlo
> >>>>>>>
> >>>>>>> Example below:
> >>>>>>>
> >>>>>>> ----
> >>>>>>>
> >>>>>>> rm(list = ls())
> >>>>>>> library(mgcv)
> >>>>>>> library(nlme)
> >>>>>>>
> >>>>>>> set.seed(123)
> >>>>>>>
> >>>>>>> x  <- runif(100,1,10)                                # regressor
> >>>>>>> b0 <- rep(rnorm(10,mean=1,sd=2),each=10)     # random intercept
> >>>>>>> id <- rep(1:10, each=10)                     # identifier
> >>>>>>>
> >>>>>>> y <- b0 + x - 0.1 * x^3 + rnorm(100,0,1)  # dependent variable
> >>>>>>>
> >>>>>>> f1 <- lme(y ~ x + I(x^2), random = list(id=~1) , method="ML" )  #
> >>>>>>> lme
> >>>>>>> model
> >>>>>>>
> >>>>>>> f2 <- gamm(y ~ s(x), random = list(id=~1), method="ML" )    # gamm
> >>>>>>>
> >>>>>>> ## same number of "df" according to logLik:
> >>>>>>> logLik(f1)
> >>>>>>> logLik(f2$lme)
> >>>>>>>
> >>>>>>> ## much higher edf according to summary:
> >>>>>>> summary(f2$gam)
> >>>>>>>
> >>>>>>> -----------
> >>>>>>>
> >>>>>>> ______________________________________________
> >>>>>>> R-help at r-project.org mailing list
> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>> PLEASE do read the posting guide
> >>>>>>> http://www.R-project.org/posting-guide.html and provide commented,
> >>>>>>> minimal,
> >>>>>>> self-contained, reproducible code.
> >>>>>>
> >>>>>> --
> >>>>>>> Simon Wood, Mathematical Sciences, University of Bath, Bath, BA2
> 7AY
> >>>>>>> UK
> >>>>>>> +44 1225 386603  www.maths.bath.ac.uk/~sw283
> >>>>>>
> >>>>>
> >>>>> ______________________________________________
> >>>>> R-help at r-project.org mailing list
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>> PLEASE do read the posting guide
> >>>>> http://www.R-project.org/posting-guide.html
> >>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Joris Meys
> >>>> Statistical consultant
> >>>>
> >>>> Ghent University
> >>>> Faculty of Bioscience Engineering
> >>>> Department of Applied mathematics, biometrics and process control
> >>>>
> >>>> tel : +32 9 264 59 87
> >>>> Joris.Meys at Ugent.be
> >>>> -------------------------------
> >>>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >> --
> >> Joris Meys
> >> Statistical consultant
> >>
> >> Ghent University
> >> Faculty of Bioscience Engineering
> >> Department of Applied mathematics, biometrics and process control
> >>
> >> tel : +32 9 264 59 87
> >> Joris.Meys at Ugent.be
> >> -------------------------------
> >> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
> >>
> >
> >
> >
> 
> 
> 
> --
> Joris Meys
> Statistical consultant
> 
> Ghent University
> Faculty of Bioscience Engineering
> Department of Applied mathematics, biometrics and process control
> 
> tel : +32 9 264 59 87
> Joris.Meys at Ugent.be
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.