[R] mgcv (bam) very large standard error difference between versions 1.7-11 and 1.7-17, bug?

Tue Jun 5 11:06:28 CEST 2012

Dear useRs, Simon,

@Simon: I'll send the data offline.
The command summary(...,freq=F) yields the same result as before. Why
did the default change from freq=F in version 1.7-13 to freq=T in
version 1.7-17? Especially since the original default appeared to be
better.

I noticed that some of my commands were slightly off in the R-example
of my previous e-mail (cut-paste errors...). They should be:

modelLMER <- lmer(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
SpIsMale + (1|Key), data=wrddst)
print(modelLMER,cor=F)

# using version 1.7-13, default = "REML"
modelBAMold <- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
SpIsMale + s(Key,bs="re"), data=wrddst)
summary(modelBAMold)

# using version 1.7-17, explicitly stating method="REML"
modelBAMnew <- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
SpIsMale + s(Key,bs="re"), data=wrddst, method="REML")
summary(modelBAMnew)
summary(modelBAMnew,freq=F) # same results as modelBAMold

# using version 1.7-17, default: fREML - contains a bug
modelBAMfREML <- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
SpIsMale + s(Key,bs="re"), data=wrddst)
summary(modelBAMfREML)

With kind regards,
Martijn Wieling

--
*******************************************
Martijn Wieling
http://www.martijnwieling.nl
wieling at gmail.com
+31(0)614108622
*******************************************
University of Groningen
http://www.rug.nl/staff/m.b.wieling
*******************************************

On Sun, Jun 3, 2012 at 7:45 PM, Martijn Wieling <wieling at gmail.com> wrote:
> Dear useRs,
>
> I've ran some additional analyses (see below), which strongly suggest
> the standard errors of the bam (and gam) function are much too low in
> mgcv version 1.7-17, at least when including an s(X,bs="re") term.
> Until this issue has been clarified, it's perhaps best to use an older
> version of mgcv (unfortunately, however, in earlier versions the
> p-value calculation of s(X,bs="re") is not correct). All analyses were
> conducted in R 2.15.0.
>
> My approach was the following: I created a mixed-effects regression
> model with a single random intercept and only linear predictors. In my
> view, the results using lmer (lme4) should be comparable to those of
> bam and gam (mgcv). This was the case when using an older version of
> mgcv (version 1.7-13), but this is not the case anymore in version
> 1.7-17. In version 1.7-17, the standard errors and p-values are much
> lower and very similar to those of a linear model (which does not take
> the random-effects structure into account). The R-code and results are
> shown below. (The results using gam are not shown, but show the same
> pattern.)
>
> Furthermore, note that the differences in standard errors become less
> severe (but still noticeable) when less data is involved (e.g., using
> only 500 rows as opposed to >100.000). Finally, when not including an
> s(X,bs="re") term, but another non-random-effect smooth, the standard
> errors do not appear to be structurally lower (only for some
> variables, but not by a great deal - see also below).
>
> With kind regards,
> Martijn Wieling
> University of Groningen
>
> #### lme4 model (most recent version of lme4)
> modelLMER <- lmer(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
> SpIsMale + (1|Key), data=wrddst)
> #                        Estimate Std. Error t value
> #SpYearBirth.z          -0.012084   0.004577  -2.640
> #IsAragon                0.138959   0.010040  13.840
> #SpIsMale               -0.003087   0.008290  -0.372
> #SpYearBirth.z:IsAragon  0.015429   0.010159   1.519
>
>
> #### mgcv 1.7-13, default (method = "REML") - almost identical to modelLMER
> modelBAMold <- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
> SpIsMale + s(Key,bs="re"), data=wrddst)
> #                        Estimate Std. Error t value Pr(>|t|)
> #SpYearBirth.z          -0.012084   0.004578  -2.640  0.00829 **
> #IsAragon                0.138959   0.010042  13.838  < 2e-16 ***
> #SpIsMale               -0.003087   0.008292  -0.372  0.70968
> #SpYearBirth.z:IsAragon  0.015429   0.010160   1.519  0.12886
>
>
> #### mgcv 1.7-17, method = "REML" - standard errors greatly reduced
> # (comparable to standard errors of LM without random intercept)
> modelBAMnew <- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
> SpIsMale + s(Key,bs="re"), data=wrddst); print(testje,cor=F)
> #                        Estimate Std. Error t value Pr(>|t|)
> #SpYearBirth.z          -0.012084   0.001159 -10.428  < 2e-16 ***
> #IsAragon                0.138959   0.002551  54.472  < 2e-16 ***
> #SpIsMale               -0.003087   0.002098  -1.471    0.141
> #SpYearBirth.z:IsAragon  0.015429   0.002587   5.965 2.45e-09 ***
>
> #### lm results, standard errors comparable to mgcv 1.7-17
> modelLM <- lm(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon + SpIsMale,
> data=wrddst)
> #                        Estimate Std. Error t value Pr(>|t|)
> #(Intercept)            -0.025779   0.001653 -15.595  < 2e-16 ***
> #SpYearBirth.z          -0.011906   0.001182 -10.070  < 2e-16 ***
> #IsAragon                0.139323   0.002603  53.531  < 2e-16 ***
> #SpIsMale               -0.003076   0.002140  -1.437    0.151
> #SpYearBirth.z:IsAragon  0.015252   0.002639   5.780 7.49e-09 ***
>
>
> #### mgcv 1.7-17, default (method = "fREML") - completely different
> from previous models
> modelBAMfREML <- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
> SpIsMale + s(Key,bs="re"), data=wrddst); print(testje,cor=F)
> #                        Estimate Std. Error t value Pr(>|t|)
> #(Intercept)            -0.025391   0.106897  -0.238    0.812
> #SpYearBirth.z          -0.012084   0.076300  -0.158    0.874
> #IsAragon                0.138959   0.166697   0.834    0.405
> #SpIsMale               -0.003087   0.138291  -0.022    0.982
> #SpYearBirth.z:IsAragon  0.015429   0.168260   0.092    0.927
> #
> #Approximate significance of smooth terms:
> #          edf Ref.df     F p-value
> #s(Key) -38.95    310 15.67  <2e-16 ***
>
>
> #### differences w.r.t. standard smooths
> #### mgcv version 1.7-13
> m2old <- bam(RefPMIdistMeanLog.c ~ s(GeoX,GeoY) +
> SpYearBirth.z*IsAragon + SpIsMale, data=wrddst, method="REML")
> ## RESULTS
> #Family: gaussian
> #Link function: identity
> #
> #Formula:
> #RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + SpYearBirth.z * IsAragon +
> #    SpIsMale
> #
> #Parametric coefficients:
> #                        Estimate Std. Error t value Pr(>|t|)
> #(Intercept)            -0.001386   0.004982  -0.278   0.7809
> #SpYearBirth.z          -0.012950   0.001167 -11.097  < 2e-16 ***
> #IsAragon                0.020532   0.023608   0.870   0.3845
> #SpIsMale               -0.004788   0.002219  -2.158   0.0309 *
> #SpYearBirth.z:IsAragon  0.015611   0.002600   6.005 1.92e-09 ***
> #---
> #Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> #
> #Approximate significance of smooth terms:
> #               edf Ref.df     F p-value
> #s(GeoX,GeoY) 27.11  28.14 126.2  <2e-16 ***
> #---
> #Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> #
> #R-sq.(adj) =  0.0555   Deviance explained = 5.58%
> #REML score =  39232  Scale est. = 0.11734   n = 112608
>
>
> #### mgcv version 1.7-17
> m2new <- bam(RefPMIdistMeanLog.c ~ s(GeoX,GeoY) +
> SpYearBirth.z*IsAragon + SpIsMale, data=wrddst, method="REML")
> #Family: gaussian
> #Link function: identity
> #
> #Formula:
> #RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + SpYearBirth.z * IsAragon +
> #    SpIsMale
> #
> #Parametric coefficients:
> #                        Estimate Std. Error t value Pr(>|t|)
> #(Intercept)            -0.001388   0.003938  -0.352   0.7245
> #SpYearBirth.z          -0.012950   0.001167 -11.098  < 2e-16 ***
> #IsAragon                0.020543   0.018055   1.138   0.2552
> #SpIsMale               -0.004788   0.002215  -2.161   0.0307 *
> #SpYearBirth.z:IsAragon  0.015611   0.002600   6.005 1.92e-09 ***
> #---
> #Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> #
> #Approximate significance of smooth terms:
> #               edf Ref.df     F p-value
> #s(GeoX,GeoY) 27.11  28.14 126.2  <2e-16 ***
> #---
> #Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> #
> #R-sq.(adj) =  0.0555   Deviance explained = 5.58%
> #REML score =  39232  Scale est. = 0.11734   n = 112608
>
>
> On Sat, Jun 2, 2012 at 6:25 PM, Martijn Wieling <wieling at gmail.com> wrote:
>> Dear useRs,
>>
>> I reran an analysis with bam (mgcv, version 1.7-17) originally
>> conducted using an older version of bam (mgcv, version 1.7-11) and
>> this resulted in the same estimates, but much lower standard errors
>> (in some cases 20 times as low) and lower p-values. This obviously
>> results in a larger set of significant predictors. Is this result
>> expected given the improvements in the new version? Or this a bug and
>> are the p-values of bam in mgcv 1.7-17 too low? The summaries of both
>> versions are shown below to enable a comparison.
>>
>> In addition, applying the default method="fREML" (mgcv version 1.7-17)
>> on the same dataset yields only non-significant results, while all
>> results are highly significant using method="REML". Furthermore, it
>> also results in large negative (e.g., -8757) edf values linked to
>> s(X,bs="RE") terms. Is this correct, or is this a bug? The summary of
>> the model using method="fREML" is also shown below.
>>
>> I hope someone can shed some light on this.
>>
>> With kind regards,
>> Martijn Wieling,
>> University of Groningen
>>
>> #################################
>> ### mgcv version 1.7-11
>> #################################
>>
>> Family: gaussian
>> Link function: identity
>>
>> Formula:
>> RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + RefVratio.z + IsSemiwordOrDemonstrative +
>>    RefSoundCnt.z + SpYearBirth.z * IsAragon + PopCntLog_residGeo.z +
>>    s(Word, bs = "re") + s(Key, bs = "re")
>>
>> Parametric coefficients:
>>                           Estimate Std. Error t value Pr(>|t|)
>> (Intercept)               -0.099757   0.020234  -4.930 8.23e-07 ***
>> RefVratio.z                0.105705   0.013328   7.931 2.19e-15 ***
>> IsSemiwordOrDemonstrative  0.289828   0.046413   6.245 4.27e-10 ***
>> RefSoundCnt.z              0.119981   0.021202   5.659 1.53e-08 ***
>> SpYearBirth.z             -0.011396   0.002407  -4.734 2.21e-06 ***
>> IsAragon                   0.055678   0.033137   1.680  0.09291 .
>> PopCntLog_residGeo.z      -0.006504   0.003279  -1.984  0.04731 *
>> SpYearBirth.z:IsAragon     0.015871   0.005459   2.907  0.00365 **
>> ---
>> Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>
>> Approximate significance of smooth terms:
>>                edf Ref.df      F p-value
>> s(GeoX,GeoY)  24.01  24.21  31.16  <2e-16 ***
>> s(Word)      352.29 347.00 501.57  <2e-16 ***
>> s(Key)       269.75 289.25  10.76  <2e-16 ***
>> ---
>> Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>
>> R-sq.(adj) =  0.693   Deviance explained = 69.4%
>> REML score = -22476  Scale est. = 0.038177  n = 112608
>>
>>
>> #################################
>> ### mgcv version 1.7-17, much lower p-values and standard errors than
>> version 1.7-11
>> #################################
>>
>> Family: gaussian
>> Link function: identity
>>
>> Formula:
>> RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + RefVratio.z + IsSemiwordOrDemonstrative +
>>    RefSoundCnt.z + SpYearBirth.z * IsAragon + PopCntLog_residGeo.z +
>>    s(Word, bs = "re") + s(Key, bs = "re")
>>
>> Parametric coefficients:
>>                            Estimate Std. Error t value Pr(>|t|)
>> (Intercept)               -0.0997566  0.0014139 -70.552  < 2e-16 ***
>> RefVratio.z                0.1057049  0.0006565 161.010  < 2e-16 ***
>> IsSemiwordOrDemonstrative  0.2898285  0.0020388 142.155  < 2e-16 ***
>> RefSoundCnt.z              0.1199813  0.0009381 127.905  < 2e-16 ***
>> SpYearBirth.z             -0.0113956  0.0006508 -17.509  < 2e-16 ***
>> IsAragon                   0.0556777  0.0057143   9.744  < 2e-16 ***
>> PopCntLog_residGeo.z      -0.0065037  0.0007938  -8.193 2.58e-16 ***
>> SpYearBirth.z:IsAragon     0.0158712  0.0014829  10.703  < 2e-16 ***
>> ---
>> Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>
>> Approximate significance of smooth terms:
>>                edf Ref.df       F p-value
>> s(GeoX,GeoY)  24.01  24.21   31.15  <2e-16 ***
>> s(Word)      352.29 347.00  587.26  <2e-16 ***
>> s(Key)       269.75 313.00 4246.76  <2e-16 ***
>> ---
>> Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>
>> R-sq.(adj) =  0.693   Deviance explained = 69.4%
>> REML score = -22476  Scale est. = 0.038177  n = 112608
>>
>>
>> #################################
>> ### mgcv version 1.7-17, default: method="fREML", all p-values
>> non-significant and negative edf's of s(X,bs="re")
>> #################################
>>
>> Family: gaussian
>> Link function: identity
>>
>> Formula:
>> RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + RefVratio.z + IsSemiwordOrDemonstrative +
>>    RefSoundCnt.z + SpYearBirth.z * IsAragon + PopCntLog_residGeo.z +
>>    s(Word, bs = "re") + s(Key, bs = "re")
>>
>> Parametric coefficients:
>>                           Estimate Std. Error t value Pr(>|t|)
>> (Intercept)               -0.099757   1.730235  -0.058    0.954
>> RefVratio.z                0.105705   1.145329   0.092    0.926
>> IsSemiwordOrDemonstrative  0.289828   4.167237   0.070    0.945
>> RefSoundCnt.z              0.119981   1.901158   0.063    0.950
>> SpYearBirth.z             -0.011396   0.034236  -0.333    0.739
>> IsAragon                   0.055678   0.298629   0.186    0.852
>> PopCntLog_residGeo.z      -0.006504   0.041981  -0.155    0.877
>> SpYearBirth.z:IsAragon     0.015871   0.077142   0.206    0.837
>>
>> Approximate significance of smooth terms:
>>               edf Ref.df       F p-value
>> s(GeoX,GeoY) -1376      1   7.823 0.00516 **
>> s(Word)      -8298    347 577.910 < 2e-16 ***
>> s(Key)       -1421    316  13.512 < 2e-16 ***
>> ---
>> Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
>>
>> R-sq.(adj) =  0.741   Deviance explained = 69.4%
>> fREML score = -22476  Scale est. = 0.038177  n = 112608