[R] p-level in packages mgcv and gam

Wed Sep 28 17:33:27 CEST 2005

On Wed, 28 Sep 2005, Denis Chabot wrote:

> I only got one reply to my message:
>
>> No, this won't work.  The problem is the usual one with model
>> selection: the p-value is calculated as if the df had been fixed,
>> when really it was estimated.
>>
>> It is likely to be quite hard to get an honest p-value out of
>> something that does adaptive smoothing.
>>
>>     -thomas
>
> I do not understand this: it seems that a lot of people chose df=4
> for no particular reason, but p-levels are correct. If instead I
> choose df=8 because a previous model has estimated this to be an
> optimal df, P-levels are no good because df are estimated?

Yes. I know this sounds strange initially, but it really does make sense 
if you think about it carefully.

Suppose that Alice and Bob are kyphosis researchers, and that Alice always 
chooses 4df for smoothing Age.  We would all agree that her p-values are 
correct [in fact we wouldn't, but that is a separate issue]

Bob, on the other hand, chooses the amount of smoothing depending on the 
data. When a 4 df smooth fits best he ends up with the same model as Alice 
and the same p-value.  When some other df fits best he ends up with a 
different model and a *smaller* p-value than Alice.

In particular, this is still true under the null hypothesis that Age has 
no effect [If Alice and Bob are interested in p-values, the null 
hypothesis must be plausible.]

If Bob's p-values are always less than or equal to Alice's p-values under 
the null hypothesis, and Alice's p-values are less than 0.05 5% of the 
time, then Bob's p-values are less than 0.05 more than 5% of the time.

 	-thomas

> Furthermore, shouldn't packages gam and mgcv give similar results
> when the same data and df are used? I tried this:
>
> library(gam)
> data(kyphosis)
> kyp1 <- gam(Kyphosis ~ s(Age, 4), family=binomial, data=kyphosis)
> kyp2 <- gam(Kyphosis ~ s(Number, 4), family=binomial, data=kyphosis)
> kyp3 <- gam(Kyphosis ~ s(Start, 4), family=binomial, data=kyphosis)
> anova.gam(kyp1)
> anova.gam(kyp2)
> anova.gam(kyp3)
>
> detach(package:gam)
> library(mgcv)
> kyp4 <- gam(Kyphosis ~ s(Age, k=4, fx=T),  family=binomial,
> data=kyphosis)
> kyp5 <- gam(Kyphosis ~ s(Number, k=4, fx=T),  family=binomial,
> data=kyphosis)
> kyp6 <- gam(Kyphosis ~ s(Start, k=4, fx=T),  family=binomial,
> data=kyphosis)
> anova.gam(kyp4)
> anova.gam(kyp5)
> anova.gam(kyp6)
>
>
> P levels for these models, by pair
>
> kyp1 vs kyp4: p= 0.083 and 0.068 respectively (not too bad)
> kyp2 vs kyp5: p= 0.445 and 0.03 (wow!)
> kyp3 vs kyp6: p= 0.053 and 0.008 (wow again)
>
> Also if you plot all these you find that the mgcv plots are smoother
> than the gam plots, even the same df are used all the time.
>
> I am really confused now!
>
> Denis
>
> Début du message réexpédié :
>
>>> De : Denis Chabot <chabotd at globetrotter.net>
>>> Date : 26 septembre 2005 12:25:04 HAE
>>> À : r-help at stat.math.ethz.ch
>>> Objet : p-level in packages mgcv and gam
>>>
>>>
>>> Hi,
>>>
>>> I am fairly new to GAM and started using package mgcv. I like the
>>> fact that optimal smoothing is automatically used (i.e. df are not
>>> determined a priori but calculated by the gam procedure).
>>>
>>> But the mgcv manual warns that p-level for the smooth can be
>>> underestimated when df are estimated by the model. Most of the
>>> time my p-levels are so small that even doubling them would not
>>> result in a value close to the P=0.05 threshold, but I have one
>>> case with P=0.033.
>>>
>>> I thought, probably naively, that running a second model with
>>> fixed df, using the value of df found in the first model. I could
>>> not achieve this with mgcv: its gam function does not seem to
>>> accept fractional values of df (in my case 8.377).
>>>
>>> So I used the gam package and fixed df to 8.377. The P-value I
>>> obtained was slightly larger than with mgcv (0.03655 instead of
>>> 0.03328), but it is still < 0.05.
>>>
>>> Was this a correct way to get around the "underestimated P-level"?
>>>
>>> Furthermore, although the gam.check function of the mgcv package
>>> suggests to me that the gaussian family (and identity link) are
>>> adequate for my data, I must say the instructions in R help for
>>> "family" and in Hastie, T. and Tibshirani, R. (1990) Generalized
>>> Additive Models are too technical for me. If someone knows a
>>> reference that explains how to choose model and link, i.e. what
>>> tests to run on your data before choosing, I would really
>>> appreciate it.
>>>
>>> Thanks in advance,
>>>
>>> Denis Chabot
>>>
>>
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle