[R] GAM: Overfitting

Simon Wood simon at stats.gla.ac.uk
Wed Dec 22 12:08:17 CET 2004


> I am analyzing particulate matter data (PM10) on a small data set (147
> observations).  I fitted a semi-parametric model and am worried about
> overfitting.  How can one check for model fit in GAM?

- Keeping a random subset of the data as a validation set, fitting 
to the remaining data, and then comparing the R^2 (or proportion of deviance 
explained) on the fit set and on the validation set is usually quite 
diagnostic. If the fit data are much better predicted than the validation 
data, then you probably have over-fitting. 

- If your response is treated as Poisson, then scale parameter estimates 
<<1 (residual variability well below what a Poisson model implies) are also 
diagnostic, but only if you are not expecting overdispersion, of course. 

- If you use gam from package mgcv then, by default, model 
effective degrees of freedom are estimated from your data by GCV or an 
approximation to AIC. mgcv::gam allows you to increase the penalty on each 
model degree of freedom in these criteria, via the gam argument `gamma'. Some 
work by Kim and Gu (2004, J.Roy.Statist.Soc.B) suggests that gamma around 
1.4 can be a sensible choice for suppressing overfitting, without 
much of a degradation in MSE performance.
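
A rough sketch of the three checks above, not a definitive recipe. It uses 
simulated data as a stand-in for the PM10 set. The variable names (x1, x2, y, 
count) and the helper r2() are made up for illustration, and method = "GCV.Cp" 
is set explicitly as an assumption so that gamma acts on the GCV score:

```r
library(mgcv)  # recommended package, shipped with standard R installs

set.seed(1)

# Simulated stand-in data (hypothetical): 147 observations, one covariate
# with a real smooth effect (x1) and one that is pure noise (x2).
n   <- 147
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- sin(2 * pi * dat$x1) + rnorm(n, sd = 0.5)

# Hold out a random third of the rows as a validation set.
val_idx <- sample(n, size = round(n / 3))
fit_dat <- dat[-val_idx, ]
val_dat <- dat[val_idx, ]

# R^2 of predictions, measured against the mean of the observed values.
r2 <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Fit once with the default penalty (gamma = 1) and once with gamma = 1.4,
# then compare R^2 on the fit set and on the validation set.
compare <- sapply(c(default = 1, heavier = 1.4), function(g) {
  m <- gam(y ~ s(x1) + s(x2), data = fit_dat, gamma = g, method = "GCV.Cp")
  c(edf    = sum(m$edf),
    r2_fit = r2(fit_dat$y, fitted(m)),
    r2_val = r2(val_dat$y, as.numeric(predict(m, newdata = val_dat))))
})
print(round(compare, 3))

# Poisson check: fit with quasipoisson so that the scale parameter is
# estimated, then see whether the estimate is far below 1.
dat$count <- rpois(n, lambda = exp(1 + sin(2 * pi * dat$x1)))
mp <- gam(count ~ s(x1) + s(x2), family = quasipoisson, data = dat,
          method = "GCV.Cp")
scale_hat <- summary(mp)$scale
print(scale_hat)
```

If r2_fit comes out much larger than r2_val, or the scale estimate is far 
below 1, that points towards over-fitting. With only 147 observations the 
validation R^2 is itself noisy, so it is worth repeating the split with a 
few different seeds before drawing conclusions.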
 

best,
Simon



