[R] How to avoid overfitting in gam(mgcv)

Frank E Harrell Jr f.harrell at vanderbilt.edu
Wed Oct 3 15:26:04 CEST 2007


Ariyo Kanno wrote:
> Sorry, let me correct one sentence.
> 
> "By "overfitting" I mean here that the GCV score was significantly SMALLER
> than the mean squared prediction error on the validation data, which
> were randomly selected and not used for fitting."
> 
>> Thank you for the valuable advice.

If your test sample includes fewer than 10,000 cases and your
signal-to-noise ratio is not large, your estimate of validation accuracy
may be unreliable.  Often 50 repeats of 10-fold cross-validation are
required, with no single "test" sample set aside.
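
As a rough illustration of repeated 10-fold cross-validation for a model of
the form used in this thread, here is a minimal sketch. The data are
simulated stand-ins (not the poster's data), and the helper name
repeated_cv_mse and its arguments are hypothetical.

```r
library(mgcv)

set.seed(1)
# Simulated data standing in for y, x1, x2 (an assumption for illustration)
n   <- 100
dat <- data.frame(x1 = rnorm(n), x2 = runif(n))
dat$y <- 0.5 * dat$x1 + sin(2 * pi * dat$x2) + rnorm(n, sd = 0.5)

# Hypothetical helper: one mean squared prediction error per repeat of k-fold CV
repeated_cv_mse <- function(data, repeats = 50, k = 10, gamma = 1) {
  mse <- numeric(repeats)
  for (r in seq_len(repeats)) {
    # Fresh random assignment of observations to folds for every repeat
    fold   <- sample(rep(seq_len(k), length.out = nrow(data)))
    sq_err <- numeric(nrow(data))
    for (j in seq_len(k)) {
      held_out <- fold == j
      # Refit (including smoothing parameter selection) without fold j
      fit  <- gam(y ~ x1 + s(x2), data = data[!held_out, ], gamma = gamma)
      pred <- predict(fit, newdata = data[held_out, ])
      sq_err[held_out] <- (data$y[held_out] - pred)^2
    }
    mse[r] <- mean(sq_err)
  }
  mse
}

cv_mse <- repeated_cv_mse(dat)
mean(cv_mse)  # cross-validated mean squared prediction error
sd(cv_mse)    # variation caused by the random fold assignment alone

# GCV score of the fit to all the data, for comparison with mean(cv_mse)
gam(y ~ x1 + s(x2), data = dat)$gcv.ubre
```

Every observation is predicted exactly once per repeat by a model that did
not see it, so no data are permanently held out of the fitting.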

Frank

>> I'm sorry, Dr. Wood: by mistake I first sent this reply to
>> your personal e-mail address.
>>
>> I will use the "min.sp" argument when the data size is very small. I'd
>> like to know whether there are any criteria for selecting "min.sp".
>>
>> I compared gamma=1.0 and 1.4, and I could see the smoothing effect of
>> increasing gamma by comparing the edf and the smoothing parameter. But it
>> was not enough to suppress the overfitting when the data size was small.
>>
>> By "overfitting" I mean here that the GCV score was significantly larger
>> than the mean squared prediction error on the validation data, which
>> were randomly selected and not used for fitting.
>>
>> Best Wishes,
>> Ariyo
>>
>> 2007/10/3, Simon Wood <s.wood at bath.ac.uk>:
>>> On Wednesday 03 October 2007 10:49, Ariyo Kanno wrote:
>>>> I appreciate your quick reply.
>>>> I am using a model with the following structure:
>>>>
>>>> fit <- gam(y ~ x1 + s(x2))
>>>>
>>>> where y, x1, and x2 are quantitative variables, so the response
>>>> distribution is assumed to be gaussian (the default).
>>>>
>>>> Now I understand that the data size was too small.
>>> -- Well, the 10-observation end of that range is definitely too small,
>>> but you can get quite reasonable estimates of a single smoothing
>>> parameter from 30 or more gaussian observations.
>>> -- You can force smoother models by either setting the smoothing parameter
>>> yourself using the `sp' argument to `gam', or by using the `min.sp' argument
>>> to set a lower bound on the smoothing parameter.
>>> -- I'm surprised that `gamma' had no effect - how high did you try?
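
For reference, the three controls mentioned above (`gamma', `sp' and
`min.sp') can be compared on one small data set roughly as follows. This is
only a sketch: the data are simulated, and the values 1.4, 10 and 100 are
arbitrary choices for illustration, not recommendations.

```r
library(mgcv)

set.seed(2)
# Small simulated data set (an assumption; 30 gaussian observations)
n   <- 30
dat <- data.frame(x1 = rnorm(n), x2 = runif(n))
dat$y <- 0.5 * dat$x1 + sin(2 * pi * dat$x2) + rnorm(n, sd = 0.5)

# Default GCV smoothing parameter selection
fit_default <- gam(y ~ x1 + s(x2), data = dat)
# Each effective degree of freedom counts as 1.4 in the GCV score
fit_gamma   <- gam(y ~ x1 + s(x2), data = dat, gamma = 1.4)
# Lower bound on the smoothing parameter (one value per penalty)
fit_minsp   <- gam(y ~ x1 + s(x2), data = dat, min.sp = 10)
# Smoothing parameter fixed by the user, no estimation
fit_sp      <- gam(y ~ x1 + s(x2), data = dat, sp = 100)

# Total effective degrees of freedom of each fit (intercept and x1 included)
edf_total <- sapply(
  list(default = fit_default, gamma = fit_gamma,
       min.sp = fit_minsp, sp = fit_sp),
  function(f) sum(f$edf)
)
edf_total
```

Comparing the edf values shows how strongly each setting smooths relative to
the default fit.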
>>>
>>> best,
>>> Simon
>>>
>>>
>>>
>>>> Thank you.
>>>>
>>>> Best Wishes,
>>>>
>>>> Ariyo
>>>>
>>>> 2007/10/3, Simon Wood <s.wood at bath.ac.uk>:
>>>>> What sort of model structure are you using? In particular, what is the
>>>>> response distribution? For poisson and binomial responses, overfitting
>>>>> can be a sign of overdispersion, and quasipoisson or quasibinomial may
>>>>> be better. Also, I would not expect to get useful smoothing parameter
>>>>> estimates from 10 observations!
>>>>>
>>>>> best,
>>>>> Simon
>>>>>
>>>>> On Wednesday 03 October 2007 06:55, Ariyo Kanno wrote:
>>>>>> Dear listers,
>>>>>>
>>>>>> I'm using gam (from mgcv) for semi-parametric regression on small and
>>>>>> noisy datasets (10 to 200 observations), and I am facing a problem of
>>>>>> overfitting.
>>>>>>
>>>>>> The book (Simon N. Wood, Generalized Additive Models: An Introduction
>>>>>> with R) suggests avoiding overfitting by inflating the effective
>>>>>> degrees of freedom in the GCV evaluation with an increased "gamma"
>>>>>> value (e.g. 1.4). But in my case, it didn't make a significant change
>>>>>> in the results.
>>>>>>
>>>>>> The only way I've found to suppress overfitting is to set the basis
>>>>>> dimension "k" to very low values (3 to 5). However, I don't think this
>>>>>> is reasonable, because knot selection then becomes an important issue.
>>>>>>
>>>>>> Is there any other means of avoiding overfitting when analyzing small
>>>>>> datasets?
>>>>>>
>>>>>> Thank you for your help in advance,
>>>>>> Ariyo Kanno
>>>>>>
>>>>>> --
>>>>>> Ariyo Kanno
>>>>>> 1st-year doctor's degree student at
>>>>>> Institute of Environmental Studies,
>>>>>> The University of Tokyo
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html and provide commented,
>>>>>> minimal, self-contained, reproducible code.
>>>>> --
>>>>>
>>>>>> Simon Wood, Mathematical Sciences, University of Bath, Bath, BA2 7AY UK
>>>>>> +44 1225 386603  www.maths.bath.ac.uk/~sw283


-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University
