[R] mgcv: how select significant predictor vars when using gam(...select=TRUE) using automatic optimization

Thu Apr 25 11:45:02 CEST 2013

Juliet,

here are the diagnostic plots you asked for.

To recap, the first model was this:

     fit<-gam(target~s(mgs)+s(gsd)+s(mud)+s(ssCmax),family=quasi(link=log),
              data=wspe1,method="REML",select=FALSE)
     > summary(fit)   

     Parametric coefficients:
                 Estimate Std. Error t value Pr(>|t|)
     (Intercept)   -4.724      7.462  -0.633    0.527
     Approximate significance of smooth terms:
                 edf Ref.df      F p-value
     s(mgs)    3.118  3.492  0.099   0.974    
     s(gsd)    6.377  7.044 15.596  <2e-16 ***
     s(mud)    8.837  8.971 18.832  <2e-16 ***
     s(ssCmax) 3.886  4.051  2.342   0.052 .  
     ---
     R-sq.(adj) =  0.403   Deviance explained = 40.6%
     REML score =  33186  Scale est. = 8.7812e+05  n = 4511

(I slightly shortened the output)

Also of interest is the model error, expressed as root mean squared error (RMSE)
on the response scale:

     > sqrt(mean(residuals.gam(fit,type="response")^2))
     [1] 934.6647
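
For reference, the RMSE above is the square root of the mean squared response residual (observed minus fitted). A minimal base R sketch of the same calculation follows; the helper `rmse` and the toy vectors `obs` and `pred` are hypothetical and are not taken from the wspe1 data.

```r
# Hypothetical helper (not part of mgcv): RMSE on the response scale.
rmse <- function(obs, pred) {
  stopifnot(length(obs) == length(pred))
  sqrt(mean((obs - pred)^2))
}

# Toy values for illustration only
obs  <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 41)
rmse(obs, pred)
```

For a fitted gam, `rmse(wspe1$target, fitted(fit))` should give the same number as the call above, assuming no rows were dropped because of missing values.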

Here are the diagnostic plots:

<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-1.png> 

<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-2.png> 

Here is Simon's comment on this particular model from Apr 18, 2013; 5:25pm (see
above):

"The p-value computations are based on 
the approximation that things are approximately normal on the linear 
predictor scale, but actually they are no where close to normal in this 
case, which is why the p-values look inconsistent. The reason that the 
approximate normality assumption doesn't hold is that the model is quite 
a poor fit. If you take a look at gam.check(fit) you'll see that the 
constant variance assumption of quasi(link=log) is violated quite badly, 
and the residual distribution is really quite odd (plot residuals 
against fitted as well). Also see plot(fit,pages=1,scale=0) - it shows 
ballooning confidence intervals and smooth estimates that are so low in 
places that they might as well be minus infinity (given log link) - 
clearly something is wrong with this model! "
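
The checks named in the quote can be sketched as follows. This is only an illustration on simulated data: the data frame `dat`, the variables `x1`, `x2`, `y`, the Gamma family and the object name `fit_demo` are all assumptions made for the example, not the wspe1 model.

```r
library(mgcv)

# Simulated stand-in data (assumption): positive response whose
# variance grows with the mean.
set.seed(1)
n   <- 400
x1  <- runif(n)
x2  <- runif(n)
mu  <- exp(1 + sin(2 * pi * x1) + x2)
y   <- rgamma(n, shape = 2, rate = 2 / mu)
dat <- data.frame(y, x1, x2)

fit_demo <- gam(y ~ s(x1) + s(x2), family = Gamma(link = "log"),
                data = dat, method = "REML")

# Residual QQ plot, residuals vs. linear predictor, residual histogram,
# response vs. fitted values, plus basis dimension checks
gam.check(fit_demo)

# Residuals against fitted values, to look at the variance assumption
plot(fitted(fit_demo), residuals(fit_demo, type = "deviance"),
     xlab = "fitted values", ylab = "deviance residuals")

# All smooths on one page, each with its own y-axis scale
plot(fit_demo, pages = 1, scale = 0)
```

With `scale = 0` each smooth gets its own y-axis, so ballooning confidence intervals in a single term are not hidden by the range of the others.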

Following Simon's advice (quote):
"try Tweedie(p=1.5,link=log) as the family. Also the predictor 
variables are very skewed which is giving leverage problems, so I would 
transform them to give less skew. e.g. Something like "

     fit<-gam(target~s(log(mgs))+s(I(gsd^.5))+s(I(mud^.25))+s(log(ssCmax)),
              family=Tweedie(p=1.6,link=log),data=wspe1,method="REML")
     > summary(fit)

     Parametric coefficients:
                 Estimate Std. Error t value Pr(>|t|)
     (Intercept)  4.02654    0.05231   76.97   <2e-16 ***
     Approximate significance of smooth terms:
                      edf Ref.df     F p-value
     s(log(mgs))    6.067  7.292 12.58  <2e-16 ***
     s(I(gsd^0.5))  4.009  5.138 18.25  <2e-16 ***
     s(I(mud^0.25)) 7.210  8.240 58.54  <2e-16 ***
     s(log(ssCmax)) 8.407  8.764 74.87  <2e-16 ***
     R-sq.(adj) =  0.303   Deviance explained =   51%
     REML score =  14355  Scale est. = 27.702    n = 4511

(I slightly shortened the output)

RMSE did not improve (it rose from about 935 to about 1009):
     > sqrt(mean(residuals.gam(fit,type="response")^2))
     [1] 1009.268

The diagnostic plots follow:

<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-3.png> 

<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-4.png> 

These look much better.
The QQ-plot is closer to the identity line, and
the residuals are more evenly spread and much smaller.
Still, the correlation between the response and the fitted values seems pretty low.
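
One way to put a number on that last point is to compute the correlation between observed and fitted values directly. The sketch below uses simulated Tweedie data; the data frame `dat`, the variables `x` and `y`, the dispersion `phi = 2` and the object name `fit_tw` are assumptions for illustration, not the wspe1 fit.

```r
library(mgcv)

# Simulated stand-in data (assumption): Tweedie response with a log-linked mean
set.seed(2)
n   <- 500
x   <- runif(n)
mu  <- exp(1 + 2 * sin(2 * pi * x))
y   <- rTweedie(mu, p = 1.6, phi = 2)
dat <- data.frame(y, x)

fit_tw <- gam(y ~ s(x), family = Tweedie(p = 1.6, link = "log"),
              data = dat, method = "REML")

# Pearson correlation between observed response and fitted means
r <- cor(dat$y, fitted(fit_tw))
r

# Observed against fitted, with the identity line for reference
plot(fitted(fit_tw), dat$y, xlab = "fitted values", ylab = "observed response")
abline(0, 1)
```

For the real model the equivalent call would be `cor(wspe1$target, fitted(fit))`, again assuming no rows were dropped because of missing values.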

Hope this helps,

Jan

--
View this message in context: http://r.789695.n4.nabble.com/mgcv-how-select-significant-predictor-vars-when-using-gam-select-TRUE-using-automatic-optimization-tp4664510p4665370.html
Sent from the R help mailing list archive at Nabble.com.