[R] stepAIC and polynomial terms

Bill.Venables at csiro.au Bill.Venables at csiro.au
Mon Mar 17 05:01:41 CET 2008


There is absolutely no reason to remove age altogether.  Notice that
typically age and age^2 are highly correlated. To see this, consider 100
people with mean age 35 and 95% tolerance limite between 15 and 55:

> age <- rnorm(100, 35, 10)
> cor(age, age^2)
[1] 0.9898186

So if you use raw age and I(age^2) as predictors, it's really just the
luck of the draw which gets selected (usually), and they will do much
the same job when it comes to prediction, of course.  

So what are McCullagh and Nelder on about?  One way to look at it is as
a policy issue.  In a mathematical sense you would think that whether
you used age as the predictor or (age - 35) ("years away from mid-life")
should not make any difference, and if in you model selection procedure
it does make a difference, then something arbitrary is going on, and any
arbitrariness in this game is often a precursor of trouble to come.
Consider the correlations again:

> cor(age-35, (age-35)^2)
[1] -0.1302315

One way to *encourage* the linear term to be chosen ahead of the
quadratic term is, in fact, to mean correct the predictor:

sAge <- age - mean(age)

and use sAge and I(sAge^2) as your predictors.  I expect this will
favour the linear term over the quadratic and  you will be led to a
model that has no quadratic term, even if, in a strictly mathematical
sense, the starting models were entirely equivalent.  (Beware if you do
this, though, you make things difficult when it comes to prediction.)

You draw attention to a bit of a gap in the software, in my view.  In
variable selection with functions line stepAIC you would like to be able
to specify a set of marginality constraints (to use the McCullagh and
Nelder term) that you would like the model sifting process to respect,
in order to ensure invariance with respect to groups of transformations
that are natural to the problem.  In this case you would like to declare
that 1 is marginal to age which in turn is marginal to (age^2), to
ensure invariance with respect to the action of the location and scale
group, as seems natural.  Why should changing the origin and unit of
measurement have any consequences for the model selection process?
Notice that in the case of factors and interactions this happens
already: main effect terms will not be dropped if interactions involving
them are still present.  It's a similar argument.  The same feature,
ideally, should be available for other cases where marginality issues
are at stake, but doing that seems to be a tricky problem.  Using it
might be trickier still.  People would have to think about group
invariance properties and that's foreign to most people...

Bill Venables.



-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On Behalf Of caspar
Sent: Monday, 17 March 2008 11:50 AM
To: r-help at r-project.org
Subject: [R] stepAIC and polynomial terms

Dear all,
I have a question regarding the use of stepAIC and polynomial (quadratic
to be specific) terms in a binary logistic regression model. I read in
McCullagh and Nelder, (1989, p 89) and as far as I remember from my
statistics cources, higher-degree polynomial effects should not be
included without the main effects. If I understand this correctly,
following a stepwise model selection based on AIC should not lead to a
model where the main effect of some continuous covariate is removed, but
the quadratic term is kept.  
The question is, should I keep the quadratic term (note, there are other
main effects that were retained following the stepwise algorithm) in the
final model or should I delete it as well and move on? Or should I
retain the main effect as well? 

To picture it, the initial model to which I called stepAIC is:

Call:  glm(formula = S ~ FR + Date * age + I(age^2), family =
logexposure(ExposureDays = DATA$int),      data = DATA)

and the final one:

Call:  glm(formula = S ~ FR + Date + I(age^2), family =
logexposure(ExposureDays = DATA$int),      data = DATA) 

Thanks very much in advance for your thoughts and suggestions,

Caspar

Caspar Hallmann 
MSc Student WUR 
The Netherlands  
	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list