[R] model simplification using Crawley as a guide
Lucke, Joseph F
Joseph.F.Lucke at uth.tmc.edu
Wed Jun 11 17:39:19 CEST 2008
And to follow FH and HW
What level of significance are you using? .05 is excessively liberal.
Are you adjusting your p-values for the number of possible models? Do
you realize the p-values for dropping a term, being selected as the
maximum of a set of p-values, do not follow their usual distributions?
How are you compensating for sample size, given that whether a
p-value reaches significance depends on sample size? How are you
compensating for
the fact that the current model choice is dependent on the previous
model choices? How do you know your tree of model choices is the optimal
one? Have you considered cross-validation? Are you looking for a model
that truly describes a phenomenon, or a predictive model that can be
used for practical purposes?
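Lucke's point about the selected p-value is easy to check with a small simulation (illustrative only; the data and variable names are invented, not from the thread). Under a null model, the p-value of the term you would drop, being the largest of several p-values, is far from Uniform(0, 1):

```r
# Illustrative simulation: under a null model with two noise
# covariates, the largest of the two coefficient p-values (the one a
# stepwise deletion would act on) is not uniformly distributed.
set.seed(1)
max_p <- replicate(2000, {
  d <- data.frame(y = rnorm(40), B = rnorm(40), C = rnorm(40))
  fit <- lm(y ~ B + C, data = d)
  max(summary(fit)$coefficients[-1, "Pr(>|t|)"])
})
mean(max_p > 0.5)  # well above the 0.5 a uniform p-value would give
```

With two nearly independent null tests the maximum exceeds 0.5 roughly three times out of four, so treating it as an ordinary p-value misreads the evidence.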
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On Behalf Of hadley wickham
Sent: Wednesday, June 11, 2008 9:34 AM
To: Frank E Harrell Jr
Cc: r-help at r-project.org; ChCh
Subject: Re: [R] model simplification using Crawley as a guide
On Wed, Jun 11, 2008 at 6:42 AM, Frank E Harrell Jr
<f.harrell at vanderbilt.edu> wrote:
> ChCh wrote:
>> I have consciously avoided using step() for model simplification in
>> favour of manually updating the model by removing non-significant
>> terms one at a time. I'm using The R Book by M.J. Crawley as a
>> guide. It comes as no surprise that my analysis does not proceed as
>> smoothly as does Crawley's and, being a beginner, I'm struggling
>> with what to do next.
>> I have a model:
>> lm(y ~ A * B * C)
>> where A is a categorical variable with three levels and B and C are
>> continuous covariates.
>> Following Crawley, I fit the model, then use summary.aov() to
>> identify non-significant terms. I begin deleting non-significant
>> interaction terms one at a time (using update). After each update()
>> statement, I use
>> anova(modelOld,modelNew) to contrast the previous model with the
>> updated one. After removing all the interaction terms, I'm left
>> with:
>> lm(y ~ A + B + C)
>> again, using summary.aov() I identify A to be non-significant, so I
>> remove it, leaving:
>> lm(y ~ B + C), where B and C are both continuous variables.
>> Does it still make sense to use summary.aov() or should I use
>> summary.lm() instead? Has the analysis switched from an ANCOVA to a
>> regression? Both give different results, so I'm uncertain which
>> summary to accept.
>> Any help would be appreciated!
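The difference the poster is seeing can be reproduced on made-up data (the variable names follow the post; the data are invented for illustration). summary.aov() reports sequential (Type I) F tests, one line per term, while summary() on an lm fit reports t tests on individual coefficients, so a three-level factor A appears as a single F test in the former but as two contrast t tests in the latter:

```r
# Invented data in the shape the poster describes: A a 3-level factor,
# B and C continuous covariates.
set.seed(42)
d <- data.frame(
  A = factor(rep(c("a1", "a2", "a3"), each = 20)),
  B = rnorm(60),
  C = rnorm(60)
)
d$y <- 1 + 0.8 * d$B + 0.4 * d$C + rnorm(60)

fit <- lm(y ~ A + B + C, data = d)

summary.aov(fit)  # sequential (Type I) F tests: one line per term,
                  # each adjusted only for the terms listed before it
summary(fit)      # t tests on individual coefficients; the two A
                  # lines test contrasts against the baseline level,
                  # not "the A effect" as a whole

# To ask whether A as a whole can be dropped, compare nested models:
anova(update(fit, . ~ . - A), fit)
```

Neither table is wrong; they answer different questions. For dropping a multi-level factor, the nested-model anova() comparison at the end is the direct answer.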
> What is the theoretical basis for removing insignificant terms? How
> will you compensate for this in the final analysis (e.g., how do you
> unbias your estimate of sigma squared)?
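Harrell's sigma-squared question can be illustrated with a sketch (made-up null data; step() stands in for any data-driven deletion): after choosing a model on the same data, the residual variance estimate of the chosen model tends to be biased downward.

```r
# Illustrative simulation: select a model by backward AIC on pure
# noise, then look at the residual variance estimate of the chosen
# model. The true error variance is 1.
set.seed(2)
sig2 <- replicate(500, {
  X <- matrix(rnorm(30 * 8), nrow = 30,
              dimnames = list(NULL, paste0("x", 1:8)))
  d <- data.frame(y = rnorm(30), X)
  chosen <- step(lm(y ~ ., data = d),
                 direction = "backward", trace = 0)
  summary(chosen)$sigma^2
})
mean(sig2)  # typically below the true value of 1
</code>
```

Whenever a spurious predictor survives the selection, it has absorbed noise, so the final model understates sigma squared; nothing in the printed summary compensates for this.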
And in a similar vein, where are your exploratory graphics? How do you
know that there is a linear relationship between your response and your
predictors? Are the distributional assumptions you are making
reasonable?
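A few base-graphics checks along those lines, sketched on invented data in the shape of the original post (all names illustrative):

```r
# Exploratory looks before any term deletion (illustrative data):
set.seed(3)
d <- data.frame(y = rnorm(60),
                A = gl(3, 20, labels = c("a1", "a2", "a3")),
                B = rnorm(60), C = rnorm(60))
pairs(d[c("y", "B", "C")])   # roughly linear in B and in C?
plot(y ~ A, data = d)        # boxplots of y by level of A
fit <- lm(y ~ A + B + C, data = d)
plot(fit, which = 1:2)       # residuals vs fitted; normal Q-Q plot
```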