[R] Comparing models in multiple regression and hierarchical linear regression

Spencer Graves spencer.graves at pdf.com
Wed Nov 8 17:54:33 CET 2006


      The question you ask about the interactions in the model not 
making sense relates, I believe, to a multiple comparisons issue that 
is not adequately addressed by the stepAIC analysis you did.  To 
understand this, note first that you've got something close to 
2^(2^5) possible models:  With 5 main effects, there are 2^5 = 32 
possible subsets of those effects, i.e., 32 possible "interaction" 
terms (counting the intercept and the main effects themselves).  The 
function stepAIC tries to select from among models representing all 
possible subsets of these 2^5 "interaction" terms.  Just for clarity 
on this point, suppose we had 2 main effects, A and B, not 5 as you 
have.  Then there are 2^2 = 4 possible "interaction" terms:  1, A, B, 
and A:B.  Then stepAIC would want to select the best model from among 
the 2^4 = 16 models formed from all possible subsets of these 4 
terms.  With 5 main effects, this set of possible models balloons to 
2^32, about 4.3e9.  Some of them don't make sense by themselves, 
e.g., A:B without A and B.  However, I don't think that will reduce 
this 4.3e9 by more than an order of magnitude or so -- and maybe only 
by a few percent.  If we ignore this reduction, then the Bonferroni 
correction suggests we multiply all your p values by 4.3e9.  If you 
still have any that are less than, say, 0.05, then you could believe 
those.  Otherwise, I'd be skeptical.  A better way of handling this, 
I believe, is Bayesian Model Averaging.  The R package 'BMA' is 
supposed to address that, but I don't know if it will help you. 
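
      For what it's worth, here is a minimal, untested sketch of what 
that might look like, assuming your data frame is 'lafrance.NoNA', as 
in your stepAIC call.  bicreg() in 'BMA' averages over subsets of the 
columns you hand it (main effects only, here) and reports a posterior 
probability that each term belongs in the model:

    library(BMA)
    x <- lafrance.NoNA[, c("NR", "NS", "PA", "KINDERWR", "WM")]
    y <- lafrance.NoNA$G1L1WR
    fit.bma <- bicreg(x, y)   # BIC-based Bayesian model averaging
    summary(fit.bma)          # P(term != 0) and the best few models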

      Other questions:  What are your 5 explanatory variables, NR, NS, 
PA, KINDERWR, and WM?  In particular, are they numbers representing at 
least an ordinal scale, or are they categories?  If numbers, you've 
got possibilities for quadratic ("parabolic") terms that you haven't 
even considered.  If categories, how many levels are there?  If some 
of them have large numbers of levels, you may want to treat those 
factors as random effects.  In that case, you need something that does 
'mixed-effects' modeling; for that, people use the 'nlme' or 'lme4' 
packages. 
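
      Purely as an illustration -- I don't know whether any of your 
variables really is a many-level factor -- a random-intercept fit with 
'lme4' might look like the following, hypothetically treating KINDERWR 
as the grouping factor:

    library(lme4)
    ## hypothetical: KINDERWR as a grouping (random-effect) factor
    fit.lmer <- lmer(G1L1WR ~ PA + (1 | KINDERWR), data = lafrance.NoNA)
    summary(fit.lmer)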

      Have you tried to find a statistician with whom you could consult 
or even possibly collaborate on this? 
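
      On your auxiliary question:  what SPSS calls hierarchical 
regression is, in R, just a comparison of nested models with anova(). 
A minimal sketch, again assuming the 'lafrance.NoNA' data frame:

    fit4 <- lm(G1L1WR ~ NR + NS + KINDERWR + WM, data = lafrance.NoNA)
    fit5 <- update(fit4, . ~ . + PA)   # enter PA last
    anova(fit4, fit5)   # F test: does PA add anything after the other 4?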

      Hope this helps. 
      Spencer Graves

Jenifer Larson-Hall wrote:
> I don’t know if this question properly belongs on this list, but I’ll ask it here because I’ve been using R to run linear regression models, and it is only in using R (after switching from SPSS) that I have discovered the process of fitting a linear model. However, after reading Crawley (2002), Fox (2002), Verzani (2004), Dalgaard (2002), and of course searching the R-help archives, I cannot find an answer to my question.
> 	I have 5 explanatory variables (NR, NS, PA, KINDERWR, WM) and one response variable (G1L1WR). A simple main-effects model finds that only PA is statistically significant, and an anova comparison between the 5-variable main-effects model and the 1-variable main-effects model finds no difference between them. So it is possible to simplify the model to just G1L1WR ~ PA. This leaves me with a residual standard error of 0.3026 on 35 degrees of freedom and an adjusted R^2 of 0.552.
> 	I also decided, following Crawley’s (2002) advice, to create a maximal model, G1L1WR ~ NR*NS*PA*KINDERWR*WM. This full model is not a good fit, but running stepAIC on it revealed the model with the best fit:
>
> maximal.fit <- lm(formula = G1L1WR ~ NR + KINDERWR + NS + WM + PA + NR:KINDERWR + NR:NS + KINDERWR:NS + NR:WM + KINDERWR:WM + NS:WM + NR:PA + KINDERWR:PA + NS:PA + WM:PA + NR:KINDERWR:NS + NR:KINDERWR:WM + NR:NS:WM + KINDERWR:NS:WM + NR:NS:PA + KINDERWR:NS:PA + KINDERWR:WM:PA + NR:KINDERWR:NS:WM, data = lafrance.NoNA)
>
> All of the terms of this model have statistically significant t-tests, the residual standard error has gone down to 0.2102, and the adjusted R^2 has increased to 0.7839. An anova shows a clear difference between the simplified model and the maximal-fit model. My question is, should I really pick the maximal fit over the simple model when it is so much harder to understand? I guess there’s no easy answer to that, but if that’s so, then my question is: would there be anything wrong with me saying that sometimes you might value parsimony and ease of understanding over best fit? I don’t really know what the maximal-fit model buys you. It seems unintelligible to me. All of the terms are involved in interactions to some extent, but there are 4-way, 3-way, and 2-way interactions, and I’m not sure even how to understand it. A nice tree model showed that, at higher levels of PA, KINDERWR and NS affected scores. That I can understand, but it is not reflected in this model.
>
> 	An auxiliary question, probably easier to answer, is how I could do hierarchical linear regression. The authors knew from previous research that PA would be the largest contributor to the response variable, and their research question was whether PA would contribute anything AFTER the other 4 variables had already eaten their piece of the response-variable pie. I know how to do a hierarchical regression in SPSS, and I want to show in parallel how to do this in R. I did search the R-help archives and didn’t find anything that would just plain tell me how to do hierarchical linear regression.
>
> Thanks in advance for any help.
>
> Dr. Jenifer Larson-Hall
> Assistant Professor of Linguistics
> University of North Texas
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


