[R] model.matrix() and lm() for nested factors

Saul Sternberg saul at psych.upenn.edu
Sun Nov 21 21:34:17 CET 2010


To R-help list:

I would like to use lm() and lmRob() to estimate the parameters of a
fixed-effects model that includes nested factors with unequal numbers
of levels, in some cases without also including the nesting factor in
the model.

When I specify options(contrasts=c("contr.sum","contr.poly")), the model
matrix generated by model.matrix() can be incorrect.  I realize that I can
create the desired matrix myself, and use lm.fit(), or lmRob.fit.compute()
but there are two problems with this: First, we are warned against
using those functions directly, and, second, whereas the models I want to
apply are the same for about 100 different data frames, each with about
500 observations, a different model matrix would have to be constructed
for each data frame, increasing the chance of error.

-------------------------------------------------------------------
Here is a toy example:

Dataframe diff.df:
   G  D  T2  
1 g1 d1 -60 
2 g1 d2 -50 
3 g1 d3 -40 
4 g2 d1  20  
5 g2 d2  40  
6 g2 d3  60  
7 g2 d4  80  

(G, D factors; T2 numeric)

After
        options(contrasts=c("contr.sum","contr.poly"))

neither of the following produces the desired model matrix:

        model.matrix(T2 ~ G + D%in%G, diff.df)
        model.matrix(T2 ~  D%in%G, diff.df)

For example, omitting row numbers, the second command produces:

Icpt   Dd1:Gg1 Dd2:Gg1 Dd3:Gg1 Dd4:Gg1 Dd1:Gg2 Dd2:Gg2 Dd3:Gg2 Dd4:Gg2
1       1       0       0       0       0       0       0       0   
1       0       1       0       0       0       0       0       0   
1       0       0       1       0       0       0       0       0   
1       0       0       0       0       1       0       0       0   
1       0       0       0       0       0       1       0       0   
1       0       0       0       0       0       0       1       0   
1       0       0       0       0       0       0       0       1   

whereas the correct matrix is:

Icpt Gg1:D1 Gg1:D2 Gg2:D1 Gg2:D2 Gg2:D3
1      1      0      0      0      0
1      0      1      0      0      0
1     -1     -1      0      0      0
1      0      0      1      0      0
1      0      0      0      1      0
1      0      0      0      0      1
1      0      0     -1     -1     -1

-------------------------------------------------------------------
Other behavior of model.matrix() with nested factors in tiny examples:

When the nested factors had unequal numbers of levels and the nesting
        factor was not included in the model, as above, the matrix
        generated for contr.treatment was also wrong.

When the nested factors had unequal numbers of levels and the nesting
        factor was included in the model, the matrix for contr.sum was
        wrong, while the matrix for contr.treatment was wrong only in
        having an extra column of zeros.

When the nested factors had equal numbers of levels and the nesting factor
        was not included in the model, the correct matrix was generated
        for neither contr.sum nor contr.treatment.

When the nested factors had equal numbers of levels and the nesting factor
        was included in the model, the correct matrix was generated for
        both contr.sum and contr.treatment.
-------------------------------------------------------------------
My questions:

(1) If there is a way to use lm() and lmRob() in such cases so that they
generate the correct contrast matrices (and hence the desired parameter
estimates), what is it?

(2) If there is no way to do this, is the best alternative for the user to
create the desired model matrices "by hand" and provide them as arguments
to lim.fit() and lmRob.fit.compute()?

(3) If one uses lm.fit() and lmRob.fit.compute() directly in this way,
then, given that one is warned against doing so, what are the dangers?

Many thanks.

Saul Sternberg 
Psychology
University of Pennsylvania

----------------------------------------------------------------
R%%>sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
[5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8   
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
[9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods base     

loaded via a namespace (and not attached):
[1] tools_2.12.0



More information about the R-help mailing list