[R] lm: order of dropped columns

Thu Jul 22 03:19:57 CEST 2010

Hi all,

Thanks Dennis, do greatly appreciate the detailed reply. That does
answer a significant part of what I was getting at.

Unfortunately, I did not ask my question very well. Apologize. Suppose I run

> dat <- data.frame(a = factor(rep(letters[1:3], each = 4)),
+                 b = factor(rep(rep(1:2, each = 2), 3)),
+                 c = rnorm(12), d=rnorm(12))
> dat$e<-0.999*dat$d

I am (intentionally) making e "almost" equal to d to trigger the rank
condition. Subsequently, I run

> summary(lm(c~a+b+d+e, data=dat))
Call:
lm(formula = c ~ a + b + d + e, data = dat)
Residuals:
    Min      1Q  Median      3Q     Max
-1.0726 -0.5578  0.1132  0.4296  1.3397
Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.2564     0.5288   0.485    0.643
ab            1.0253     0.7234   1.417    0.199
ac           -0.5434     0.6869  -0.791    0.455
b2           -0.5521     0.5513  -1.002    0.350
d            -0.4098     0.2851  -1.438    0.194
e                 NA         NA      NA       NA
Residual standard error: 0.91 on 7 degrees of freedom
Multiple R-squared: 0.5177, Adjusted R-squared: 0.2421
F-statistic: 1.878 on 4 and 7 DF,  p-value: 0.2191

R appropriately drops e from the design matrix to make X'X
nonsingular. I am wondering how R decides (what is the convention)
whether to drop d, or to drop e. Are terms added from left to right of
the formula, and hence d retained and e dropped?

More generally, if presented with a non-singular design matrix with
column names (x1, x2, x3, x4, x5), what algorithm is used by R to
figure out which columns to drop. Eg: suppose the design matrix with
five columns (x1, x2, x3, x4, x5) has rank 3, and x3 = a1x1 + a2x2,
and x5 = a4x4. Which 2 columns would R try to drop?

I am mostly asking these questions because I am in the process of
working on a package. I want to make sure that my treatment of
singular design matrices is, as far as possible, compatible with lm.

Thanks again!

Best,
Anirban

On Wed, Jul 21, 2010 at 1:41 PM, Dennis Murphy <djmuser at gmail.com> wrote:
> Hi:
>
> (1) lm() drops columns in a rank-deficient model matrix X to make X'X
> nonsingular - this is called a full-rank reparameterization of the linear
> model.
>
> (2) How many columns of X are dropped depends on its rank, which in turn
> depends on the number of constraints in the model matrix. This is related to
> the degrees of freedom associated with each term in the corresponding linear
> model.. The default contrasts are
>> options()$contrasts
>         unordered           ordered
> "contr.treatment"      "contr.poly"
>
> Other choices include contr.helmert() and contr.sum(). See the help page
> ?contrasts for further details. See also section 6.2 of Venables and
> Ripley's _Modern Applied Statistics with S, 4th ed._ for further information
> on the connection between the columns of the model matrix in ANOVA and the
> defined sets of contrasts.
>
> Under the default contrasts, the first column is dropped for the main effect
> terms. Here's a simple example of a balanced 2-way ANOVA with interaction:
>
> d <- data.frame(a = factor(rep(letters[1:3], each = 4)),
>                 b = factor(rep(rep(1:2, each = 2), 3)),
>                 c = rnorm(12))
>> d
>    a b           c
> 1  a 1 -0.77367688
> 2  a 1 -0.79069791
> 3  a 2  0.69257133
> 4  a 2  2.46788204
> 5  b 1  0.38892289
> 6  b 1 -0.03521033
> 7  b 2 -0.01071611
> 8  b 2 -0.74209425
> 9  c 1  1.36974281
> 10 c 1 -1.22775441
> 11 c 2  0.29621976
> 12 c 2  0.28208192
> m <- aov(c ~ a * b, data = d)
> model.matrix(m)
>    (Intercept) ab ac b2 ab:b2 ac:b2
> 1            1  0  0  0     0     0
> 2            1  0  0  0     0     0
> 3            1  0  0  1     0     0
> 4            1  0  0  1     0     0
> 5            1  1  0  0     0     0
> 6            1  1  0  0     0     0
> 7            1  1  0  1     1     0
> 8            1  1  0  1     1     0
> 9            1  0  1  0     0     0
> 10           1  0  1  0     0     0
> 11           1  0  1  1     0     1
> 12           1  0  1  1     0     1
> attr(,"assign")
> [1] 0 1 1 2 3 3
> attr(,"contrasts")
> attr(,"contrasts")$a
> [1] "contr.treatment"
>
> attr(,"contrasts")$b
> [1] "contr.treatment"
>
> Notice that the first column of each main effect is dropped, and that the
> interaction columns are the products of the retained a columns with the
> retained b columns. The attr() components verify that the contrast method
> used for this matrix is the default contr.treatment.
>
>> anova(m)
> Analysis of Variance Table
>
> Response: c
>           Df Sum Sq Mean Sq F value Pr(>F)
> a          2 0.5001 0.25003  0.2827 0.7633
> b          1 1.3700 1.36999  1.5489 0.2597
> a:b        2 4.5647 2.28235  2.5804 0.1554
> Residuals  6 5.3070 0.88450
>
> Examination of the degrees of freedom tells us that there are two
> independent contrasts for a, one independent contrast for b and two
> independent contrasts for the interaction of a and b, which are shown in
> model.matrix(m) above.
>
> If you want to reset the baselline factor level, see ?relevel. Also look
> into the contrast package on CRAN.
>
> HTH,
> Dennis
>
> On Wed, Jul 21, 2010 at 8:40 AM, Anirban Mukherjee <am253 at cornell.edu>
> wrote:
>>
>> Hi all,
>>
>> If presented with a singular design matrix, lm drops columns to make the
>> design matrix non-singular. What algorithm is used to select which (and
>> how
>> many) column(s) to drop? Particularly, given a factor, how does lm choose
>> levels of the factor to discard?
>>
>> Thanks for the help.
>>
>> Best,
>> Anirban
>>
>>        [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>