[R] Interpreting model matrix columns when using contr.sum

Sun Jan 25 16:48:32 CET 2009

On Fri, Jan 23, 2009 at 4:58 PM, Gang Chen <gangchen6 at gmail.com> wrote:
> With the following example using contr.sum for both factors,
>
>> dd <- data.frame(a = gl(3,4), b = gl(4,1,12))     # balanced 2-way
>> model.matrix(~ a * b, dd, contrasts = list(a="contr.sum", b="contr.sum"))
>
>   (Intercept) a1 a2 b1 b2 b3 a1:b1 a2:b1 a1:b2 a2:b2 a1:b3 a2:b3
> 1            1  1  0  1  0  0     1     0     0     0     0     0
> 2            1  1  0  0  1  0     0     0     1     0     0     0
> 3            1  1  0  0  0  1     0     0     0     0     1     0
> 4            1  1  0 -1 -1 -1    -1     0    -1     0    -1     0
> 5            1  0  1  1  0  0     0     1     0     0     0     0
> 6            1  0  1  0  1  0     0     0     0     1     0     0
> 7            1  0  1  0  0  1     0     0     0     0     0     1
> 8            1  0  1 -1 -1 -1     0    -1     0    -1     0    -1
> 9            1 -1 -1  1  0  0    -1    -1     0     0     0     0
> 10           1 -1 -1  0  1  0     0     0    -1    -1     0     0
> 11           1 -1 -1  0  0  1     0     0     0     0    -1    -1
> 12           1 -1 -1 -1 -1 -1     1     1     1     1     1     1
> ...

> I have two questions:

> (1) I assume the 1st column (under intercept) is the overall mean, the
> 2nd column (under a1) is the difference between the 1st level of
> factor a and the overall mean, the 4th column (under b1) is the
> difference between the 1st level of factor b and the overall mean.

> Is this interpretation correct?

I don't think so and furthermore I don't see why the contrasts should
have an interpretation.  The contrasts are simply a parameterization
of the space spanned by the indicator columns of the levels of the
factors.  Interpretations as overall means, etc. are mostly a holdover
from antiquated concepts of how analysis of variance tables should be
evaluated.
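
As a rough illustration of the "same space, different parameterization"
point, the sketch below uses only factor a from the example data and a
simulated response (the response and the seed are assumptions made up
for the illustration).  It checks that the sum-contrast columns and the
plain indicator columns span the same space and so give identical
fitted values.

```r
dd <- data.frame(a = gl(3, 4), b = gl(4, 1, 12))   # data from the question

# Indicator columns for the levels of a, and the contr.sum parameterization
X_ind <- model.matrix(~ 0 + a, dd)
X_sum <- model.matrix(~ a, dd, contrasts.arg = list(a = "contr.sum"))

# If both span the same 3-dimensional space, binding them adds no rank
rank_both <- qr(cbind(X_ind, X_sum))$rank

# Simulated response (an assumption, for illustration only)
set.seed(1)
y <- rnorm(nrow(dd))

# Least-squares fitted values depend only on the column space
fit_ind <- unname(lm.fit(X_ind, y)$fitted.values)
fit_sum <- unname(lm.fit(X_sum, y)$fitted.values)

print(rank_both)
print(all.equal(fit_ind, fit_sum))
```

Only the coefficients change between the two parameterizations; the
fitted model does not.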

If you want to determine the interpretation of particular coefficients
for the special case of a balanced design, the easiest way of doing so
is to invert the matrix formed by binding a column of ones to the
contrast matrix.  (I think this is right, but I can manage to confuse
myself on this with great ease.)  Bear in mind that a balanced design
doesn't always mean a resulting balanced data set: I remind my students
that expecting a balanced design to produce balanced data is contrary
to Murphy's Law.  The calculation is

> contr.sum(3)
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1
> solve(cbind(1, contr.sum(3)))
              1          2          3
[1,]  0.3333333  0.3333333  0.3333333
[2,]  0.6666667 -0.3333333 -0.3333333
[3,] -0.3333333  0.6666667 -0.3333333
> solve(cbind(1, contr.sum(4)))
         1     2     3     4
[1,]  0.25  0.25  0.25  0.25
[2,]  0.75 -0.25 -0.25 -0.25
[3,] -0.25  0.75 -0.25 -0.25
[4,] -0.25 -0.25  0.75 -0.25

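Each row gives the weights applied to the level means to produce one
coefficient.  That is, the first coefficient is the "overall mean"
(equal weights on the levels, which matches the mean of the
observations only for a balanced data set), the second is a contrast
of the first level with the others (the first level's mean minus the
average of all the level means), the third is a contrast of the second
level with the others and so on.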
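
A quick way to check this reading is to fit a one-factor model and
compare its coefficients with the weights applied to the observed level
means.  This is only a sketch: the response y is simulated and the seed
is arbitrary, neither comes from the original example.

```r
dd <- data.frame(a = gl(3, 4), b = gl(4, 1, 12))   # data from the question

# Simulated response (an assumption, for illustration only)
set.seed(42)
dd$y <- rnorm(nrow(dd))

# One-factor fit with sum-to-zero contrasts
fit <- lm(y ~ a, data = dd, contrasts = list(a = "contr.sum"))

# Weights that map level means to coefficients
W <- solve(cbind(1, contr.sum(3)))

# Observed mean of y within each level of a
level_means <- as.vector(tapply(dd$y, dd$a, mean))

# Coefficients reconstructed from the level means
from_means <- drop(W %*% level_means)

print(cbind(coef = coef(fit), from_means = from_means))
```

The two columns should agree.  Because each level has four
observations here, the intercept also equals mean(dd$y); with unequal
counts it would instead be the unweighted average of the level means.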

> (2) I'm not so sure about those interaction columns. For example, what
> is a1:b1? Is it the 1st level of factor a at the 1st level of factor b
> versus the overall mean, or something more complicated?

Well, at the risk of sounding trivial, a1:b1 is the product of the a1
and b1 columns.  You need a basis for a certain subspace and this
provides one.  I don't see why there must be interpretations of the
coefficients.
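
The product statement can be checked directly on the model matrix from
the question.  The last step is an assumption on my part rather than
anything established above: because every row of this particular design
is a distinct cell, the model matrix is square, and inverting it gives
the weights on the cell means for each coefficient, should anyone
insist on having them.

```r
dd <- data.frame(a = gl(3, 4), b = gl(4, 1, 12))   # data from the question
X <- model.matrix(~ a * b, dd,
                  contrasts.arg = list(a = "contr.sum", b = "contr.sum"))

# a1:b1 is the elementwise product of the a1 and b1 columns
a1b1_is_product <- all(X[, "a1:b1"] == X[, "a1"] * X[, "b1"])

# The same holds for every interaction column
inter <- grep(":", colnames(X), value = TRUE)
all_products <- all(vapply(inter, function(nm) {
  parts <- strsplit(nm, ":", fixed = TRUE)[[1]]
  all(X[, nm] == X[, parts[1]] * X[, parts[2]])
}, logical(1)))

# One row per cell here, so X is 12 x 12 and can be inverted;
# the row for a1:b1 holds the weights on the 12 cell means
w_a1b1 <- solve(X)["a1:b1", ]

print(a1b1_is_product)
print(all_products)
print(round(w_a1b1, 4))
```

The weights sum to zero, so the coefficient is a contrast among the
cell means, but not one with a tidy verbal description.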