[R] Interpreting model matrix columns when using contr.sum

Sun Jan 25 18:23:14 CET 2009

Many thanks to both Drs. Bates and Fox for the help!

I also figured out yesterday what Dr. Fox just described regarding the
interpretation of those coefficients for a balanced design. Thanks,
Dr. Bates, for the suggestion of using solve(cbind(1, contr.sum(4))) to
sort out the factor level effects. Model validation is very important,
but interpreting those coefficients, at least in the case of balanced
designs, also provides some insight into the various effects for
people working in the field.
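
For the record, here is a minimal sketch of how the balanced-design
interpretation can be checked numerically. The response y is simulated
and purely hypothetical (it is not from the thread), as are the names
fit, cf, grand, a_mean, b_mean and the ok_* flags; the layout dd is the
one from my original example.

```r
# Hypothetical simulated response on the balanced 3 x 4 layout
set.seed(1)
dd <- data.frame(a = gl(3, 4), b = gl(4, 1, 12))
dd$y <- rnorm(nrow(dd))

fit <- lm(y ~ a * b, data = dd,
          contrasts = list(a = "contr.sum", b = "contr.sum"))
cf <- coef(fit)

grand  <- mean(dd$y)                          # general mean
a_mean <- as.vector(tapply(dd$y, dd$a, mean)) # means at each level of a
b_mean <- as.vector(tapply(dd$y, dd$b, mean)) # means at each level of b

# Intercept = general mean; a1, b1 = level mean minus general mean
ok_int <- isTRUE(all.equal(cf[["(Intercept)"]], grand))
ok_a1  <- isTRUE(all.equal(cf[["a1"]], a_mean[1] - grand))
ok_b1  <- isTRUE(all.equal(cf[["b1"]], b_mean[1] - grand))

# a1:b1 = cell mean minus the general mean and the a1 and b1 coefficients
cell11 <- mean(dd$y[dd$a == "1" & dd$b == "1"])
ok_ab  <- isTRUE(all.equal(cf[["a1:b1"]],
                           cell11 - cf[["(Intercept)"]] - cf[["a1"]] - cf[["b1"]]))

# solve(cbind(1, contr.sum(3))) maps the level means of a
# to (Intercept), a1, a2 in this balanced case
m_a <- drop(solve(cbind(1, contr.sum(3))) %*% a_mean)
ok_solve <- isTRUE(all.equal(as.vector(m_a),
                             unname(cf[c("(Intercept)", "a1", "a2")])))

print(c(ok_int, ok_a1, ok_b1, ok_ab, ok_solve))
```

With one observation per cell the model is saturated, so the checks are
exact up to rounding; they rely on the data being balanced.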

Gang

On Sun, Jan 25, 2009 at 11:25 AM, John Fox <jfox at mcmaster.ca> wrote:
> Dear Doug and Gang Chen,
>
> With balanced data and sum-to-zero contrasts, the intercept is indeed the
> general mean of the response; the coefficient of a1 is the mean of the
> response in category a1 minus the general mean; the coefficient of a1:b1 is
> the mean of the response in cell a1, b1 minus the general mean and the
> coefficients of a1 and b1; etc. For unbalanced data (and balanced data) the
> intercept is the mean of the cell means; the coefficient of a1 is the mean
> of cell means at level a1 minus the intercept; etc. Whether all this is of
> interest is another question, since a simple graph of cell means tells a
> more digestible story about the data.
>
> Regards,
>  John
>
> ------------------------------
> John Fox, Professor
> Department of Sociology
> McMaster University
> Hamilton, Ontario, Canada
> web: socserv.mcmaster.ca/jfox
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>> On Behalf Of Douglas Bates
>> Sent: January-25-09 10:49 AM
>> To: Gang Chen
>> Cc: R-help
>> Subject: Re: [R] Interpreting model matrix columns when using contr.sum
>>
>> On Fri, Jan 23, 2009 at 4:58 PM, Gang Chen <gangchen6 at gmail.com> wrote:
>> > With the following example using contr.sum for both factors,
>> >
>> >> dd <- data.frame(a = gl(3,4), b = gl(4,1,12))     # balanced 2-way
>> >> model.matrix(~ a * b, dd, contrasts = list(a="contr.sum", b="contr.sum"))
>> >
>> >   (Intercept) a1 a2 b1 b2 b3 a1:b1 a2:b1 a1:b2 a2:b2 a1:b3 a2:b3
>> > 1            1  1  0  1  0  0     1     0     0     0     0     0
>> > 2            1  1  0  0  1  0     0     0     1     0     0     0
>> > 3            1  1  0  0  0  1     0     0     0     0     1     0
>> > 4            1  1  0 -1 -1 -1    -1     0    -1     0    -1     0
>> > 5            1  0  1  1  0  0     0     1     0     0     0     0
>> > 6            1  0  1  0  1  0     0     0     0     1     0     0
>> > 7            1  0  1  0  0  1     0     0     0     0     0     1
>> > 8            1  0  1 -1 -1 -1     0    -1     0    -1     0    -1
>> > 9            1 -1 -1  1  0  0    -1    -1     0     0     0     0
>> > 10           1 -1 -1  0  1  0     0     0    -1    -1     0     0
>> > 11           1 -1 -1  0  0  1     0     0     0     0    -1    -1
>> > 12           1 -1 -1 -1 -1 -1     1     1     1     1     1     1
>> > ...
>>
>> > I have two questions:
>>
>> > (1) I assume the 1st column (under intercept) is the overall mean, the
>> > 2nd column (under a1) is the difference between the 1st level of
>> > factor a and the overall mean, the 4th column (under b1) is the
>> > difference between the 1st level of factor b and the overall mean.
>>
>> > Is this interpretation correct?
>>
>> I don't think so and furthermore I don't see why the contrasts should
>> have an interpretation.  The contrasts are simply a parameterization
>> of the space spanned by the indicator columns of the levels of the
>> factors.  Interpretations as overall means, etc. are mostly a holdover
>> from antiquated concepts of how analysis of variance tables should be
>> evaluated.
>>
>> If you want to determine the interpretation of particular coefficients
>> for the special case of a balanced design (which doesn't always mean a
>> resulting balanced data set - I remind my students that expecting a
>> balanced design to produce balanced data is contrary to Murphy's Law)
>> the easiest way of doing so is (I think this is right but I can
>> somehow manage to confuse myself on this with great ease) to calculate
>>
>> > contr.sum(3)
>>   [,1] [,2]
>> 1    1    0
>> 2    0    1
>> 3   -1   -1
>> > solve(cbind(1, contr.sum(3)))
>>               1          2          3
>> [1,]  0.3333333  0.3333333  0.3333333
>> [2,]  0.6666667 -0.3333333 -0.3333333
>> [3,] -0.3333333  0.6666667 -0.3333333
>> > solve(cbind(1, contr.sum(4)))
>>          1     2     3     4
>> [1,]  0.25  0.25  0.25  0.25
>> [2,]  0.75 -0.25 -0.25 -0.25
>> [3,] -0.25  0.75 -0.25 -0.25
>> [4,] -0.25 -0.25  0.75 -0.25
>>
>> That is, the first coefficient is the "overall mean" (but only for a
>> balanced data set), the second is a contrast of the first level with
>> the others, the third is a contrast of the second level with the
>> others and so on.
>>
>> > (2) I'm not so sure about those interaction columns. For example, what
>> > is a1:b1? Is it the 1st level of factor a at the 1st level of factor b
>> > versus the overall mean, or something more complicated?
>>
>> Well, at the risk of sounding trivial, a1:b1 is the product of the a1
>> and b1 columns.  You need a basis for a certain subspace and this
>> provides one.  I don't see why there must be interpretations of the
>> coefficients.
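
A short sketch of the last point, assuming the same balanced layout dd
as in the example quoted above (X, prod_ok and rank_X are names made up
for illustration): the a1:b1 column is the elementwise product of the
a1 and b1 columns, and the twelve columns together have full rank, so
they form a basis for the space spanned by the twelve cell indicators.

```r
dd <- data.frame(a = gl(3, 4), b = gl(4, 1, 12))
X <- model.matrix(~ a * b, dd,
                  contrasts.arg = list(a = "contr.sum", b = "contr.sum"))

# The interaction column is the product of the two main-effect columns
prod_ok <- all(X[, "a1:b1"] == X[, "a1"] * X[, "b1"])

# Rank of the model matrix (number of linearly independent columns)
rank_X <- qr(X)$rank

print(prod_ok)
print(c(rank = rank_X, columns = ncol(X)))
```

The same product check can be repeated for any other interaction
column, for example a2:b3 against a2 and b3.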