[R] why NA coefficients

Tue Nov 8 19:22:37 CET 2011

The cell mean mu_{12} is non-estimable because it has no data in the
cell. How can you estimate something that's not there (at least
without imputation :)? Every parametric function that involves mu_{12}
will also be non-estimable - in particular,  the interaction term and
the population marginal means . That's why you get the NA estimates
and the warning. All this follows from the linear model theory
described in, for example, Milliken and Johnson (1992), Analysis of
Messy Data, vol. 1, ch. 13.

Here's an example from Milliken and Johnson (1992) to illustrate:
          B1         B2       B3
T1      2, 6                   8, 6
T2        3          14      12, 9
T3        6           9

Assume a cell means model E(Y_{ijk}) = \mu_{ij}, where the cell means
are estimated by the cell averages.

>From M & J (p. 173, whence this example is taken):
"Whenever treatment combinations are missing, certain
hypotheses cannot be tested without making some
additional assumptions about the parameters in the model.
Hypotheses involving parameters corresponding to the
missing cells generally cannot be tested. For example,
for the data [above] it is not possible to estimate any
linear combinations (or to test any hypotheses) that
involve parameters \mu_{12} and \mu_{33} unless one
is willing to make some assumptions about them."

They continue:
"One common assumption is that there is no
interactions between the levels of T and the levels of B.
In our opinion, this assumption should not be made
without some supporting experimental evidence."

In other words, removing the interaction term makes the
non-estimability problem disappear, but it's a copout unless there is
some tangible scientific justification for an additive rather than an
interaction model.

For the above data, M & J note that it is not possible to estimate all
of the expected marginal means - in particular, one cannot estimate
the population marginal means $\bar{\mu}_{1.}$, $\bar{\mu}_{3.}$,
$\bar{\mu}_{.2}$ or $\bar{\mu}_{.3}$. OTOH, $\bar{\mu}_{2.}$ and
$\bar{\mu}_{.1}$ since these functions of the parameters involve terms
associated with the means of the missing cells. Moreover, any
hypotheses involving parametric functions that contain non-estimable
cell means are not testable. In this example, the test of equal row
population marginal means is not testable because $\bar{\mu}_{1.}$ and
$\bar{\mu}_{3.}$ are not estimable.

[Aside: if the term parametric function is not familiar, in this
context it refers to linear combinations of model parameters.  In the
M & J example, $\bar{\mu}_{1.} = \mu_{11} + \mu_{12} + \mu_{13}$ is a
parametric function.]

Hopefully this sheds some light on the situation.

Dennis

On Mon, Nov 7, 2011 at 10:17 PM, array chip <arrayprofile at yahoo.com> wrote:
> Hi Dennis,
> The cell mean mu_12 from the model involves the intercept and factor 2:
> Coefficients:
>                   (Intercept)                 factor(treat)2
> factor(treat)3
>                      0.429244
> 0.499982                       0.352971
>                factor(treat)4                 factor(treat)5
> factor(treat)6
>                     -0.204752
> 0.142042                       0.044155
>                factor(treat)7                 factor(group)2
> factor(treat)2:factor(group)2
>                     -0.007775
> -0.337907                      -0.208734
> factor(treat)3:factor(group)2  factor(treat)4:factor(group)2
> factor(treat)5:factor(group)2
>                     -0.195138
> 0.800029                       0.227514
> factor(treat)6:factor(group)2  factor(treat)7:factor(group)2
>                      0.331548                             NA
> So mu_12 = 0.429244-0.337907 = 0.091337. This can be verified by:
>> predict(fit,data.frame(list(treat=1,group=2)))
>          1
> 0.09133691
> Warning message:
> In predict.lm(fit, data.frame(list(treat = 1, group = 2))) :
>   prediction from a rank-deficient fit may be misleading
>
> But as you can see, it gave a warning about rank-deficient fit... why this
> is a rank-deficient fit?
> Because "treat 1_group 2" has no cases, so why it is still estimable while
> on the contrary, "treat 7_group 2" which has 2 cases is not?
> Thanks
> John
>
>
>
>
> ________________________________
> From: Dennis Murphy <djmuser at gmail.com>
> To: array chip <arrayprofile at yahoo.com>
> Sent: Monday, November 7, 2011 9:29 PM
> Subject: Re: [R] why NA coefficients
>
> Hi John:
>
> What is the estimate of the cell mean \mu_{12}? Which model effects
> involve that cell mean? With this data arrangement, the expected
> population marginal means of treatment 1 and group 2 are not estimable
> either, unless you're willing to assume a no-interaction model.
>
> Chapters 13 and 14 of Milliken and Johnson's Analysis of Messy Data
> (vol. 1) cover this topic in some detail, but it assumes you're
> familiar with the matrix form of a linear statistical model. Both
> chapters cover the two-way model with interaction - Ch.13 from the
> cell means model approach and Ch. 14 from the model effects approach.
> Because this was written in the mid 80s and republished in the early
> 90s, all the code used is in SAS.
>
> HTH,
> Dennis
>
> On Mon, Nov 7, 2011 at 7:07 PM, array chip <arrayprofile at yahoo.com> wrote:
>> Thanks David. The only category that has no cases is "treat 1-group 2":
>>
>>> with(test,table(treat,group))
>>      group
>> treat 1 2
>>     1 8 0
>>     2 1 5
>>     3 5 5
>>     4 7 3
>>     5 7 4
>>     6 3 3
>>     7 8 2
>>
>> But why the coefficient for "treat 7-group 2" is not estimable?
>>
>> Thanks
>>
>> John
>>
>>
>>
>>
>> ________________________________
>> From: David Winsemius <dwinsemius at comcast.net>
>>
>> Cc: "r-help at r-project.org" <r-help at r-project.org>
>> Sent: Monday, November 7, 2011 5:13 PM
>> Subject: Re: [R] why NA coefficients
>>
>>
>> On Nov 7, 2011, at 7:33 PM, array chip wrote:
>>
>>> Hi, I am trying to run ANOVA with an interaction term on 2 factors (treat
>>> has 7 levels, group has 2 levels). I found the coefficient for the last
>>> interaction term is always 0, see attached dataset and the code below:
>>>
>>>> test<-read.table("test.txt",sep='\t',header=T,row.names=NULL)
>>>> lm(y~factor(treat)*factor(group),test)
>>>
>>> Call:
>>> lm(formula = y ~ factor(treat) * factor(group), data = test)
>>>
>>> Coefficients:
>>>                   (Intercept)                 factor(treat)2
>>>    factor(treat)3
>>>                      0.429244                       0.499982
>>>          0.352971
>>>                factor(treat)4                 factor(treat)5
>>>    factor(treat)6
>>>                     -0.204752                       0.142042
>>>          0.044155
>>>                factor(treat)7                 factor(group)2
>>> factor(treat)2:factor(group)2
>>>                     -0.007775                      -0.337907
>>>         -0.208734
>>> factor(treat)3:factor(group)2  factor(treat)4:factor(group)2
>>> factor(treat)5:factor(group)2
>>>                     -0.195138                       0.800029
>>>          0.227514
>>> factor(treat)6:factor(group)2  factor(treat)7:factor(group)2
>>>                      0.331548                             NA
>>>
>>>
>>> I guess this is due to model matrix being singular or collinearity among
>>> the matrix columns? But I can't figure out how the matrix is singular in
>>> this case? Can someone show me why this is the case?
>>
>> Because you have no cases in one of the crossed categories.
>>
>> --David Winsemius, MD
>> West Hartford, CT
>>        [[alternative HTML version deleted]]
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
>
>