[R] why NA coefficients
William Dunlap
wdunlap at tibco.com
Tue Nov 8 20:32:29 CET 2011
It might make the discussion easier to follow if you used
a smaller dataset that anyone can make and did some experiments
with contrasts. E.g.,
> D <- data.frame(expand.grid(X1=LETTERS[1:3], X2=letters[24:26])[-1,], Y=2^(1:8))
> D
X1 X2 Y
2 B x 2
3 C x 4
4 A y 8
5 B y 16
6 C y 32
7 A z 64
8 B z 128
9 C z 256
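A quick cross-tabulation (reusing the data frame D above) makes the empty cell explicit:

```r
# Reconstruct D as above and tabulate the design: cell (A, x) has no data.
D <- data.frame(expand.grid(X1 = LETTERS[1:3], X2 = letters[24:26])[-1, ],
                Y = 2^(1:8))
with(D, table(X1, X2))   # count 0 for (A, x), 1 everywhere else
```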
> lm(data=D, Y ~ X1 * X2)
Call:
lm(formula = Y ~ X1 * X2, data = D)
Coefficients:
(Intercept) X1B X1C
-188 190 192
X2y X2z X1B:X2y
196 252 -182
X1C:X2y X1B:X2z X1C:X2z
-168 -126 NA
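Before experimenting with contrasts, you can confirm where the rank deficiency sits. A small sketch, reusing D from above: qr() gives the rank of the model matrix, and alias() reports the linear dependency behind the NA:

```r
# Sketch: confirm the fit above is rank-deficient and locate the aliased term.
D <- data.frame(expand.grid(X1 = LETTERS[1:3], X2 = letters[24:26])[-1, ],
                Y = 2^(1:8))
fit <- lm(Y ~ X1 * X2, data = D)

ncol(model.matrix(fit))          # 9 coefficients requested ...
qr(model.matrix(fit))$rank       # ... but the design only has rank 8
names(which(is.na(coef(fit))))   # "X1C:X2z" is the column lm() dropped
alias(fit)                       # how the dropped column depends on the rest
```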
> lm(data=D, Y ~ X1 * X2, contrasts=list(X2="contr.SAS"))
Call:
lm(formula = Y ~ X1 * X2, data = D, contrasts = list(X2 = "contr.SAS"))
Coefficients:
(Intercept) X1B X1C
64 64 192
X2x X2y X1B:X2x
-252 -56 126
X1C:X2x X1B:X2y X1C:X2y
NA -56 -168
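This also shows why predictions into the empty cell should not be trusted: the value lm() produces for (A, x) is just whatever the implicit "aliased coefficient = 0" constraint happens to imply, and it changes if you merely relevel a factor. A sketch:

```r
# Sketch: the "prediction" for the empty cell (A, x) is an artifact of the
# parameterization, so releveling a factor changes the answer.
D <- data.frame(expand.grid(X1 = LETTERS[1:3], X2 = letters[24:26])[-1, ],
                Y = 2^(1:8))
fit1 <- lm(Y ~ X1 * X2, data = D)
fit2 <- lm(Y ~ X1 * X2, data = transform(D, X1 = relevel(X1, "B")))

new <- data.frame(X1 = "A", X2 = "x")   # the cell with no observations
suppressWarnings(predict(fit1, new))    # -188
suppressWarnings(predict(fit2, new))    # -62: same cell, different "estimate"
```

(predict() also warns that prediction from a rank-deficient fit may be misleading, which is exactly the point.)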
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of array chip
> Sent: Tuesday, November 08, 2011 10:57 AM
> To: Dennis Murphy
> Cc: r-help at r-project.org
> Subject: Re: [R] why NA coefficients
>
> Hi Dennis, Thanks very much for the details. All those explanations about why mu_{12} is
> non-estimable when its cell has no data make sense to me!
>
> Regarding my specific example data: mu_{12} should NOT be estimable in a linear model with
> interaction because its cell has no data, yet the model fitted with lm() in R still CAN
> estimate mu_{12}; on the other hand, mu_{72} is NOT estimable from lm() even though that
> category does have data. Does this contradiction with the theory imply that the linear model
> fitted by lm() in R on my specific example data is not reliable/trustworthy and should not be used?
>
> Thanks
>
> John
>
>
>
> ________________________________
> From: Dennis Murphy <djmuser at gmail.com>
>
> Cc: "r-help at r-project.org" <r-help at r-project.org>
> Sent: Tuesday, November 8, 2011 10:22 AM
> Subject: Re: [R] why NA coefficients
>
> The cell mean mu_{12} is non-estimable because it has no data in the
> cell. How can you estimate something that's not there (at least
> without imputation :)? Every parametric function that involves mu_{12}
> will also be non-estimable - in particular, the interaction term and
> the population marginal means. That's why you get the NA estimates
> and the warning. All this follows from the linear model theory
> described in, for example, Milliken and Johnson (1992), Analysis of
> Messy Data, vol. 1, ch. 13.
>
> Here's an example from Milliken and Johnson (1992) to illustrate
> (cells T1:B2 and T3:B3 are empty):
>
>           B1      B2      B3
>     T1    2, 6    --      8, 6
>     T2    3       14      12, 9
>     T3    6       9       --
>
> Assume a cell means model E(Y_{ijk}) = \mu_{ij}, where the cell means
> are estimated by the cell averages.
>
> From M & J (p. 173, whence this example is taken):
> "Whenever treatment combinations are missing, certain
> hypotheses cannot be tested without making some
> additional assumptions about the parameters in the model.
> Hypotheses involving parameters corresponding to the
> missing cells generally cannot be tested. For example,
> for the data [above] it is not possible to estimate any
> linear combinations (or to test any hypotheses) that
> involve parameters \mu_{12} and \mu_{33} unless one
> is willing to make some assumptions about them."
>
> They continue:
> "One common assumption is that there is no
> interaction between the levels of T and the levels of B.
> In our opinion, this assumption should not be made
> without some supporting experimental evidence."
>
> In other words, removing the interaction term makes the
> non-estimability problem disappear, but it's a copout unless there is
> some tangible scientific justification for an additive rather than an
> interaction model.
>
> For the above data, M & J note that it is not possible to estimate all
> of the expected marginal means - in particular, one cannot estimate
> the population marginal means $\bar{\mu}_{1.}$, $\bar{\mu}_{3.}$,
> $\bar{\mu}_{.2}$ or $\bar{\mu}_{.3}$, since these functions of the
> parameters involve terms associated with the means of the missing
> cells. OTOH, $\bar{\mu}_{2.}$ and $\bar{\mu}_{.1}$ are estimable,
> because row 2 and column 1 contain no empty cells. Moreover, any
> hypotheses involving parametric functions that contain non-estimable
> cell means are not testable. In this example, the test of equal row
> population marginal means is not testable because $\bar{\mu}_{1.}$ and
> $\bar{\mu}_{3.}$ are not estimable.
>
> [Aside: if the term parametric function is not familiar, in this
> context it refers to linear combinations of model parameters. In the
> M & J example, $\bar{\mu}_{1.} = (\mu_{11} + \mu_{12} + \mu_{13})/3$ is a
> parametric function.]
>
> Hopefully this sheds some light on the situation.
>
> Dennis
>
>
> > Hi Dennis,
> > The cell mean mu_12 from the model involves the intercept and factor 2:
> > Coefficients:
> >   (Intercept)                     0.429244
> >   factor(treat)2                  0.499982
> >   factor(treat)3                  0.352971
> >   factor(treat)4                 -0.204752
> >   factor(treat)5                  0.142042
> >   factor(treat)6                  0.044155
> >   factor(treat)7                 -0.007775
> >   factor(group)2                 -0.337907
> >   factor(treat)2:factor(group)2  -0.208734
> >   factor(treat)3:factor(group)2  -0.195138
> >   factor(treat)4:factor(group)2   0.800029
> >   factor(treat)5:factor(group)2   0.227514
> >   factor(treat)6:factor(group)2   0.331548
> >   factor(treat)7:factor(group)2         NA
> > So mu_12 = 0.429244-0.337907 = 0.091337. This can be verified by:
> >> predict(fit,data.frame(list(treat=1,group=2)))
> > 1
> > 0.09133691
> > Warning message:
> > In predict.lm(fit, data.frame(list(treat = 1, group = 2))) :
> > prediction from a rank-deficient fit may be misleading
> >
> > But as you can see, it gave a warning about a rank-deficient fit... why
> > is this a rank-deficient fit?
> > "treat 1_group 2" has no cases, so why is mu_12 still estimable, while on
> > the contrary "treat 7_group 2", which has 2 cases, is not?
> > Thanks
> > John
> >
> >
> >
> >
> > ________________________________
> > From: Dennis Murphy <djmuser at gmail.com>
>
> > Sent: Monday, November 7, 2011 9:29 PM
> > Subject: Re: [R] why NA coefficients
> >
> > Hi John:
> >
> > What is the estimate of the cell mean \mu_{12}? Which model effects
> > involve that cell mean? With this data arrangement, the expected
> > population marginal means of treatment 1 and group 2 are not estimable
> > either, unless you're willing to assume a no-interaction model.
> >
> > Chapters 13 and 14 of Milliken and Johnson's Analysis of Messy Data
> > (vol. 1) cover this topic in some detail, but it assumes you're
> > familiar with the matrix form of a linear statistical model. Both
> > chapters cover the two-way model with interaction - Ch.13 from the
> > cell means model approach and Ch. 14 from the model effects approach.
> > Because this was written in the mid 80s and republished in the early
> > 90s, all the code used is in SAS.
> >
> > HTH,
> > Dennis
> >
>
> >> Thanks David. The only category that has no cases is "treat 1-group 2":
> >>
> >>> with(test,table(treat,group))
> >> group
> >> treat 1 2
> >> 1 8 0
> >> 2 1 5
> >> 3 5 5
> >> 4 7 3
> >> 5 7 4
> >> 6 3 3
> >> 7 8 2
> >>
> >> But why is the coefficient for "treat 7-group 2" not estimable?
> >>
> >> Thanks
> >>
> >> John
> >>
> >>
> >>
> >>
> >> ________________________________
> >> From: David Winsemius <dwinsemius at comcast.net>
> >>
> >> Cc: "r-help at r-project.org" <r-help at r-project.org>
> >> Sent: Monday, November 7, 2011 5:13 PM
> >> Subject: Re: [R] why NA coefficients
> >>
> >>
> >> On Nov 7, 2011, at 7:33 PM, array chip wrote:
> >>
> >>> Hi, I am trying to run ANOVA with an interaction term on 2 factors (treat
> >>> has 7 levels, group has 2 levels). I found the coefficient for the last
> >>> interaction term is always NA; see the attached dataset and the code below:
> >>>
> >>>> test<-read.table("test.txt",sep='\t',header=T,row.names=NULL)
> >>>> lm(y~factor(treat)*factor(group),test)
> >>>
> >>> Call:
> >>> lm(formula = y ~ factor(treat) * factor(group), data = test)
> >>>
> >>> Coefficients:
> >>>   (Intercept)                     0.429244
> >>>   factor(treat)2                  0.499982
> >>>   factor(treat)3                  0.352971
> >>>   factor(treat)4                 -0.204752
> >>>   factor(treat)5                  0.142042
> >>>   factor(treat)6                  0.044155
> >>>   factor(treat)7                 -0.007775
> >>>   factor(group)2                 -0.337907
> >>>   factor(treat)2:factor(group)2  -0.208734
> >>>   factor(treat)3:factor(group)2  -0.195138
> >>>   factor(treat)4:factor(group)2   0.800029
> >>>   factor(treat)5:factor(group)2   0.227514
> >>>   factor(treat)6:factor(group)2   0.331548
> >>>   factor(treat)7:factor(group)2         NA
> >>>
> >>>
> >>> I guess this is due to the model matrix being singular, i.e. collinearity
> >>> among the matrix columns? But I can't figure out how the matrix is singular
> >>> in this case. Can someone show me why?
> >>
> >> Because you have no cases in one of the crossed categories.
> >>
> >> --David Winsemius, MD
> >> West Hartford, CT
> >>
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >>
> >
> >
> >