[Rd] Bug in model.matrix.default for higher-order interaction encoding when specific model terms are missing

Thu Nov 2 07:51:13 CET 2017

Hello Tyler,

Thank you for searching for, and finding, the basic description of the
behavior of R in this matter.

I think your example is in agreement with the book.

But let me first note the following. You write: "F_j refers to a
factor (variable) in a model and not a categorical factor". However:
"a factor is a vector object used to specify a discrete
classification" (start of chapter 4 of "An Introduction to R".) You
might also see the description of the R function factor().

You note that the book says about a factor F_j:
  "... F_j is coded by contrasts if T_{i(j)} has appeared in the
formula and by dummy variables if it has not"

You find:
   "However, the example I gave demonstrated that this dummy variable
encoding only occurs for the model where the missing term is the
numeric-numeric interaction, ~(X1+X2+X3)^3-X1:X2."

We have here T_i = X1:X2:X3. Also: F_j = X3 (the only factor). Then
T_{i(j)} = X1:X2, which is dropped from the model. Hence the X3 in T_i
must be encoded by dummy variables, as indeed it is.

  Arie

On Tue, Oct 31, 2017 at 4:01 PM, Tyler <tylermw at gmail.com> wrote:
> Hi Arie,
>
> Thank you for your further research into the issue.
>
> Regarding Stata: On the other hand, JMP gives model matrices that use the
> main effects contrasts in computing the higher order interactions, without
> the dummy variable encoding. I verified this both by analyzing the linear
> model given in my first example and noting that JMP has one more degree of
> freedom than R for the same model, as well as looking at the generated model
> matrices. It's easy to find a design where JMP will allow us fit our model
> with goodness-of-fit estimates and R will not due to the extra degree(s) of
> freedom required. Let's keep the conversation limited to R.
>
> I want to refocus back onto my original bug report, which was not for a
> missing main effects term, but rather for a missing lower-order interaction
> term. The behavior of model.matrix.default() for a missing main effects term
> is a nice example to demonstrate how model.matrix encodes with dummy
> variables instead of contrasts, but doesn't demonstrate the inconsistent
> behavior my bug report highlighted.
>
> I went looking for documentation on this behavior, and the issue stems not
> from model.matrix.default(), but rather the terms() function in interpreting
> the formula. This "clever" replacement of contrasts by dummy variables to
> maintain marginality (presuming that's the reason) is not described anywhere
> in the documentation for either the model.matrix() or the terms() function.
> In order to find a description for the behavior, I had to look in the
> underlying C code, buried above the "TermCode" function of the "model.c"
> file, which says:
>
> "TermCode decides on the encoding of a model term. Returns 1 if variable
> ``whichBit'' in ``thisTerm'' is to be encoded by contrasts and 2 if it is to
> be encoded by dummy variables.  This is decided using the heuristic
> described in Statistical Models in S, page 38."
>
> I do not have a copy of this book, and I suspect most R users do not as
> well. Thankfully, however, some of the pages describing this behavior were
> available as part of Amazon's "Look Inside" feature--but if not for that, I
> would have no idea what heuristic R was using. Since those pages could made
> unavailable by Amazon at any time, at the very least we have an problem with
> a lack of documentation.
>
> However, I still believe there is a bug when comparing R's implementation to
> the heuristic described in the book. From Statistical Models in S, page
> 38-39:
>
> "Suppose F_j is any factor included in term T_i. Let T_{i(j)} denote the
> margin of T_i for factor F_j--that is, the term obtained by dropping F_j
> from T_i. We say that T_{i(j)} has appeared in the formula if there is some
> term T_i' for i' < i such that T_i' contains all the factors appearing in
> T_{i(j)}. The usual case is that T_{i(j)} itself is one of the preceding
> terms. Then F_j is coded by contrasts if T_{i(j)} has appeared in the
> formula and by dummy variables if it has not"
>
> Here, F_j refers to a factor (variable) in a model and not a categorical
> factor, as specified later in that section (page 40): "Numeric variables
> appear in the computations as themselves, uncoded. Therefore, the rule does
> not do anything special for them, and it remains valid, in a trivial sense,
> whenever any of the F_j is numeric rather than categorical."
>
> Going back to my original example with three variables: X1 (numeric), X2
> (numeric), X3 (categorical). This heuristic prescribes encoding X1:X2:X3
> with contrasts as long as X1:X2, X1:X3, and X2:X3 exist in the formula. When
> any of the preceding terms do not exist, this heuristic tells us to use
> dummy variables to encode the interaction (e.g. "F_j [the interaction term]
> is coded ... by dummy variables if it [any of the marginal terms obtained by
> dropping a single factor in the interaction] has not [appeared in the
> formula]"). However, the example I gave demonstrated that this dummy
> variable encoding only occurs for the model where the missing term is the
> numeric-numeric interaction, "~(X1+X2+X3)^3-X1:X2". Otherwise, the
> interaction term X1:X2:X3 is encoded by contrasts, not dummy variables. This
> is inconsistent with the description of the intended behavior given in the
> book.
>
> Best regards,
> Tyler
>
>
> On Fri, Oct 27, 2017 at 2:18 PM, Arie ten Cate <arietencate at gmail.com>
> wrote:
>>
>> Hello Tyler,
>>
>> I want to bring to your attention the following document: "What
>> happens if you omit the main effect in a regression model with an
>> interaction?"
>> (https://stats.idre.ucla.edu/stata/faq/what-happens-if-you-omit-the-main-effect-in-a-regression-model-with-an-interaction).
>> This gives a useful review of the problem. Your example is Case 2: a
>> continuous and a categorical regressor.
>>
>> The numerical examples are coded in Stata, and they give the same
>> result as in R. Hence, if this is a bug in R then it is also a bug in
>> Stata. That seems very unlikely.
>>
>> Here is a simulation in R of the above mentioned Case 2 in Stata:
>>
>> df <- expand.grid(socst=c(-1:1),grp=c("1","2","3","4"))
>> print("Full model")
>> print(model.matrix(~(socst+grp)^2 ,data=df))
>> print("Example 2.1: drop socst")
>> print(model.matrix(~(socst+grp)^2 -socst ,data=df))
>> print("Example 2.2: drop grp")
>> print(model.matrix(~(socst+grp)^2 -grp ,data=df))
>>
>> This gives indeed the following regressors:
>>
>> "Full model"
>> (Intercept) socst grp2 grp3 grp4 socst:grp2 socst:grp3 socst:grp4
>> "Example 2.1: drop socst"
>> (Intercept) grp2 grp3 grp4 socst:grp1 socst:grp2 socst:grp3 socst:grp4
>> "Example 2.2: drop grp"
>> (Intercept) socst socst:grp2 socst:grp3 socst:grp4
>>
>> There is a little bit of R documentation about this, based on the
>> concept of marginality, which typically forbids a model having an
>> interaction but not the corresponding main effects. (You might see the
>> references in https://en.wikipedia.org/wiki/Principle_of_marginality )
>>     See "An Introduction to R", by Venables and Smith and the R Core
>> Team. At the bottom of page 52 (PDF: 57) it says: "Although the
>> details are complicated, model formulae in R will normally generate
>> the models that an expert statistician would expect, provided that
>> marginality is preserved. Fitting, for [a contrary] example, a model
>> with an interaction but not the corresponding main effects will in
>> general lead to surprising results ....".
>>     The Reference Manual states that the R functions dropterm() and
>> addterm() resp. drop or add only terms such that marginality is
>> preserved.
>>
>> Finally, about your singular matrix t(mm)%*%mm. This is in fact
>> Example 2.1 in Case 2 discussed above. As discussed there, in Stata
>> and in R the drop of the continuous variable has no effect on the
>> degrees of freedom here: it is just a reparameterisation of the full
>> model, protecting you against losing marginality... Hence the
>> model.matrix 'mm' is still square and nonsingular after the drop of
>> X1, unless of course when a row is removed from the matrix 'design'
>> when before creating 'mm'.
>>
>>     Arie
>>
>> On Sun, Oct 15, 2017 at 7:05 PM, Tyler <tylermw at gmail.com> wrote:
>> > You could possibly try to explain away the behavior for a missing main
>> > effects term, since without the main effects term we don't have main
>> > effect
>> > columns in the model matrix used to compute the interaction columns (At
>> > best this is undocumented behavior--I still think it's a bug, as we know
>> > how we would encode the categorical factors if they were in fact
>> > present.
>> > It's either specified in contrasts.arg or using the default set in
>> > options). However, when all the main effects are present, why would the
>> > three-factor interaction column not simply be the product of the main
>> > effect columns? In my example: we know X1, we know X2, and we know X3.
>> > Why
>> > does the encoding of X1:X2:X3 depend on whether we specified a
>> > two-factor
>> > interaction, AND only changes for specific missing interactions?
>> >
>> > In addition, I can use a two-term example similar to yours to show how
>> > this
>> > behavior results in a singular covariance matrix when, given the desired
>> > factor encoding, it should not be singular.
>> >
>> > We start with a full factorial design for a two-level continuous factor
>> > and
>> > a three-level categorical factor, and remove a single row. This design
>> > matrix does not leave enough degrees of freedom to determine
>> > goodness-of-fit, but should allow us to obtain parameter estimates.
>> >
>> >> design = expand.grid(X1=c(1,-1),X2=c("A","B","C"))
>> >> design = design[-1,]
>> >> design
>> >   X1 X2
>> > 2 -1  A
>> > 3  1  B
>> > 4 -1  B
>> > 5  1  C
>> > 6 -1  C
>> >
>> > Here, we first calculate the model matrix for the full model, and then
>> > manually remove the X1 column from the model matrix. This gives us the
>> > model matrix one would expect if X1 were removed from the model. We then
>> > successfully calculate the covariance matrix.
>> >
>> >> mm = model.matrix(~(X1+X2)^2,data=design)
>> >> mm
>> >   (Intercept) X1 X2B X2C X1:X2B X1:X2C
>> > 2           1 -1   0   0      0      0
>> > 3           1  1   1   0      1      0
>> > 4           1 -1   1   0     -1      0
>> > 5           1  1   0   1      0      1
>> > 6           1 -1   0   1      0     -1
>> >
>> >> mm = mm[,-2]
>> >> solve(t(mm) %*% mm)
>> >             (Intercept)  X2B  X2C X1:X2B X1:X2C
>> > (Intercept)           1 -1.0 -1.0    0.0    0.0
>> > X2B                  -1  1.5  1.0    0.0    0.0
>> > X2C                  -1  1.0  1.5    0.0    0.0
>> > X1:X2B                0  0.0  0.0    0.5    0.0
>> > X1:X2C                0  0.0  0.0    0.0    0.5
>> >
>> > Here, we see the actual behavior for model.matrix. The undesired
>> > re-coding
>> > of the model matrix interaction term makes the information matrix
>> > singular.
>> >
>> >> mm = model.matrix(~(X1+X2)^2-X1,data=design)
>> >> mm
>> >   (Intercept) X2B X2C X1:X2A X1:X2B X1:X2C
>> > 2           1   0   0     -1      0      0
>> > 3           1   1   0      0      1      0
>> > 4           1   1   0      0     -1      0
>> > 5           1   0   1      0      0      1
>> > 6           1   0   1      0      0     -1
>> >
>> >> solve(t(mm) %*% mm)
>> > Error in solve.default(t(mm) %*% mm) : system is computationally
>> > singular:
>> > reciprocal condition number = 5.55112e-18
>> >
>> > I still believe this is a bug.
>> >
>> > Best regards,
>> > Tyler Morgan-Wall
>> >
>> > On Sun, Oct 15, 2017 at 1:49 AM, Arie ten Cate <arietencate at gmail.com>
>> > wrote:
>> >
>> >> I think it is not a bug. It is a general property of interactions.
>> >> This property is best observed if all variables are factors
>> >> (qualitative).
>> >>
>> >> For example, you have three variables (factors). You ask for as many
>> >> interactions as possible, except an interaction term between two
>> >> particular variables. When this interaction is not a constant, it is
>> >> different for different values of the remaining variable. More
>> >> precisely: for all values of that variable. In other words: you have a
>> >> three-way interaction, with all values of that variable.
>> >>
>> >> An even smaller example is the following script with only two
>> >> variables, each being a factor:
>> >>
>> >>  df <- expand.grid(X1=c("p","q"), X2=c("A","B","C"))
>> >>  print(model.matrix(~(X1+X2)^2    ,data=df))
>> >>  print(model.matrix(~(X1+X2)^2 -X1,data=df))
>> >>  print(model.matrix(~(X1+X2)^2 -X2,data=df))
>> >>
>> >> The result is:
>> >>
>> >>   (Intercept) X1q X2B X2C X1q:X2B X1q:X2C
>> >> 1           1   0   0   0       0       0
>> >> 2           1   1   0   0       0       0
>> >> 3           1   0   1   0       0       0
>> >> 4           1   1   1   0       1       0
>> >> 5           1   0   0   1       0       0
>> >> 6           1   1   0   1       0       1
>> >>
>> >>   (Intercept) X2B X2C X1q:X2A X1q:X2B X1q:X2C
>> >> 1           1   0   0       0       0       0
>> >> 2           1   0   0       1       0       0
>> >> 3           1   1   0       0       0       0
>> >> 4           1   1   0       0       1       0
>> >> 5           1   0   1       0       0       0
>> >> 6           1   0   1       0       0       1
>> >>
>> >>   (Intercept) X1q X1p:X2B X1q:X2B X1p:X2C X1q:X2C
>> >> 1           1   0       0       0       0       0
>> >> 2           1   1       0       0       0       0
>> >> 3           1   0       1       0       0       0
>> >> 4           1   1       0       1       0       0
>> >> 5           1   0       0       0       1       0
>> >> 6           1   1       0       0       0       1
>> >>
>> >> Thus, in the second result, we have no main effect of X1. Instead, the
>> >> effect of X1 depends on the value of X2; either A or B or C. In fact,
>> >> this is a two-way interaction, including all three values of X2. In
>> >> the third result, we have no main effect of X2, The effect of X2
>> >> depends on the value of X1; either p or q.
>> >>
>> >> A complicating element with your example seems to be that your X1 and
>> >> X2 are not factors.
>> >>
>> >>    Arie
>> >>
>> >> On Thu, Oct 12, 2017 at 7:12 PM, Tyler <tylermw at gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > I recently ran into an inconsistency in the way model.matrix.default
>> >> > handles factor encoding for higher level interactions with
>> >> > categorical
>> >> > variables when the full hierarchy of effects is not present.
>> >> > Depending on
>> >> > which lower level interactions are specified, the factor encoding
>> >> > changes
>> >> > for a higher level interaction. Consider the following minimal
>> >> reproducible
>> >> > example:
>> >> >
>> >> > --------------
>> >> >
>> >> >> runmatrix = expand.grid(X1=c(1,-1),X2=c(1,-1),X3=c("A","B","C"))>
>> >> model.matrix(~(X1+X2+X3)^3,data=runmatrix)   (Intercept) X1 X2 X3B X3C
>> >> X1:X2 X1:X3B X1:X3C X2:X3B X2:X3C X1:X2:X3B X1:X2:X3C
>> >> > 1            1  1  1   0   0     1      0      0      0      0
>> >> > 0         0
>> >> > 2            1 -1  1   0   0    -1      0      0      0      0
>> >> > 0         0
>> >> > 3            1  1 -1   0   0    -1      0      0      0      0
>> >> > 0         0
>> >> > 4            1 -1 -1   0   0     1      0      0      0      0
>> >> > 0         0
>> >> > 5            1  1  1   1   0     1      1      0      1      0
>> >> > 1         0
>> >> > 6            1 -1  1   1   0    -1     -1      0      1      0
>> >> > -1         0
>> >> > 7            1  1 -1   1   0    -1      1      0     -1      0
>> >> > -1         0
>> >> > 8            1 -1 -1   1   0     1     -1      0     -1      0
>> >> > 1         0
>> >> > 9            1  1  1   0   1     1      0      1      0      1
>> >> > 0         1
>> >> > 10           1 -1  1   0   1    -1      0     -1      0      1
>> >> > 0        -1
>> >> > 11           1  1 -1   0   1    -1      0      1      0     -1
>> >> > 0        -1
>> >> > 12           1 -1 -1   0   1     1      0     -1      0     -1
>> >> > 0         1
>> >> > attr(,"assign")
>> >> >  [1] 0 1 2 3 3 4 5 5 6 6 7 7
>> >> > attr(,"contrasts")
>> >> > attr(,"contrasts")$X3
>> >> > [1] "contr.treatment"
>> >> >
>> >> > --------------
>> >> >
>> >> > Specifying the full hierarchy gives us what we expect: the
>> >> > interaction
>> >> > columns are simply calculated from the product of the main effect
>> >> columns.
>> >> > The interaction term X1:X2:X3 gives us two columns in the model
>> >> > matrix,
>> >> > X1:X2:X3B and X1:X2:X3C, matching the products of the main effects.
>> >> >
>> >> > If we remove either the X2:X3 interaction or the X1:X3 interaction,
>> >> > we
>> >> get
>> >> > what we would expect for the X1:X2:X3 interaction, but when we remove
>> >> > the
>> >> > X1:X2 interaction the encoding for X1:X2:X3 changes completely:
>> >> >
>> >> > --------------
>> >> >
>> >> >> model.matrix(~(X1+X2+X3)^3-X1:X3,data=runmatrix)   (Intercept) X1 X2
>> >> X3B X3C X1:X2 X2:X3B X2:X3C X1:X2:X3B X1:X2:X3C
>> >> > 1            1  1  1   0   0     1      0      0         0         0
>> >> > 2            1 -1  1   0   0    -1      0      0         0         0
>> >> > 3            1  1 -1   0   0    -1      0      0         0         0
>> >> > 4            1 -1 -1   0   0     1      0      0         0         0
>> >> > 5            1  1  1   1   0     1      1      0         1         0
>> >> > 6            1 -1  1   1   0    -1      1      0        -1         0
>> >> > 7            1  1 -1   1   0    -1     -1      0        -1         0
>> >> > 8            1 -1 -1   1   0     1     -1      0         1         0
>> >> > 9            1  1  1   0   1     1      0      1         0         1
>> >> > 10           1 -1  1   0   1    -1      0      1         0        -1
>> >> > 11           1  1 -1   0   1    -1      0     -1         0        -1
>> >> > 12           1 -1 -1   0   1     1      0     -1         0         1
>> >> > attr(,"assign")
>> >> >  [1] 0 1 2 3 3 4 5 5 6 6
>> >> > attr(,"contrasts")
>> >> > attr(,"contrasts")$X3
>> >> > [1] "contr.treatment"
>> >> >
>> >> >
>> >> >
>> >> >> model.matrix(~(X1+X2+X3)^3-X2:X3,data=runmatrix)   (Intercept) X1 X2
>> >> X3B X3C X1:X2 X1:X3B X1:X3C X1:X2:X3B X1:X2:X3C
>> >> > 1            1  1  1   0   0     1      0      0         0         0
>> >> > 2            1 -1  1   0   0    -1      0      0         0         0
>> >> > 3            1  1 -1   0   0    -1      0      0         0         0
>> >> > 4            1 -1 -1   0   0     1      0      0         0         0
>> >> > 5            1  1  1   1   0     1      1      0         1         0
>> >> > 6            1 -1  1   1   0    -1     -1      0        -1         0
>> >> > 7            1  1 -1   1   0    -1      1      0        -1         0
>> >> > 8            1 -1 -1   1   0     1     -1      0         1         0
>> >> > 9            1  1  1   0   1     1      0      1         0         1
>> >> > 10           1 -1  1   0   1    -1      0     -1         0        -1
>> >> > 11           1  1 -1   0   1    -1      0      1         0        -1
>> >> > 12           1 -1 -1   0   1     1      0     -1         0         1
>> >> > attr(,"assign")
>> >> >  [1] 0 1 2 3 3 4 5 5 6 6
>> >> > attr(,"contrasts")
>> >> > attr(,"contrasts")$X3
>> >> > [1] "contr.treatment"
>> >> >
>> >> >
>> >> >> model.matrix(~(X1+X2+X3)^3-X1:X2,data=runmatrix)   (Intercept) X1 X2
>> >> X3B X3C X1:X3B X1:X3C X2:X3B X2:X3C X1:X2:X3A X1:X2:X3B X1:X2:X3C
>> >> > 1            1  1  1   0   0      0      0      0      0         1
>> >> >     0         0
>> >> > 2            1 -1  1   0   0      0      0      0      0        -1
>> >> >     0         0
>> >> > 3            1  1 -1   0   0      0      0      0      0        -1
>> >> >     0         0
>> >> > 4            1 -1 -1   0   0      0      0      0      0         1
>> >> >     0         0
>> >> > 5            1  1  1   1   0      1      0      1      0         0
>> >> >     1         0
>> >> > 6            1 -1  1   1   0     -1      0      1      0         0
>> >> >    -1         0
>> >> > 7            1  1 -1   1   0      1      0     -1      0         0
>> >> >    -1         0
>> >> > 8            1 -1 -1   1   0     -1      0     -1      0         0
>> >> >     1         0
>> >> > 9            1  1  1   0   1      0      1      0      1         0
>> >> >     0         1
>> >> > 10           1 -1  1   0   1      0     -1      0      1         0
>> >> >     0        -1
>> >> > 11           1  1 -1   0   1      0      1      0     -1         0
>> >> >     0        -1
>> >> > 12           1 -1 -1   0   1      0     -1      0     -1         0
>> >> >     0         1
>> >> > attr(,"assign")
>> >> >  [1] 0 1 2 3 3 4 4 5 5 6 6 6
>> >> > attr(,"contrasts")
>> >> > attr(,"contrasts")$X3
>> >> > [1] "contr.treatment"
>> >> >
>> >> > --------------
>> >> >
>> >> > Here, we now see the encoding for the interaction X1:X2:X3 is now the
>> >> > interaction of X1 and X2 with a new encoding for X3 where each factor
>> >> level
>> >> > is represented by its own column. I would expect, given the two
>> >> > column
>> >> > dummy variable encoding for X3, that the X1:X2:X3 column would also
>> >> > be
>> >> two
>> >> > columns regardless of what two-factor interactions we also specified,
>> >> > but
>> >> > in this case it switches to three. If other two factor interactions
>> >> > are
>> >> > missing in addition to X1:X2, this issue still occurs. This also
>> >> > happens
>> >> > regardless of the contrast specified in contrasts.arg for X3. I don't
>> >> > see
>> >> > any reasoning for this behavior given in the documentation, so I
>> >> > suspect
>> >> it
>> >> > is a bug.
>> >> >
>> >> > Best regards,
>> >> > Tyler Morgan-Wall
>> >> >
>> >> >         [[alternative HTML version deleted]]
>> >> >
>> >> > ______________________________________________
>> >> > R-devel at r-project.org mailing list
>> >> > https://stat.ethz.ch/mailman/listinfo/r-devel
>> >>
>> >
>> >         [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-devel at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
>