[R] Why does the order of terms in a formula translate into different models/ model matrices?

Tue May 1 15:02:44 CEST 2012

I know this is three months old, but I had put it on a mental TODO list to see whether it was something that could be fixed, and I finally got a little spare time for that sort of thing. It turns out that it can NOT be fixed (without fundamental design changes), so I thought I'd better record the conclusion for the archives.

The reason is pretty simple: It is terms.formula that decides which terms require contrast coding and which need dummy coding. It does this based on the formula structure _only_; i.e., it has no information on whether variables are factors or numerical. E.g., you can do

> terms(~ a:x + a:b)
~a:x + a:b
attr(,"variables")
list(a, x, b)
attr(,"factors")
  a:x a:b
a   2   2
x   2   0
b   0   1
attr(,"term.labels")
[1] "a:x" "a:b"
attr(,"order")
[1] 2 2
attr(,"intercept")
[1] 1
attr(,"response")
[1] 0
attr(,".Environment")
<environment: R_GlobalEnv>

in which there is no indication to terms() what kind of variable "x" is. So if you want different behavior if x is a factor than if it is continuous, you're out of luck...

Not quite sure the last three lines of my note below make sense, though.

-pd

On Jan 29, 2012, at 11:24 , peter dalgaard wrote:

> 
> On Jan 29, 2012, at 02:42 , Ben Bolker wrote:
>>> 
>> 
>> (quoting removed to make Gmane happy)
>> 
>> AFAICS, this is a bug.
>> 
>> 
>> I think so too, although I haven't got my head around it yet.
>> 
>> Chuck, are you willing to post a summary of this to r-devel
>> for discussion ... and/or post a bug report?
> 
> You have to be very specific about what the bug is, if any... I.e., which precisely are the rules that are broken by the current behavior? 
> 
> Preferably also suggest a fix --- the corner cases of model.matrix and friends is some of the more impenetrable code in the R sources.
> 
> Notice that order dependent parametrization of terms is not a bug per se, nor is the automatic switch to dummy coding of factors. Consider these cases:
> 
> d <- cbind(expand.grid(a=c("a","b"),b=c("X","Y"),c=c("U","W")),x=1:8)
> model.matrix(~ a:b + a:c, d)
> model.matrix(~ a:c + a:b, d)
> model.matrix(~ a:b + a:x, d)
> model.matrix(~ a:x + a:b, d)
> 
> and notice that the logic applying to "x" is the same as that applying to "c". 
> 
> The crux seems to be that the model ~a:c contains the model ~a whereas ~a:x does not, and hence the rationale for _not_ expanding a subsequent a:b term to dummies (namely that ~a is "already in" the model) fails. 
> 
> -- 
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
> 
> 
> 
> 
> 
> 
> 
> 

-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com