[R] About the interaction A:B

Fri Mar 5 19:16:33 CET 2010

On Fri, Mar 5, 2010 at 11:41 AM, Jeff Laake <Jeff.Laake at noaa.gov> wrote:
> On 3/5/2010 9:19 AM, Frank E Harrell Jr wrote:
>>
>> You neglected to state your name and affiliation, and your question
>> demonstrates an allergy to R documentation.
>>
>> Frank
>
> I agree with Frank but will try to answer some of your questions as I
> understand it.
>
> First, model.matrix uses the options$contrasts or what is set specifically
> as the contrast for a factor using contrasts().  The default is
> treatment contrasts and for a factor that means the first level of a factor
> variable is the intercept and the remainder are "treatment" effects
> which are added to intercept.  If you specify  Y~A+B+A:B, you are specifying
> the model with main effects (A, B) and interactions (A:B).

So my interpretation of how R does internally when dealing with
interaction terms A:B is to look if each individual term (each sub
interactions in highway interactions like, A:B, A:C, etc., in A:B:C:D)
appears in the formula or not, and deciding how to construct the
columns of the model matrix according to A:B, right?

This seem quite complicated when a formula is arbitrarily complex, let
along different type of data (namely ordered, not ordered, numerical
numbers). The piece of information that I feel that is still missing
to me is the detailed procedure that R generate model.matrix for
arbitrary complex formulas and how it is proved the way that R does is
always correct for these complex formulas.

> If A has m levels and B has n levels then there will be an intercept (1),
> m-1 for A, n-1 for B and (m-1)*(n-1) for the interaction.
> If you specify the model as Y~A:B it will specify the model fully with
> interactions which will have m*n separate parameters and none will be
> NA as long as you have data in each of the m*n cells.  It makes absolutely
> no sense to me to have an intercept and then all but one of the
> remaining interactions included.

I think that 'Y ~ A:B' indeed has the intercept term unless '-1' is
included. You may need to adjust your interpretation of A:B and A:B -1
in these cases.
> dim(my_subset(model.matrix(Y ~ A:B - 1,fr)))
[1] 12 12
> dim(my_subset(model.matrix(Y ~ A:B,fr)))
[1] 12 13

> Note that you'll get quite different results if A and/or B are not factor
> variables.
>
> --jeff
>>
>> blue sky wrote:
>>>
>>> The following is the code for the model.matrix. But it still doesn't
>>> answer why A:B is interpreted differently in Y~A+B+A:B and Y~A:B. By
>>> 'why', I mean how R internally does it and what is the rational behind
>>> the way of doing it?
>>>
>>> And it didn't answer why in the model.matrix of Y~A, there are a-1
>>> terms from A plus the intercept, but in the model.matrix of Y~A:B,
>>> there are a*b terms (rather than a*b-1 terms) plus the intercept.
>>> Since the one coefficient of the lm of Y~A:B is going to be NA, why
>>> bother to include the corresponding term in the model matrix?
>>>
>>> ############code below
>>>
>>> set.seed(0)
>>>
>>> a=3
>>> b=4
>>>
>>> AB_effect=data.frame(
>>>  name=paste(
>>>    unlist(
>>>      do.call(
>>>        rbind
>>>        , rep(list(paste('A', letters[1:a],sep='')), b)
>>>        )
>>>      )
>>>    , unlist(
>>>      do.call(
>>>        cbind
>>>        , rep(list(paste('B', letters[1:b],sep='')), a)
>>>        )
>>>      )
>>>    , sep=':'
>>>    )
>>>  , value=rnorm(a*b)
>>>  , stringsAsFactors=F
>>>  )
>>>
>>> max_n=10
>>> n=sample.int(max_n, a*b, replace=T)
>>>
>>> AB=mapply(function(name, n){rep(name,n)}, AB_effect$name, n)
>>>
>>> Y=AB_effect$value[match(unlist(AB), AB_effect$name)]
>>>
>>> Y=Y+a*b*rnorm(length(Y))
>>>
>>> sub_fr=as.data.frame(do.call(rbind, strsplit(unlist(AB), ':')))
>>> rownames(sub_fr)=NULL
>>> colnames(sub_fr)=c('A', 'B')
>>>
>>> fr=data.frame(Y=Y,sub_fr)
>>>
>>> my_subset=function(amm) {
>>>  coding=apply(
>>>    amm
>>>    ,1
>>>    , function(x) {
>>>      paste(x, collapse='')
>>>    }
>>>    )
>>>  amm[match(unique(coding), coding),]
>>> }
>>>
>>> my_subset(model.matrix(Y ~ A*B,fr))
>>> my_subset(model.matrix(Y ~ (A+B)^2,fr))
>>> my_subset(model.matrix(Y ~ A + B + A:B,fr))
>>> my_subset(model.matrix(Y ~ A:B - 1,fr))
>>> my_subset(model.matrix(Y ~ A:B,fr))
>>>
>>> On Fri, Mar 5, 2010 at 8:45 AM, Gabor Grothendieck
>>> <ggrothendieck at gmail.com> wrote:
>>>>
>>>> The way to understand this is to look at the output of model.matrix:
>>>>
>>>> model.matrix(fo, fr)
>>>>
>>>> for each formula you tried.  If your data is large you will have to
>>>> use a subset not to be overwhelmed with output.
>>>>
>>>> On Fri, Mar 5, 2010 at 9:08 AM, blue sky <bluesky315 at gmail.com> wrote:
>>>>>
>>>>> Suppose, 'fr' is data.frame with columns 'Y', 'A' and 'B'. 'A' has
>>>>> levels 'Aa'
>>>>> 'Ab' and 'Ac', and 'B' has levels 'Ba', 'Bb', 'Bc' and 'Bd'. 'Y'
>>>>> columns are numbers.
>>>>>
>>>>> I tried the following three sets of commands. I understand that A*B is
>>>>> equivalent to A+B+A:B. However, A:B in A+B+A:B is different from A:B
>>>>> just by itself (see the 3rd and 4th set of commands). Would you please
>>>>> help me understand why the meanings of A:B are different in different
>>>>> contexts?
>>>>>
>>>>> I also see the coefficient of AAc:BBd is NA (the last set of
>>>>> commands). I'm wondering why this coefficient is not removed from the
>>>>> 'coefficients' vector. Since lm(Y~A) has coefficients for (intercept),
>>>>> Ab, Ac (there are no NA's), I think that it is reasonable to make sure
>>>>> that there are no NA's when there are interaction terms, namely, A:B
>>>>> in this case.
>>>>>
>>>>> Thank you for answering my questions!
>>>>>
>>>>>> alm=lm(Y ~ A*B,fr)
>>>>>> alm$coefficients
>>>>>
>>>>> (Intercept)         AAb         AAc         BBb         BBc         BBd
>>>>>  -3.548176   -2.086586    7.003857    4.367800   11.887356   -3.470840
>>>>>  AAb:BBb     AAc:BBb     AAb:BBc     AAc:BBc     AAb:BBd     AAc:BBd
>>>>>  5.160865  -11.858425  -12.853116  -20.289611    6.727401   -2.327173
>>>>>>
>>>>>> alm=lm(Y ~ A + B + A:B,fr)
>>>>>> alm$coefficients
>>>>>
>>>>> (Intercept)         AAb         AAc         BBb         BBc         BBd
>>>>>  -3.548176   -2.086586    7.003857    4.367800   11.887356   -3.470840
>>>>>  AAb:BBb     AAc:BBb     AAb:BBc     AAc:BBc     AAb:BBd     AAc:BBd
>>>>>  5.160865  -11.858425  -12.853116  -20.289611    6.727401   -2.327173
>>>>>>
>>>>>> alm=lm(Y ~ A:B - 1,fr)
>>>>>> alm$coefficients
>>>>>
>>>>>  AAa:BBa    AAb:BBa    AAc:BBa    AAa:BBb    AAb:BBb    AAc:BBb
>>>>>  AAa:BBc
>>>>> -3.5481765 -5.6347625  3.4556808  0.8196231  3.8939016 -4.0349449
>>>>>  8.3391795
>>>>>  AAb:BBc    AAc:BBc    AAa:BBd    AAb:BBd    AAc:BBd
>>>>> -6.6005222 -4.9465744 -7.0190168 -2.3782017 -2.3423322
>>>>>>
>>>>>> alm=lm(Y ~ A:B,fr)
>>>>>> alm$coefficients
>>>>>
>>>>> (Intercept)     AAa:BBa     AAb:BBa     AAc:BBa     AAa:BBb     AAb:BBb
>>>>> -2.34233221 -1.20584424 -3.29243033  5.79801305  3.16195534  6.23623377
>>>>>  AAc:BBb     AAa:BBc     AAb:BBc     AAc:BBc     AAa:BBd     AAb:BBd
>>>>> -1.69261273 10.68151168 -4.25819000 -2.60424217 -4.67668454 -0.03586951
>>>>>  AAc:BBd
>>>>>       NA
>>>>>
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>