[R] About the interaction A:B

Jeff Laake Jeff.Laake at noaa.gov
Fri Mar 5 19:32:32 CET 2010


You are correct that you need to use  ~-1+A:B.  I use that all the time 
and just spaced it out when I was writing
the response.  Using ~A:B will produce one too many columns.

Didn't follow your other question.  You can always look at the result of 
model.matrix to see if it is correct or dig into the
internal code.  Personally, I've never found it to have a problem.

--jeff

On 3/5/2010 10:16 AM, blue sky wrote:
> On Fri, Mar 5, 2010 at 11:41 AM, Jeff Laake<Jeff.Laake at noaa.gov>  wrote:
>    
>> On 3/5/2010 9:19 AM, Frank E Harrell Jr wrote:
>>      
>>> You neglected to state your name and affiliation, and your question
>>> demonstrates an allergy to R documentation.
>>>
>>> Frank
>>>        
>> I agree with Frank but will try to answer some of your questions as I
>> understand it.
>>
>> First, model.matrix uses the options$contrasts or what is set specifically
>> as the contrast for a factor using contrasts().  The default is
>> treatment contrasts and for a factor that means the first level of a factor
>> variable is the intercept and the remainder are "treatment" effects
>> which are added to intercept.  If you specify  Y~A+B+A:B, you are specifying
>> the model with main effects (A, B) and interactions (A:B).
>>      
> So my interpretation of how R does internally when dealing with
> interaction terms A:B is to look if each individual term (each sub
> interactions in highway interactions like, A:B, A:C, etc., in A:B:C:D)
> appears in the formula or not, and deciding how to construct the
> columns of the model matrix according to A:B, right?
>
> This seem quite complicated when a formula is arbitrarily complex, let
> along different type of data (namely ordered, not ordered, numerical
> numbers). The piece of information that I feel that is still missing
> to me is the detailed procedure that R generate model.matrix for
> arbitrary complex formulas and how it is proved the way that R does is
> always correct for these complex formulas.
>
>    
>> If A has m levels and B has n levels then there will be an intercept (1),
>> m-1 for A, n-1 for B and (m-1)*(n-1) for the interaction.
>> If you specify the model as Y~A:B it will specify the model fully with
>> interactions which will have m*n separate parameters and none will be
>> NA as long as you have data in each of the m*n cells.  It makes absolutely
>> no sense to me to have an intercept and then all but one of the
>> remaining interactions included.
>>      
> I think that 'Y ~ A:B' indeed has the intercept term unless '-1' is
> included. You may need to adjust your interpretation of A:B and A:B -1
> in these cases.
>    
>> dim(my_subset(model.matrix(Y ~ A:B - 1,fr)))
>>      
> [1] 12 12
>    
>> dim(my_subset(model.matrix(Y ~ A:B,fr)))
>>      
> [1] 12 13
>
>
>    
>> Note that you'll get quite different results if A and/or B are not factor
>> variables.
>>
>> --jeff
>>      
>>> blue sky wrote:
>>>        
>>>> The following is the code for the model.matrix. But it still doesn't
>>>> answer why A:B is interpreted differently in Y~A+B+A:B and Y~A:B. By
>>>> 'why', I mean how R internally does it and what is the rational behind
>>>> the way of doing it?
>>>>
>>>> And it didn't answer why in the model.matrix of Y~A, there are a-1
>>>> terms from A plus the intercept, but in the model.matrix of Y~A:B,
>>>> there are a*b terms (rather than a*b-1 terms) plus the intercept.
>>>> Since the one coefficient of the lm of Y~A:B is going to be NA, why
>>>> bother to include the corresponding term in the model matrix?
>>>>
>>>> ############code below
>>>>
>>>> set.seed(0)
>>>>
>>>> a=3
>>>> b=4
>>>>
>>>> AB_effect=data.frame(
>>>>   name=paste(
>>>>     unlist(
>>>>       do.call(
>>>>         rbind
>>>>         , rep(list(paste('A', letters[1:a],sep='')), b)
>>>>         )
>>>>       )
>>>>     , unlist(
>>>>       do.call(
>>>>         cbind
>>>>         , rep(list(paste('B', letters[1:b],sep='')), a)
>>>>         )
>>>>       )
>>>>     , sep=':'
>>>>     )
>>>>   , value=rnorm(a*b)
>>>>   , stringsAsFactors=F
>>>>   )
>>>>
>>>> max_n=10
>>>> n=sample.int(max_n, a*b, replace=T)
>>>>
>>>> AB=mapply(function(name, n){rep(name,n)}, AB_effect$name, n)
>>>>
>>>> Y=AB_effect$value[match(unlist(AB), AB_effect$name)]
>>>>
>>>> Y=Y+a*b*rnorm(length(Y))
>>>>
>>>> sub_fr=as.data.frame(do.call(rbind, strsplit(unlist(AB), ':')))
>>>> rownames(sub_fr)=NULL
>>>> colnames(sub_fr)=c('A', 'B')
>>>>
>>>> fr=data.frame(Y=Y,sub_fr)
>>>>
>>>> my_subset=function(amm) {
>>>>   coding=apply(
>>>>     amm
>>>>     ,1
>>>>     , function(x) {
>>>>       paste(x, collapse='')
>>>>     }
>>>>     )
>>>>   amm[match(unique(coding), coding),]
>>>> }
>>>>
>>>> my_subset(model.matrix(Y ~ A*B,fr))
>>>> my_subset(model.matrix(Y ~ (A+B)^2,fr))
>>>> my_subset(model.matrix(Y ~ A + B + A:B,fr))
>>>> my_subset(model.matrix(Y ~ A:B - 1,fr))
>>>> my_subset(model.matrix(Y ~ A:B,fr))
>>>>
>>>> On Fri, Mar 5, 2010 at 8:45 AM, Gabor Grothendieck
>>>> <ggrothendieck at gmail.com>  wrote:
>>>>          
>>>>> The way to understand this is to look at the output of model.matrix:
>>>>>
>>>>> model.matrix(fo, fr)
>>>>>
>>>>> for each formula you tried.  If your data is large you will have to
>>>>> use a subset not to be overwhelmed with output.
>>>>>
>>>>> On Fri, Mar 5, 2010 at 9:08 AM, blue sky<bluesky315 at gmail.com>  wrote:
>>>>>            
>>>>>> Suppose, 'fr' is data.frame with columns 'Y', 'A' and 'B'. 'A' has
>>>>>> levels 'Aa'
>>>>>> 'Ab' and 'Ac', and 'B' has levels 'Ba', 'Bb', 'Bc' and 'Bd'. 'Y'
>>>>>> columns are numbers.
>>>>>>
>>>>>> I tried the following three sets of commands. I understand that A*B is
>>>>>> equivalent to A+B+A:B. However, A:B in A+B+A:B is different from A:B
>>>>>> just by itself (see the 3rd and 4th set of commands). Would you please
>>>>>> help me understand why the meanings of A:B are different in different
>>>>>> contexts?
>>>>>>
>>>>>> I also see the coefficient of AAc:BBd is NA (the last set of
>>>>>> commands). I'm wondering why this coefficient is not removed from the
>>>>>> 'coefficients' vector. Since lm(Y~A) has coefficients for (intercept),
>>>>>> Ab, Ac (there are no NA's), I think that it is reasonable to make sure
>>>>>> that there are no NA's when there are interaction terms, namely, A:B
>>>>>> in this case.
>>>>>>
>>>>>> Thank you for answering my questions!
>>>>>>
>>>>>>              
>>>>>>> alm=lm(Y ~ A*B,fr)
>>>>>>> alm$coefficients
>>>>>>>                
>>>>>> (Intercept)         AAb         AAc         BBb         BBc         BBd
>>>>>>   -3.548176   -2.086586    7.003857    4.367800   11.887356   -3.470840
>>>>>>   AAb:BBb     AAc:BBb     AAb:BBc     AAc:BBc     AAb:BBd     AAc:BBd
>>>>>>   5.160865  -11.858425  -12.853116  -20.289611    6.727401   -2.327173
>>>>>>              
>>>>>>> alm=lm(Y ~ A + B + A:B,fr)
>>>>>>> alm$coefficients
>>>>>>>                
>>>>>> (Intercept)         AAb         AAc         BBb         BBc         BBd
>>>>>>   -3.548176   -2.086586    7.003857    4.367800   11.887356   -3.470840
>>>>>>   AAb:BBb     AAc:BBb     AAb:BBc     AAc:BBc     AAb:BBd     AAc:BBd
>>>>>>   5.160865  -11.858425  -12.853116  -20.289611    6.727401   -2.327173
>>>>>>              
>>>>>>> alm=lm(Y ~ A:B - 1,fr)
>>>>>>> alm$coefficients
>>>>>>>                
>>>>>>   AAa:BBa    AAb:BBa    AAc:BBa    AAa:BBb    AAb:BBb    AAc:BBb
>>>>>>   AAa:BBc
>>>>>> -3.5481765 -5.6347625  3.4556808  0.8196231  3.8939016 -4.0349449
>>>>>>   8.3391795
>>>>>>   AAb:BBc    AAc:BBc    AAa:BBd    AAb:BBd    AAc:BBd
>>>>>> -6.6005222 -4.9465744 -7.0190168 -2.3782017 -2.3423322
>>>>>>              
>>>>>>> alm=lm(Y ~ A:B,fr)
>>>>>>> alm$coefficients
>>>>>>>                
>>>>>> (Intercept)     AAa:BBa     AAb:BBa     AAc:BBa     AAa:BBb     AAb:BBb
>>>>>> -2.34233221 -1.20584424 -3.29243033  5.79801305  3.16195534  6.23623377
>>>>>>   AAc:BBb     AAa:BBc     AAb:BBc     AAc:BBc     AAa:BBd     AAb:BBd
>>>>>> -1.69261273 10.68151168 -4.25819000 -2.60424217 -4.67668454 -0.03586951
>>>>>>   AAc:BBd
>>>>>>        NA
>>>>>>
>>>>>>              
>>>        
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>      
>



More information about the R-help mailing list