[R] How are interaction terms computed in lm's result / problems with interaction terms in lm?

Sun Sep 18 21:18:21 CEST 2016

> On Sep 18, 2016, at 11:01 AM, mviljamaa <mviljamaa at kapsi.fi> wrote:
> 
> Also if you, rather than doing what's done below, do:
> 
> fit3 <- lm(kidmomhsage$kid_score ~ kidmomhsage$mom_age + kidmomhsage$mom_hs + kidmomhsage$mom_age * kidmomhsage$mom_hs)
> 
> Then this gives the result:
> 
> Call:
> lm(formula = kidmomhsage$kid_score ~ kidmomhsage$mom_age + kidmomhsage$mom_hs +
>    kidmomhsage$mom_age * kidmomhsage$mom_hs)
> 
> Coefficients:
>                           (Intercept)
>                               110.542
>                   kidmomhsage$mom_age
>                                -1.522
>                    kidmomhsage$mom_hs
>                               -41.287
> kidmomhsage$mom_age:kidmomhsage$mom_hs
>                                 2.391
> 
> Where the interaction term now seems properly interpretable. So perhaps this is the way to use interaction terms with lm.
> 
> However, in the above, is the coefficient 2.391 of kidmomhsage$mom_age:kidmomhsage$mom_hs actually only that for mom_hs == 1 in which case for mom_hs == 0 one would simply ignore the last coefficient?

Yes.

In all of this it would be much clearer and safer if you supplied a dataframe to the data parameter of lm:

lm(formula = kid_score ~ mom_age + mom_hs + mom_age*mom_hs, data = kidmomhsage)
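
As a side note, `mom_age*mom_hs` in a formula already expands to both main effects plus the interaction, so the shorter formula should fit the same model. A minimal sketch with simulated data (the data frame `d` and its values are invented for illustration; this is not the kidmomhsage data):

```r
set.seed(1)
# Simulated stand-in for kidmomhsage; all values are made up for illustration
d <- data.frame(mom_age = round(runif(200, 17, 30)),
                mom_hs  = rbinom(200, 1, 0.7))
d$kid_score <- 80 + 0.5 * d$mom_age + 10 * d$mom_hs + rnorm(200, sd = 15)

# Long form: main effects written out, then the crossed term
fit_long  <- lm(kid_score ~ mom_age + mom_hs + mom_age*mom_hs, data = d)
# Short form: `*` alone expands to main effects plus interaction
fit_short <- lm(kid_score ~ mom_age*mom_hs, data = d)

# Coefficient names are now short (no `d$` prefix) and the fits agree
coef(fit_long)
all.equal(coef(fit_long), coef(fit_short))
```

Using the data argument also keeps the coefficient names readable and lets predict() work with newdata.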

> 
> And would one still need to perform summations of kidmomhsage$mom_age and kidmomhsage$mom_age:kidmomhsage$mom_hs coefficients, i.e. the coefficient for kidmomhsage$mom_age = -1.522 + 2.391?

Yes, at least if I'm understanding your terminology. That is the net mom_age coefficient for those subjects with mom_hs values not at the base level.
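
To make the arithmetic concrete, here is a small sketch that uses only the four coefficients printed in your output above (the names in the vector `b` are just labels I chose for illustration):

```r
# Coefficients copied from the printed fit3 output
b <- c(intercept = 110.542, mom_age = -1.522,
       mom_hs = -41.287, interaction = 2.391)

# mom_hs == 0 (base level): the mom_hs and interaction terms drop out
intercept_hs0 <- b[["intercept"]]
slope_hs0     <- b[["mom_age"]]

# mom_hs == 1: add the mom_hs coefficient to the intercept and
# the interaction coefficient to the mom_age slope
intercept_hs1 <- b[["intercept"]] + b[["mom_hs"]]
slope_hs1     <- b[["mom_age"]] + b[["interaction"]]

c(intercept_hs0 = intercept_hs0, slope_hs0 = slope_hs0)
c(intercept_hs1 = intercept_hs1, slope_hs1 = slope_hs1)
```

So the fitted line for mom_hs == 1 has slope -1.522 + 2.391 = 0.869 and intercept 110.542 - 41.287 = 69.255.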

> 
> 
> On 2016-09-18 20:41, mviljamaa wrote:
>> I'm trying to use interaction terms in lm and for the following types of models:
>> fit3_hs <- lm(kidmomhsage$kid_score ~ kidmomhsage$mom_age +
>> kidmomhsage$mom_hs + kidmomhsage$mom_age * 1)
>> fit3_nohs <- lm(kidmomhsage$kid_score ~ kidmomhsage$mom_age +
>> kidmomhsage$mom_hs + kidmomhsage$mom_age * 0)
>> where you see the last term being the interaction term (it's
>> mom_age*mom_hs where mom_hs takes values 0 or 1), the results are
>> causing a bit of confusion.
>> fit3_hs returns:
>> Call:
>> lm(formula = kidmomhsage$kid_score ~ kidmomhsage$mom_age + kidmomhsage$mom_hs +
>>    kidmomhsage$mom_age * 1)
>> Coefficients:
>>        (Intercept)  kidmomhsage$mom_age
>>            70.4787               0.3261
>> kidmomhsage$mom_hs
>>            11.3112
>> fit3_nohs returns:
>> Call:
>> lm(formula = kidmomhsage$kid_score ~ kidmomhsage$mom_age + kidmomhsage$mom_hs +
>>    kidmomhsage$mom_age * 0)
>> Coefficients:
>> kidmomhsage$mom_age   kidmomhsage$mom_hs
>>              3.368               11.568
>> Now why is (Intercept) term missing from the second one?

In R, the formula terms `1` and `0` have special meanings: `1` stands for the intercept and `0` removes it. In the first model you "formula-added" mom_age to mom_age and got, not 2*mom_age, but rather just mom_age. In the second model you got the formula equivalent of `mom_age + mom_hs + 0`, which is an intercept-free specification. Read:

?formula
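
One way to see what a formula really expands to, without fitting anything, is to inspect its terms object. A minimal sketch, using placeholder variable names `y`, `a` and `b` (assumed stand-ins for kid_score, mom_age and mom_hs):

```r
# Placeholder formulas mirroring fit3_hs and fit3_nohs
f_hs   <- y ~ a + b + a * 1
f_nohs <- y ~ a + b + a * 0

t_hs   <- terms(f_hs)
t_nohs <- terms(f_nohs)

# Terms that survive expansion: no interaction appears in either model
attr(t_hs, "term.labels")
attr(t_nohs, "term.labels")

# 1 means an intercept is fitted, 0 means it has been removed
attr(t_hs, "intercept")
attr(t_nohs, "intercept")
```

This matches the printed fits: the first has (Intercept), mom_age and mom_hs, and the second has only mom_age and mom_hs.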

I thought I remembered a pithy summary of this topic by Greg Snow in the fortunes package, about why one should almost never use intercept-free models, but it's not showing up for me. Perhaps some of these Rhelp threads will be useful:

http://markmail.org/message/o7kbarfvpdobmdir?q=list:org%2Er-project%2Er-help+snow+intercept+0

You could easily substitute 'ripley', 'lumley' or several other names in that search strategy in Rhelp's archives and get equally credible material.

>> Also since in the first one the interaction term's coefficient should
>> be added to the coefficient of mom_age, then is the return value of
>> kidmomhsage$mom_age 0.3261 the sum of the coefficient of mom_age and
>> the coefficient of the interaction term? Or would I need to produce
>> the sum myself somehow?

In the first one the intercept is the mean predicted score for a mom_age of zero and a mom_hs at the base value, so it is essentially a reference value to which the _age and _hs increments or decrements are added for cases with particular values of those covariates. That model contains no interaction term (the `* 1` added nothing), so 0.3261 is not a sum of two coefficients: the mom_age coefficient is a single slope estimated over the cases with both values of mom_hs.

These sound like questions whose answers are typically learned in a first course on regression. So the answers _should_ all be in whatever standard regression textbook you _should_ be reading. They are only borderline on-topic for rhelp. We don't advertise as a statistics tutoring service, so I think any followup questions on this matter of interpreting model output should be directed to CrossValidated.com.

As the standard sig says: Please read the Posting Guide and the second line as well.
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA