[R] Simple question about formulae in R!?

David Winsemius dwinsemius at comcast.net
Fri Aug 10 17:51:26 CEST 2012


On Aug 10, 2012, at 8:09 AM, Joshua Wiley wrote:

> On Fri, Aug 10, 2012 at 7:39 AM, Henrik Singmann
> <henrik.singmann at psychologie.uni-freiburg.de> wrote:
>> Dear Michael and list,
>>
>> R in general tries hard to prohibit this behavior (i.e., including an
>> interaction but not the main effect). When removing a main effect and
>> leaving the interaction, the number of parameters is not reduced by
>> one (as would be expected) but stays the same, at least when using
>> model.matrix:
>
> This is because for factor or character variables, R automatically
> dummy codes them (or whatever contrasts were set; dummies are the
> default, but others could be set globally, which would then propagate
> to the specific model.matrix() call).
>
>>
>> d <- data.frame(A = rep(c("a1", "a2"), each = 50), B = c("b1", "b2"),
>>                 value = rnorm(10))
>> ncol(model.matrix(~ A*B, data = d))
>> #  [1] 4
>> ncol(model.matrix(~ A*B - A, data = d))
>> #  [1] 4
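One way to see how R reparameterizes the second model is to inspect the column names of the model matrix. A minimal sketch of the example above, assuming default treatment contrasts (and drawing value at full length rather than recycling 10 draws):

```r
# Rebuild the example and inspect how each model matrix is coded
d <- data.frame(A = rep(c("a1", "a2"), each = 50),
                B = c("b1", "b2"),
                value = rnorm(100))

# Full factorial: intercept, A contrast, B contrast, interaction contrast
colnames(model.matrix(~ A * B, data = d))
# "(Intercept)" "Aa2" "Bb2" "Aa2:Bb2"

# Dropping A: inside the interaction, B is now coded with full indicator
# columns rather than a contrast, so the column count stays at 4
colnames(model.matrix(~ A * B - A, data = d))
# "(Intercept)" "Bb2" "Aa2:Bb1" "Aa2:Bb2"
```

So the parameter is not removed; R simply re-expresses the same four-dimensional column space with a different coding.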
>>

That addresses a different question than the one being posed. This
answers the question for the model that was proposed:

 > d <- data.frame(A = rep(c("a1", "a2", "a3"), each = 50), B = letters[1:5],
 +                 value = rnorm(10), covar = rnorm(10))
 > ncol(model.matrix(~ A + B:covar, data = d))
[1] 8
 > ncol(model.matrix(~ A + B*covar, data = d))
[1] 12

So, no, the A + B:covar model is not going to have the same number of
parameters as the A + B*covar model. I am lining up with Michael on
this one.
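Inspecting the column counts makes the difference concrete. A minimal sketch along the lines of the transcript above, assuming default treatment contrasts (random draws taken at full length here):

```r
# One slope per level of B vs. B main effect plus a common covar slope
set.seed(1)
d <- data.frame(A = rep(c("a1", "a2", "a3"), each = 50),
                B = letters[1:5],
                value = rnorm(150),
                covar = rnorm(150))

# intercept + 2 contrasts for A + 5 B-specific slopes for covar
ncol(model.matrix(~ A + B:covar, data = d))    # 8

# adds 4 contrasts for B and one common slope for covar
ncol(model.matrix(~ A + B * covar, data = d))  # 12
```

The four extra columns in the second model are exactly the B main-effect contrasts plus the shared covar slope.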

-- 
David.

>> I actually don't understand how R parametrizes the model in the
>> second case, but I am pretty sure someone here does and will be able
>> to explain.
>
> One way to think about this is the regression versus ANOVA style of
> parameterization.  Consider these three simple models
>
> Here we have a traditional regression. am is a two-level factor, so
> it gets one dummy code alongside the intercept. The model matrix looks
> something like:
>
> i a
> 1 0
> 1 0
> 1 0
> 1 1
> 1 1
> 1 1
>
> what does that mean for the expected values?  For a = 0,
>
> yhat = b0 * 1 + b1 * 0 = b0 * 1, the intercept
>
> for a = 1
>
> yhat = b0 * 1 + b1 * 1
>
> what is the difference between these two?  Merely b1.  Hence under
> this parameterization, b0 is the expected value for the group a = 0,
> and b1 is the difference in expected values between the two groups a =
> 0 and a = 1.
>
>> coef(lm(mpg ~ 1 + factor(am), data = mtcars))
> (Intercept) factor(am)1
>  17.147368    7.244939
>
> We can find the expected values for each group. The first is just b0,
> the intercept term. The second is the intercept plus b1, the single
> other coefficient. We can get it by summing all the coefficients,
> like so:
>
>> sum(coef(lm(mpg ~ 1 + factor(am), data = mtcars)))
> [1] 24.39231
>
> Now, if we explicitly force out the intercept, R uses a different
> coding scheme.  The intercept normally captures some reference or
> baseline group, which every other group is compared to, but if we
> force it out, the default design matrix is something like:
>
> 1 0
> 1 0
> 1 0
> 0 1
> 0 1
> 0 1
>
> this looks very similar to what we had before, only now the two
> columns are complementary indicators. Let's think again about what
> our expected values are. For a = 0
>
> yhat = b0 * 1 + b1 * 0 = b0 * 1
>
> for a = 1
>
> yhat = b0 * 0 + b1 * 1 = b1 * 1
>
> now b0 encodes the expected value of the group a = 0 and b1 encodes
> the expected value of the group a = 1. By default these coefficients
> are tested against 0. There is no longer an explicit test of whether
> the two groups differ from each other, because instead of encoding
> differences, we have encoded the overall group expectations. In lm():
>
>> coef(lm(mpg ~ 0 + factor(am), data = mtcars))
> factor(am)0 factor(am)1
>   17.14737    24.39231
>
> both those numbers should be familiar from before. In this case, both
> models include two parameters; the only difference is whether they
> encode two group expectations, or one expectation and an expected
> difference. These are equivalent models, and their likelihoods etc.
> will be identical. Note, however, that if you suppress the intercept,
> R uses a different formula for the R squared, so those will not look
> identical (although if you got predicted values, yhat, from both
> models, you would find those identical).
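This equivalence is easy to check directly; a small sketch fitting both codings of the same mtcars model:

```r
# The two codings of the same model: intercept-plus-difference vs. group means
m1 <- lm(mpg ~ 1 + factor(am), data = mtcars)  # intercept + difference
m2 <- lm(mpg ~ 0 + factor(am), data = mtcars)  # one mean per group

all.equal(fitted(m1), fitted(m2))  # TRUE: identical predictions
logLik(m1)                         # same log-likelihood as m2
logLik(m2)
summary(m1)$r.squared              # the R-squared values differ, as noted,
summary(m2)$r.squared              # because the no-intercept formula is used
```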
>
> This concept generalizes to the interaction of categorical variables
> without their main effects.  Normally, the interaction terms are
> expected differences from the reference group, but if the "main
> effect" is dropped, then instead, the interactions become the expected
> values of the various cells (if you think of a 2 x 2 table of group
> membership).
>
> For example here it is as usual:
>
>> coef(lm(mpg ~ 1 + factor(vs) * factor(am), data = mtcars))
>            (Intercept)             factor(vs)1             factor(am)1
>              15.050000                5.692857                4.700000
> factor(vs)1:factor(am)1
>               2.928571
>
>> 15.050000 + 5.692857
> [1] 20.74286
>> 15.050000 + 4.7
> [1] 19.75
>> 15.050000 + 5.692857 + 4.7 + 2.928571
> [1] 28.37143
>
> the intercept is the expected value for the group vs = 0 & am = 0,
> then you get the difference between expectations for am = 0 & vs = 0
> and am = 0 & vs = 1, then likewise for am, and finally, the additional
> bit of being am = 1 and vs = 1.  I show the by hand calculations to
> get the expected group values.  Right now, we get significance tests
> of whether these differences are statistically significantly different
> from zero.
>
> If we drop the main effects and intercept, we get:
>
>> coef(lm(mpg ~ 0 + factor(vs) : factor(am), data = mtcars))
> factor(vs)0:factor(am)0 factor(vs)1:factor(am)0 factor(vs)0:factor(am)1
>                15.05000                20.74286                19.75000
> factor(vs)1:factor(am)1
>                28.37143
>
> which you can see directly gets all the group expectations.
>
> Cheers,
>
> Josh
>
>> I have asked a question on how to get around this "limitation" on
>> stackoverflow with helpful answers by Ben Bolker and Joshua Wiley:
>> http://stackoverflow.com/q/11335923/289572
>> (this functionality is now used in the function mixed() in my new
>> package afex for obtaining "type 3" p-values for mixed models)
>
> With few exceptions, I'm in the habit of not giving people type 3
> p-values. People like them, but generally do not actually understand
> them. I would also like to point out that I am in psychology, so yes,
> I know psychology likes them. Further, I am a doctoral student, so it
> is not as if I am so established that people bow to my every whim; it
> is sometimes work to explain why I am not giving them, and I have had
> discussions in various lab meetings and with advisors about this, but
> my point is that it is possible. I would urge you to do the same and
> take heart that there are others working for change; you are not the
> only one, even if it feels like it at your university.
>
>>
>> Cheers,
>> Henrik
>>
>> Am 10.08.2012 15:48, schrieb R. Michael Weylandt:
>>
>>> On Fri, Aug 10, 2012 at 7:36 AM, ONKELINX, Thierry
>>> <Thierry.ONKELINX at inbo.be> wrote:
>>>>
>>>> Dear Johan,
>>>>
>>>> Why should it be complicated? You have a very simple model, thus a
>>>> very simple formula. Isn't that great?
>>>>
>>>> Your formula matches the model, though Trust ~ Culture + Structure *
>>>> Speed_of_Integration is another option. The model fit is the same;
>>>> the only difference is the parameterization of the model.
>>>
>>>
>>> Not quite, I don't think: A*B is shorthand for A + B + A:B, where
>>> A:B is the second-order (interaction) term. The OP only had the
>>> second-order term without the main effects.
>>>
>>> OP: You almost certainly want A*B -- it's strange (though certainly
>>> not impossible) to have good models where you only have interactions
>>> but not main effects.
>>>
>>> Cheers,
>>> Michael
>>>
>>>
>>>>
>>>> Best regards,
>>>>
>>>> Thierry
>>>>
>>>> ir. Thierry Onkelinx
>>>> Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
>>>> team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
>>>> Kliniekstraat 25
>>>> 1070 Anderlecht
>>>> Belgium
>>>> + 32 2 525 02 51
>>>> + 32 54 43 61 85
>>>> Thierry.Onkelinx at inbo.be
>>>> www.inbo.be
>>>>
>>>> To call in the statistician after the experiment is done may be no
>>>> more than asking him to perform a post-mortem examination: he may
>>>> be able to say what the experiment died of.
>>>> ~ Sir Ronald Aylmer Fisher
>>>>
>>>> The plural of anecdote is not data.
>>>> ~ Roger Brinner
>>>>
>>>> The combination of some data and an aching desire for an answer
>>>> does not ensure that a reasonable answer can be extracted from a
>>>> given body of data.
>>>> ~ John Tukey
>>>>
>>>>
>>>> -----Original message-----
>>>> From: r-help-bounces at r-project.org
>>>> [mailto:r-help-bounces at r-project.org]
>>>> On behalf of Johan Haasnoot
>>>> Sent: Friday, August 10, 2012 9:15
>>>> To: r-help at r-project.org
>>>> Subject: [R] Simple question about formulae in R!?
>>>>
>>>> Good morning reader,
>>>>
>>>>
>>>>
>>>> I have encountered a probably simple issue with respect to the
>>>> *formula* of a *regression model* I want to use in my research. I'm
>>>> researching alliances as part of my study of Business Economics
>>>> (focus: Strategy) at the Vrije Universiteit in Amsterdam. In the
>>>> research model I use a moderating variable; I'm looking for
>>>> confirmation of, or help with, the formulation of the model.
>>>>
>>>>
>>>>
>>>> The research model consists of 2 explanatory variables, a
>>>> moderating variable, and 1 response variable. The first explanatory
>>>> variable is Culture, measured on a nominal scale, and the second is
>>>> Structure of the alliance, also measured on a nominal scale. The
>>>> moderating variable in the relation towards Trust is Speed of
>>>> Integration, measured on an interval scale. The response variable
>>>> is Trust, measured on a nominal scale (highly likely a 5-point
>>>> Likert scale). Given the research model and the measurement scales,
>>>> I intend to use an ordered probit model, often used in marketing
>>>> models, to execute the regression modelling. I can't seem to find
>>>> confirmation on how to specify the formula. I have been reading and
>>>> studying R for a couple of weeks now, and read a lot of books from
>>>> the R series in the past, but I can't get a grasp on this seemingly
>>>> simple formula. I think I understand how to model multinomial
>>>> regression (using the R package MNP), how to execute a Principal
>>>> Components Analysis and an Exploratory Factor Analysis (obviously
>>>> I'm using a questionnaire to collect my data), but the formula
>>>> itself seems to be too simple.
>>>>
>>>>
>>>>
>>>> I expect to use the interaction symbol: "Trust ~ Culture +
>>>> Structure : Speed_of_Integration" for the formula, but is it really
>>>> this simple? Can anybody confirm this, or help and advise me on
>>>> this issue?
>>>>
>>>>
>>>> Kind regards,
>>>>
>>>> --
>>>>
>>>> With kind regards,
>>>>
>>>> Johan Haasnoot
>>>>
>>>> De Haan & Martojo
>>>> Kerklaan 5
>>>> 3645 ES Vinkeveen
>>>> Telephone: 0297-266354
>>>> Mobile: 06-81827665
>>>> Email: johan.haasnoot at dh-m.nl
>>>> Website: www.dehaanenmartojo.nl
>>>>
>>>>
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>> --
>> Dipl. Psych. Henrik Singmann
>> PhD Student
>> Albert-Ludwigs-Universität Freiburg, Germany
>> http://www.psychologie.uni-freiburg.de/Members/singmann
>>
>>
>
>
>
> -- 
> Joshua Wiley
> Ph.D. Student, Health Psychology
> Programmer Analyst II, Statistical Consulting Group
> University of California, Los Angeles
> https://joshuawiley.com/
>

David Winsemius, MD
Alameda, CA, USA


