[R] Collinearity in Moderated Multiple Regression

Bert Gunter gunter.berton at gene.com
Tue Aug 3 22:52:19 CEST 2010


"biased regression coefficients" is nonsense.  The coefficients are
unbiased: their expectation (in the appropriate model) is the true
value of the parameters (when estimated by, e.g. least squares).
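
A quick simulation sketch (arbitrary made-up coefficients and a
deliberately strong correlation between the predictors) illustrates the
point: the estimates get noisier because of the correlation, but they
still center on the true values.

set.seed(1)
nsim <- 2000
est <- replicate(nsim, {
  x1 <- rnorm(50)
  x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(50)  # cor(x1, x2) is about 0.9
  y  <- 1 + 2 * x1 + 3 * x2 + rnorm(50)         # true coefficients: 1, 2, 3
  coef(lm(y ~ x1 + x2))
})
rowMeans(est)      # close to (1, 2, 3): unbiased despite the collinearity
apply(est, 1, sd)  # but the slope estimates are noisier than they would be
                   # with orthogonal predictors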

The problem is model selection. I suggest you consult a local
statistician, as you seem confused about the basic concepts.

Bert Gunter
Genentech Nonclinical Biostatistics



On Tue, Aug 3, 2010 at 1:42 PM, Michael Haenlein <haenlein at escpeurope.eu> wrote:
> Thanks for all your comments!
>
> @Dennis: Are there any thresholds that I can use to evaluate the Variance
> Inflation Factor? I think I learned at some point that VIF should be less
> than 10, but probably that is too conservative? You mentioned in your
> example that a VIF of 13 is "not big enough to raise a red flag". So is the
> cut-off more around 15 or 20?
>
> @Bert: The purpose of my regression is inference, that is to know whether
> and to which extent x1, x2 and x1*x2 influence y. It's less about prediction
> than about understanding the relative impact of different variables. So, if
> I get your message correctly, correlation among the predictors is likely to
> be an issue in my case as it leads to biased regression coefficients (which
> is what I feared).
>
> Thanks,
>
> Michael
>
>
>
> -----Original Message-----
> From: Bert Gunter [mailto:gunter.berton at gene.com]
> Sent: Tuesday, August 03, 2010 22:37
> To: Dennis Murphy
> Cc: haenlein at gmail.com; r-help at r-project.org
> Subject: Re: [R] Collinearity in Moderated Multiple Regression
>
> Absolutely right.
>
> But I think it's also worth adding that when the predictors _are_
> correlated, the estimates of their coefficients depend on which are included
> in the model. This means that one should generally not try to interpret the
> individual coefficients, e.g. as a way to assess their relative importance.
> Rather, they should just be viewed as the machinery that produces the
> prediction surface, and that is what one needs to consider to understand the
> model.
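>
> A small illustration with made-up data (just a sketch): the estimated
> coefficient of x1 shifts depending on whether the correlated x2 is
> included, even though both models are fit to exactly the same data.
>
> set.seed(42)
> x1 <- rnorm(100)
> x2 <- 0.8 * x1 + 0.6 * rnorm(100)    # correlated with x1
> y  <- 1 + 2 * x1 + 2 * x2 + rnorm(100)
> coef(lm(y ~ x1 + x2))["x1"]          # close to 2
> coef(lm(y ~ x1))["x1"]               # much larger, because x1 now also
>                                      # stands in for the omitted x2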
>
> In my experience, this elementary fact is not understood by many
> (most?) nonstatistical practitioners using multiple regression -- and this
> ignorance gets them into a world of trouble.
>
> -- Bert
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
>
> On Tue, Aug 3, 2010 at 12:57 PM, Dennis Murphy <djmuser at gmail.com> wrote:
>>
>> Hi:
>>
>> On Tue, Aug 3, 2010 at 6:51 AM, <haenlein at gmail.com> wrote:
>>
>> > I'm sorry -- I think I chose a bad example. Let me start over again:
>> >
>> > I want to estimate a moderated regression model of the following form:
>> > y = a*x1 + b*x2 + c*x1*x2 + e
>> >
>>
>> No intercept? What's your null model, then?
>>
>>
>> >
>> > Based on my understanding, including an interaction term (x1*x2)
>> > into the regression in addition to x1 and x2 leads to issues of
>> > multicollinearity, as x1*x2 is likely to covary to some degree with x1
>> > (and x2).
>>
>>
>> Is it possible you're confusing interaction with multicollinearity?
>> You've stated that x1 and x2 are weakly correlated;  the product term
>> is going to be correlated with each of its constituent covariates, but
>> unless that correlation is above 0.9 (some say 0.95) in magnitude,
>> multicollinearity is not really a substantive issue. As others have
>> suggested, if you're concerned about multicollinearity, then fit the
>> interaction model and use the vif() function from package car or
>> elsewhere to check for it.
>> Multicollinearity has to do with ill-conditioning in the model matrix;
>> interaction means that the response y is influenced by the product of
>> x1 and
>> x2 covariates as well as the individual covariates. They are not the
>> same thing. Perhaps an example will help.
>>
>> Here's your x1 and x2 with a manufactured response:
>>
>> df <- data.frame(x1 = rep(1:3, each = 3),
>>                  x2 = rep(1:3, 3))
>> df$y <- 0.5 + df$x1 + 1.2 * df$x2 + 2.5 * df$x1 * df$x2 + rnorm(9)
>> # Response is generated to produce a significant interaction
>> df
>>  x1 x2         y
>> 1  1  1  5.968255
>> 2  1  2  7.566212
>> 3  1  3 13.420006
>> 4  2  1  9.025791
>> 5  2  2 16.382381
>> 6  2  3 20.923113
>> 7  3  1 11.669916
>> 8  3  2 20.714224
>> 9  3  3 31.757423
>>
>> m1 <- lm(y ~ x1 * x2, data = df)
>> > summary(m1)
>>     <snip>
>>
>> Coefficients:
>>            Estimate Std. Error t value Pr(>|t|)
>> (Intercept)   2.3642     2.6214   0.902  0.40846
>> x1           -0.1200     1.2135  -0.099  0.92505
>> x2            0.2549     1.2135   0.210  0.84193
>> x1:x2         3.1589     0.5617   5.624  0.00246 **
>> ---
>> Residual standard error: 1.123 on 5 degrees of freedom
>> Multiple R-squared: 0.9882,     Adjusted R-squared: 0.9812
>> F-statistic: 139.9 on 3 and 5 DF,  p-value: 3.053e-05
>>
>> # So the model has insignificant marginal covariate effects but a
>> # strong interaction effect.
>>
>> library(car)
>> vif(m1)
>>   x1    x2 x1:x2
>>    7     7    13
>>
>> # None of these is big enough to raise a red flag re
>> # multicollinearity. Let's look at the correlation
>> # matrix of the two covariates and their interaction.
>>
>> with(df, cor(cbind(x1, x2, x1 * x2)))
>>          x1        x2
>> x1 1.0000000 0.0000000 0.6793662
>> x2 0.0000000 1.0000000 0.6793662
>>   0.6793662 0.6793662 1.0000000
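>>
>> As a cross-check (a sketch using the objects above), the VIF of a term is
>> 1/(1 - R^2) from regressing that term on the other terms in the model:
>>
>> r2 <- summary(lm(I(x1 * x2) ~ x1 + x2, data = df))$r.squared
>> 1 / (1 - r2)  # about 13, matching vif(m1) above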
>>
>> The correlation of the interaction with the other two covariates is
>> 0.68, which is nowhere close to the 0.9 or above correlations that
>> signal potential multicollinearity.
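>>
>> And since mean centering often comes up here: centering x1 and x2 before
>> forming the product is only a reparameterization. A quick sketch with the
>> same objects shows that it changes the VIFs and the lower-order
>> coefficients, but not the interaction estimate or the fitted surface:
>>
>> df$x1c <- df$x1 - mean(df$x1)
>> df$x2c <- df$x2 - mean(df$x2)
>> m2 <- lm(y ~ x1c * x2c, data = df)
>> coef(summary(m1))["x1:x2", ]       # interaction estimate, SE, t, p
>> coef(summary(m2))["x1c:x2c", ]     # identical to the line above
>> all.equal(fitted(m1), fitted(m2))  # TRUE: same prediction surface
>> vif(m2)                            # VIFs drop to 1; nothing substantive changed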
>>
>> HTH,
>> Dennis
>>
>>
>> > One recommendation I have seen in this context is to use mean centering,
>> > but apparently this does not solve the problem (see: Echambadi, Raj
>> > and James D. Hess (2007), "Mean-centering does not alleviate
>> > collinearity problems in moderated multiple regression models,"
>> > Marketing Science, 26 (3), 438-445). So my question is: Which R
>> > function can I use to estimate this type of model?
>> >
>>
>> > Sorry for the confusion caused by my previous message,
>> >
>> > Michael
>> >
>> > On Aug 3, 2010 3:42pm, David Winsemius <dwinsemius at comcast.net> wrote:
>> > > I think you are attributing to "collinearity" a problem that is
>> > > due to your small sample size. You are predicting 9 points with 3
>> > > predictor terms, and incorrectly concluding that there is some
>> > > "inconsistency" because you get an R^2 that is above some number
>> > > you deem surprising. (I got values between 0.2 and 0.4 on several runs.)
>> >
>> > > Try:
>> >
>> > > x1
>> > > x2
>> > > x3
>> >
>> > > y
>> > > model
>> > > summary(model)
>> >
>> > > # Multiple R-squared: 0.04269
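>> >
>> > > A quick sketch of that point, using the same 3 x 3 design with a purely
>> > > random response: with three predictor terms plus an intercept fit to only
>> > > nine points, a sizeable R^2 comes essentially for free.
>> >
>> > > x1 <- rep(1:3, each = 3); x2 <- rep(1:3, 3); x3 <- x1 * x2
>> > > r2 <- replicate(1000, summary(lm(rnorm(9) ~ x1 + x2 + x3))$r.squared)
>> > > mean(r2)  # about 3/8 = 0.375, the expected R^2 when y is pure noise,
>> > >           # consistent with the 0.2-0.4 values seen in individual runs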
>> >
>> > > --
>> > > David.
>> >
>> > > On Aug 3, 2010, at 9:10 AM, Michael Haenlein wrote:
>> >
>> > > Dear all,
>> >
>> > > I have one dependent variable y and two independent variables x1 and x2
>> > > which I would like to use to explain y. x1 and x2 are design factors in
>> > > an experiment and are not correlated with each other. For example assume
>> > > that:
>> >
>> > > x1
>> > > x2
>> > > cor(x1,x2)
>> >
>> > > The problem is that I want to analyze not only the effect of x1 and x2 on
>> > > y but also that of their interaction x1*x2. Evidently this interaction
>> > > term has a substantial correlation with both x1 and x2:
>> >
>> > > x3
>> > > cor(x1,x3)
>> > > cor(x2,x3)
>> >
>> > > I therefore expect that a simple regression of y on x1, x2 and x1*x2 will
>> > > lead to biased results due to multicollinearity. For example, even when y
>> > > is completely random and unrelated to x1 and x2, I obtain a substantial
>> > > R2 for a simple linear model which includes all three variables. This
>> > > evidently does not make sense:
>> >
>> > > y
>> > > model
>> > > summary(model)
>> >
>> > > Is there some function within R or in some separate library that allows
>> > > me to estimate such a regression without obtaining inconsistent results?
>> >
>> > > Thanks for your help in advance,
>> >
>> > > Michael
>> >
>> > > Michael Haenlein
>> > > Associate Professor of Marketing
>> > > ESCP Europe
>> > > Paris, France
>> >
>> > > David Winsemius, MD
>> > > West Hartford, CT
>> >
>
>



More information about the R-help mailing list