[R] Collinearity in Moderated Multiple Regression

Liaw, Andy andy_liaw at merck.com
Wed Aug 4 15:38:26 CEST 2010


It seems to me worth stating what may be elementary to some on this list:

- If all relevant variables are included in the model and the "true model" is indeed linear, then all least squares estimated coefficients are unbiased.  [ David Ruppert once said about the three kinds of lies:  Lies, damn lies, and Y~N(Xb, s^2). ]

- If variables with non-zero "true coefficients" are omitted from the fitted model, the estimated coefficients of the variables that remain in the model may be biased, except when the omitted variables are orthogonal to those in the model (i.e., zero correlations).

- If x1 and x2 are correlated, you'd have a tough enough time separating their effects on y, let alone trying to assess their interaction effect on y.  
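
A quick toy sketch of that last point (made-up numbers, nothing from the
original problem): when x1 and x2 are nearly collinear, their individual
coefficients are poorly determined even though their combined effect is
estimated just fine.

# With nearly collinear predictors, the individual coefficients get
# inflated standard errors, but the combined effect does not.
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)          # cor(x1, x2) around 0.995
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(100)
summary(lm(y ~ x1 + x2))                 # inflated SEs on x1 and x2
summary(lm(y ~ I(x1 + x2)))              # their sum is well determined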

Andy

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
Sent: Tuesday, August 03, 2010 4:52 PM
To: Michael Haenlein
Cc: r-help at r-project.org
Subject: Re: [R] Collinearity in Moderated Multiple Regression

"biased regression coefficients" is nonsense.  The coefficients are
unbiased: their expectation (in the appropriate model) is the true
value of the parameters (when estimated by, e.g. least squares).
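
For what it's worth, a small simulation sketch (illustrative values only,
not the poster's data): with correlated predictors the estimates are
noisier, but they still average out to the true coefficients.

# Least-squares estimates remain unbiased even when the predictors
# are correlated; over many replications the means hit the true values.
set.seed(42)
est <- replicate(2000, {
  x1 <- rnorm(50)
  x2 <- 0.8 * x1 + rnorm(50, sd = 0.6)   # correlated with x1
  y  <- 1 + 2 * x1 - 1 * x2 + rnorm(50)
  coef(lm(y ~ x1 + x2))
})
rowMeans(est)    # approximately c(1, 2, -1), the true values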

The problem is model selection. I suggest you consult a local
statistician, as you seem confused about the basic concepts.

Bert Gunter
Genentech Nonclinical Biostatistics



On Tue, Aug 3, 2010 at 1:42 PM, Michael Haenlein <haenlein at escpeurope.eu> wrote:
> Thanks for all your comments!
>
> @Dennis: Are there any thresholds that I can use to evaluate the Variance
> Inflation Factor? I think I learned at some point that VIF should be less
> than 10, but probably that is too conservative? You mentioned in your
> example that a VIF of 13 is "not big enough to raise a red flag". So is the
> cut-off more around 15 or 20?
>
> @Bert: The purpose of my regression is inference, that is to know whether
> and to which extent x1, x2 and x1*x2 influence y. It's less about prediction
> than about understanding the relative impact of different variables. So, if
> I get your message correctly, correlation among the predictors is likely to
> be an issue in my case as it leads to biased regression coefficients (which
> is what I feared).
>
> Thanks,
>
> Michael
>
>
>
> -----Original Message-----
> From: Bert Gunter [mailto:gunter.berton at gene.com]
> Sent: Tuesday, August 03, 2010 22:37
> To: Dennis Murphy
> Cc: haenlein at gmail.com; r-help at r-project.org
> Subject: Re: [R] Collinearity in Moderated Multiple Regression
>
> Absolutely right.
>
> But I think it's also worth adding that when the predictors _are_
> correlated, the estimates of their coefficients depend on which are included
> in the model. This means that one should generally not try to interpret the
> individual coefficients, e.g. as a way to assess their relative importance.
> Rather, they should just be viewed as the machinery that produces the
> prediction surface, and that is what one needs to consider to understand the
> model.
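>
> A toy illustration of that dependence (made-up data, not the poster's):
>
> # With correlated predictors, the estimate for x1 changes a lot
> # depending on whether x2 is also in the model.
> set.seed(7)
> x1 <- rnorm(200)
> x2 <- 0.9 * x1 + rnorm(200, sd = 0.4)
> y  <- 1 + 1.5 * x1 + 1.5 * x2 + rnorm(200)
> coef(lm(y ~ x1))        # x1 absorbs much of x2's effect (about 1.5 + 0.9*1.5)
> coef(lm(y ~ x1 + x2))   # close to the true value 1.5 once x2 is included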
>
> In my experience, this elementary fact is not understood by many
> (most?) nonstatistical practitioners using multiple regression -- and this
> ignorance gets them into a world of trouble.
>
> -- Bert
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
>
> On Tue, Aug 3, 2010 at 12:57 PM, Dennis Murphy <djmuser at gmail.com> wrote:
>>
>> Hi:
>>
>> On Tue, Aug 3, 2010 at 6:51 AM, <haenlein at gmail.com> wrote:
>>
>> > I'm sorry -- I think I chose a bad example. Let me start over again:
>> >
>> > I want to estimate a moderated regression model of the following form:
>> > y = a*x1 + b*x2 + c*x1*x2 + e
>> >
>>
>> No intercept? What's your null model, then?
>>
>>
>> >
>> > Based on my understanding, including an interaction term (x1*x2)
>> > into the regression in addition to x1 and x2 leads to issues of
>> > multicollinearity, as x1*x2 is likely to covary to some degree
>> > with x1 (and x2).
>>
>>
>> Is it possible you're confusing interaction with multicollinearity?
>> You've stated that x1 and x2 are weakly correlated;  the product term
>> is going to be correlated with each of its constituent covariates, but
>> unless that correlation is above 0.9 (some say 0.95) in magnitude,
>> multicollinearity is not really a substantive issue. As others have
>> suggested, if you're concerned about multicollinearity, then fit the
>> interaction model and use the vif() function from package car or
>> elsewhere to check for it.
>> Multicollinearity has to do with ill-conditioning in the model matrix;
>> interaction means that the response y is influenced by the product of
>> x1 and
>> x2 covariates as well as the individual covariates. They are not the
>> same thing. Perhaps an example will help.
>>
>> Here's your x1 and x2 with a manufactured response:
>>
>> df <- data.frame(x1 = rep(1:3, each = 3),
>>                  x2 = rep(1:3, 3))
>> # Response is generated to produce a significant interaction
>> df$y <- 0.5 + df$x1 + 1.2 * df$x2 + 2.5 * df$x1 * df$x2 + rnorm(9)
>> df
>>  x1 x2         y
>> 1  1  1  5.968255
>> 2  1  2  7.566212
>> 3  1  3 13.420006
>> 4  2  1  9.025791
>> 5  2  2 16.382381
>> 6  2  3 20.923113
>> 7  3  1 11.669916
>> 8  3  2 20.714224
>> 9  3  3 31.757423
>>
>> m1 <- lm(y ~ x1 * x2, data = df)
>> > summary(m1)
>>     <snip>
>>
>> Coefficients:
>>            Estimate Std. Error t value Pr(>|t|)
>> (Intercept)   2.3642     2.6214   0.902  0.40846
>> x1           -0.1200     1.2135  -0.099  0.92505
>> x2            0.2549     1.2135   0.210  0.84193
>> x1:x2         3.1589     0.5617   5.624  0.00246 **
>> ---
>> Residual standard error: 1.123 on 5 degrees of freedom
>> Multiple R-squared: 0.9882,     Adjusted R-squared: 0.9812
>> F-statistic: 139.9 on 3 and 5 DF,  p-value: 3.053e-05
>>
>> # So the model has insignificant marginal covariate effects
>> # but a strong interaction effect.
>>
>> library(car)
>> vif(m1)
>>   x1    x2 x1:x2
>>    7     7    13
>>
>> # None of these is big enough to raise a red flag re
>> # multicollinearity. Let's look at the correlation
>> # matrix of the two covariates and their interaction.
>>
>> with(df, cor(cbind(x1, x2, x1 * x2)))
>>          x1        x2
>> x1 1.0000000 0.0000000 0.6793662
>> x2 0.0000000 1.0000000 0.6793662
>>   0.6793662 0.6793662 1.0000000
>>
>> The correlation of the interaction with the other two covariates is
>> 0.68, which is nowhere close to the 0.9 or above correlations that
>> signal potential multicollinearity.
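>>
>> A hedged aside using the same df (worth noting in view of the
>> Echambadi and Hess reference quoted below): mean-centering x1 and x2
>> before forming the product removes that correlation here, but it
>> changes neither the interaction estimate nor the fit.
>>
>> df$x1c <- df$x1 - mean(df$x1)
>> df$x2c <- df$x2 - mean(df$x2)
>> with(df, cor(x1c * x2c, cbind(x1c, x2c)))  # zero for this balanced design
>> m2 <- lm(y ~ x1c * x2c, data = df)
>> coef(summary(m2))["x1c:x2c", ]             # identical to x1:x2 in m1
>> c(summary(m1)$r.squared, summary(m2)$r.squared)  # same R^2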
>>
>> HTH,
>> Dennis
>>
>>
>> > One recommendation I have seen in this context is to use mean
>> > centering, but apparently this does not solve the problem (see:
>> > Echambadi, Raj and James D. Hess (2007), "Mean-centering does not
>> > alleviate collinearity problems in moderated multiple regression
>> > models," Marketing Science, 26 (3), 438-445). So my question is:
>> > which R function can I use to estimate this type of model?
>> >
>>
>> > Sorry for the confusion caused by my previous message,
>> >
>> > Michael
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Aug 3, 2010 3:42pm, David Winsemius <dwinsemius at comcast.net> wrote:
>> > > I think you are attributing to "collinearity" a problem that is
>> > > due to your small sample size. You are predicting 9 points with 3
>> > > predictor terms, and incorrectly concluding that there is some
>> > > "inconsistency" because you get an R^2 that is above some number
>> > > you deem surprising. (I got values between 0.2 and 0.4 on several
>> > > runs.)
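>> > >
>> > > A sketch of that check (assumed inputs: the 3 x 3 design shown
>> > > earlier in this thread, with y pure noise):
>> > >
>> > > set.seed(123)
>> > > x1 <- rep(1:3, each = 3)
>> > > x2 <- rep(1:3, 3)
>> > > y  <- rnorm(9)                        # unrelated to x1 and x2
>> > > summary(lm(y ~ x1 + x2 + x1:x2))$r.squared
>> > > # With 9 points and 3 predictor terms, a sizeable R^2 is expected
>> > > # by chance alone (about 3/8 on average for pure-noise y).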
>> >
>> >
>> >
>> > > Try:
>> >
>> > > x1
>> > > x2
>> > > x3
>> >
>> >
>> > > y
>> > > model
>> > > summary(model)
>> >
>> >
>> >
>> > > # Multiple R-squared: 0.04269
>> >
>> >
>> >
>> > > --
>> >
>> > > David.
>> >
>> >
>> >
>> > > On Aug 3, 2010, at 9:10 AM, Michael Haenlein wrote:
>> >
>> >
>> >
>> >
>> > > Dear all,
>> >
>> >
>> >
>> > > I have one dependent variable y and two independent variables x1
>> > > and x2 which I would like to use to explain y. x1 and x2 are
>> > > design factors in an experiment and are not correlated with each
>> > > other. For example assume that:
>> >
>> >
>> >
>> > > x1
>> > > x2
>> > > cor(x1,x2)
>> >
>> >
>> >
>> > > The problem is that I do not only want to analyze the effect of
>> > > x1 and x2 on y but also of their interaction x1*x2. Evidently
>> > > this interaction term has a substantial correlation with both x1
>> > > and x2:
>> >
>> >
>> >
>> > > x3
>> > > cor(x1,x3)
>> >
>> > > cor(x2,x3)
>> >
>> >
>> >
>> > > I therefore expect that a simple regression of y on x1, x2 and
>> > > x1*x2 will lead to biased results due to multicollinearity. For
>> > > example, even when y is completely random and unrelated to x1
>> > > and x2, I obtain a substantial R2 for a simple linear model which
>> > > includes all three variables. This evidently does not make sense:
>> >
>> >
>> >
>> > > y
>> > > model
>> > > summary(model)
>> >
>> >
>> >
>> > > Is there some function within R or in some separate library that
>> > > allows me to estimate such a regression without obtaining
>> > > inconsistent results?
>> >
>> >
>> >
>> > > Thanks for your help in advance,
>> >
>> >
>> >
>> > > Michael
>> >
>> >
>> >
>> >
>> >
>> > > Michael Haenlein
>> >
>> > > Associate Professor of Marketing
>> >
>> > > ESCP Europe
>> >
>> > > Paris, France
>> >
>> >
>> >
>> >
>> > > David Winsemius, MD
>> >
>> > > West Hartford, CT
>> >
>> >
>> >
>> >
>> >
>>
>
>

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


