[R] lm with a single X and step with several Xi-s, beta coef. quite different:

Aldi Kraja aldi at wustl.edu
Tue Aug 7 21:28:30 CEST 2012


Hi, (R version 2.15.0)
I am running a program with one response (Y, standardized earlier) and 44 
independent variables (Xi) from the same data frame, a2.
When I run lm() on a single Xi at a time, the coefficient for, say, X1 
is -0.09682 (se = 0.03256).
But when I run the same Y against all 44 Xi-s through the step() function 
(since I left the direction argument empty, I assume a backward elimination 
is performed), 12 Xi-s remain in the final model, X1 among them, and the 
X1 coefficient becomes -0.43402 (se = 0.06847).

I did not expect such a drastic change (more than four times larger in 
magnitude) in the coefficient from lm() with X1 alone (bx1 = -0.09682) to 
the final step() model with 12 Xi-s including X1 (bx1 = -0.43402).
I understand that the multiple regression produces partial regression 
coefficients, with all the other Xi-s held constant, but is there a good 
reason why X1 can become so much more significant in the multivariable 
model (from px1 = 0.00296 ** with lm to px1 = 2.55e-10 *** after step)?
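
To show the kind of behaviour I mean, here is a minimal simulated sketch 
(made-up numbers, not my data) in which a predictor's marginal slope sits 
near -0.08 while its partial slope sits near -0.4, purely because a 
correlated predictor with an opposite-signed effect enters the model:

> set.seed(1)
> n  <- 4000
> x2 <- rnorm(n)
> x1 <- 0.8 * x2 + rnorm(n, sd = 0.6)     # x1 and x2 are correlated
> y  <- -0.4 * x1 + 0.4 * x2 + rnorm(n)   # their effects on y partly cancel
> coef(summary(lm(y ~ x1)))["x1", ]       # marginal slope, close to -0.08
> coef(summary(lm(y ~ x1 + x2)))["x1", ]  # partial slope, close to -0.40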

Some of the 44 Xi-s are correlated with each other, but I am hoping that 
the stepwise regression will drop some of the correlated ones (a rough way 
I could check this is sketched after the frequency table below).
The Xi-s are coded numerically as 0, 1, 2 so that a linear regression can 
be applied to them.
For example, the frequency table of X1 is:

   0    1    2
3459  985   96
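
As an aside, the kind of collinearity check I have in mind (assuming the 
predictor columns of a2 other than y are all the numeric 0/1/2 codes) 
would be something like:

> xs <- a2[, setdiff(names(a2), "y")]    # the 44 Xi-s
> cm <- cor(xs)
> cm[upper.tri(cm, diag = TRUE)] <- NA   # keep each pair only once
> which(abs(cm) > 0.7, arr.ind = TRUE)   # Xi pairs with |r| > 0.7
> # variance inflation factors would need the car package:
> # library(car); vif(lm.final)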

Output of lm(Y ~ X1):
==================
> obj1 <- lm(y ~ x1, data = a2)
> summary(obj1)

Call:
lm(formula = y ~ x1, data = a2)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3418 -0.7240 -0.0462  0.6577  4.2929 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.03635    0.01781   2.042  0.04124 * 
x1          -0.09682    0.03256  -2.973  0.00296 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.024 on 4255 degrees of freedom
Multiple R-squared: 0.002074, Adjusted R-squared: 0.001839
F-statistic: 8.842 on 1 and 4255 DF, p-value: 0.002961

Output from the step() function on all 44 Xi-s:
====================================
> a2 <- na.omit(ac16g761[, 3:(44+2+1)])   # complete cases of y and the 44 Xi-s
> lm.a2 <- lm(y ~ ., data = a2)           # full model with all 44 Xi-s
> lm.final <- step(lm.a2, trace = FALSE)  # stepwise selection, default direction
> summary(lm.final)
Call:
lm(formula = y ~ x1 + x2 +
x3 + x4 + x5 + x6 + x7 + x8 +
x9 + x10 + x11 + x12, data = a2)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2955 -0.7210 -0.0611  0.6623  4.1064 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.01065    0.02637   0.404 0.686412    
x1          -0.43402    0.06847  -6.339 2.55e-10 ***
x2          -0.17109    0.11370  -1.505 0.132464    
x3           0.23552    0.11552   2.039 0.041533 *  
x4          -0.19898    0.10133  -1.964 0.049625 *  
x5           0.06653    0.03796   1.752 0.079769 .  
x6           0.18319    0.08592   2.132 0.033070 *  
x7          -0.17443    0.05095  -3.424 0.000624 ***
x8           0.24013    0.06516   3.685 0.000232 ***
x9           0.19202    0.08009   2.398 0.016543 *  
x10         -0.17257    0.05576  -3.095 0.001983 ** 
x11         -0.23537    0.05704  -4.126 3.75e-05 ***
x12          0.25992    0.06260   4.152 3.35e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.02 on 4244 degrees of freedom
Multiple R-squared: 0.01353, Adjusted R-squared: 0.01074
F-statistic: 4.851 on 12 and 4244 DF, p-value: 5.466e-08
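
For completeness, making the backward search explicit (as far as I can 
tell from ?step, the default direction falls back to "backward" when the 
scope argument is missing) should reproduce the same final model:

> lm.backward <- step(lm.a2, direction = "backward", trace = FALSE)
> all.equal(coef(lm.final), coef(lm.backward))   # expected to be TRUE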

Thank you in advance,

Aldi

P.S. Sorry that I cannot distribute these data for a test.
