[R] Multiplication of regression coefficient by factor type variable

Bert Gunter bgunter.4567 at gmail.com
Sat Nov 17 22:58:26 CET 2018


You shouldn't have to do any of what you are doing.

See ?predict.lm and note the "newdata" argument.
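For example, a minimal sketch of the leave-one-out step (assuming the
variables live in a data frame, here called dat, with the columns from
your code):

dat <- data.frame(income = data$income,
                  sex = as.factor(data$sex),
                  age = data$age,
                  ed = as.factor(data$ed))
n <- nrow(dat)
e1 <- numeric(n)
for (i in seq_len(n)) {
  # fit on the n-1 remaining rows
  fit <- lm(income ~ sex + age + ed, data = dat[-i, ])
  # predict() applies the right dummy coding to row i itself;
  # no manual coefficient-times-factor multiplication is needed
  e1[i] <- dat$income[i] - predict(fit, newdata = dat[i, ])
}
sqrt(mean(e1^2))  # leave-one-out RMSE

(This assumes every factor level still occurs among the training rows
at each step; otherwise predict() will complain about new levels.)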

Also, you should spend some time studying a linear models text, as your
question appears to indicate some basic confusion about how they work
(e.g. about "contrasts").


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Sat, Nov 17, 2018 at 1:24 PM Julian Righ Sampedro
<julian.righ.sampedro at gmail.com> wrote:
>
> Dear all,
>
> In a context of regression, I have three regressors, two of which are
> categorical variables (sex and education) and have class 'factor'.
>
> y = data$income
> x1 = as.factor(data$sex)  # 2 levels
> x2 = data$age  # continuous
> x3 = as.factor(data$ed)  # 8 levels
>
> for example, the first entries of x3 are
>
> head(x3)
> [1] 5 3 5 5 4 2
> Levels: 1 2 3 4 5 6 7 8
>
> When we call the model, the output looks like this
>
> model1=lm(y ~ x1 + x2 + x3, data = data)
> summary(model1)
>
> Residuals:
>    Min     1Q Median     3Q    Max
> -31220  -6300   -594   4429 190731
>
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)  1440.66    3809.99   0.378 0.705417
> x1          -4960.88     772.96  -6.418 2.13e-10 ***
> x2            181.45      25.03   7.249 8.41e-13 ***
> x32          2174.95    3453.22   0.630 0.528948
> x33          7497.68    3428.94   2.187 0.029004 *
> x34          8278.97    3576.30   2.315 0.020817 *
> x35         13686.88    3454.93   3.962 7.97e-05 ***
> x36         15902.92    4408.49   3.607 0.000325 ***
> x37         28773.13    3696.77   7.783 1.76e-14 ***
> x38         31455.55    5448.11   5.774 1.03e-08 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 12060 on 1001 degrees of freedom
> Multiple R-squared:  0.2486,    Adjusted R-squared:  0.2418
> F-statistic: 36.79 on 9 and 1001 DF,  p-value: < 2.2e-16
>
> Now suppose I want to compute the residuals. To do so, I first need to
> compute the model's prediction. (I use this in a cross-validation
> context, so this is only a partial display of the code.)
>
> yhat1 = model1$coef[1] + model1$coef[2]*x1[i] + model1$coef[3]*x2[i] +
> model1$coef[4]*x3[i]
>
> But I get the following warnings
>
> Warning messages:
> 1: In Ops.factor(model1$coef[2], x1[i]) : ‘*’ not meaningful for factors
> 2: In Ops.factor(model1$coef[4], x3[i]) : ‘*’ not meaningful for factors
>
> 1st question: Is there a way to multiply the coefficient by the 'factor'
> without having to transform my 'factor' into a 'numeric' type variable?
>
> 2nd question: Since x3 is associated with 7 parameters (one for x32, one
> for x33, ..., one for x38), how do I multiply the 'correct' parameter
> coefficient by my 'factor' x3?
>
> I have been considering an 'if-then' solution, but to no avail. I have
> also considered splitting my x3 variable into 8 binary variables, without
> success. What would be the best approach? Thank you for your help.
>
> Since I understand this may not be specific enough, I add here the
> complete code
>
> # for n-fold cross-validation
> # fit models on leave-one-out samples
> x1= as.factor(data$sex)
> x2= data$age
> x3= as.factor(data$ed)
> yn=data$income
> n = length(yn)
> e1 = e2 = numeric(n)
>
> for (i in 1:n) {
>   # the ith observation is excluded
>   y = yn[-i]
>   x_1 = x1[-i]
>   x_2 = x2[-i]
>   x_3 = x3[-i]
>   x_4 = as.factor(cf4)[-i]
>   # fit the first model without the ith observation
>   J1 = lm(y ~ x_1 + x_2 + x_3)
>   yhat1 = J1$coef[1] + J1$coef[2]*x1[i] + J1$coef[3]*x2[i] + J1$coef[4]*x3[i]
>   # construct the ith part of the loss function for model 1
>   e1[i] = yn[i] - yhat1
>   # fit the second model without the ith observation
>   J2 = lm(y ~ x_1 + x_2 + x_3  + x_4)
>   yhat2=J2$coef[1]+J2$coef[2]*x1[i]+J2$coef[3]*x2[i]+J2$coef[4]*x3[i]+J2$coef[5]*cf4[i]
>   e2[i] = yn[i] - yhat2
>  }
>  sqrt(c(mean(e1^2),mean(e2^2))) # RMSE
>
> cf4 is a variable corresponding to groups (obtained using mixtures). What
> we want to demonstrate is that the cross-validated prediction error is
> lower when we include this latent grouping variable. It works wonders when
> the categorical variables are treated as 'numeric' variables, though the
> OLS estimates are then obviously very different.
>
> Thank you in advance for your views on the problem.
>
> Best Regards,
>
> julian
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


