[R] How to use PC1 of PCA and dim1 of MCA as a predictor in logistic regression model for data reduction

khosoda at med.kobe-u.ac.jp khosoda at med.kobe-u.ac.jp
Wed Aug 17 17:12:10 CEST 2011


Hi all,

I'm trying to do model reduction for logistic regression. I have 13
predictors (4 continuous variables and 9 binary variables). Using subject
matter knowledge, I selected 4 important variables. For the remaining 9
variables, I tried to perform data reduction by principal component
analysis (PCA). However, 8 of the 9 variables were binary and only one was
continuous. I transformed the data with transcan from the rms package and
did PCA with princomp. PC1 explained only 20% of the variance. Still, I
used PC1 as a predictor in the logistic model and obtained some results.
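
For readers following along, the PCA step described above might look roughly
like the sketch below. It uses simulated stand-in data (the variable names
bin1 to bin8 and the outcome y are hypothetical), base R only, and it omits
the transcan transformation, so it is an illustration under those
assumptions rather than the analysis actually run.

```r
# Simulated stand-in data: one continuous and eight binary predictors.
# All names and values here are hypothetical, not the poster's data.
set.seed(1)
n <- 104
X <- data.frame(age = rnorm(n, mean = 71, sd = 7),
                matrix(rbinom(n * 8, size = 1, prob = 0.4), nrow = n))
names(X)[-1] <- paste0("bin", 1:8)
y <- rbinom(n, size = 1, prob = 0.5)   # hypothetical binary outcome

# PCA on the correlation matrix so that age does not dominate by scale.
pca <- princomp(X, cor = TRUE)

# Proportion of variance explained by each component.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Component scores for each subject; the first column is PC1.
pc1 <- pca$scores[, 1]

# Logistic regression with PC1 as a single predictor.
fit <- glm(y ~ pc1, family = binomial)
```

Here prop_var[1] corresponds to the "PC1 explained ..." figure quoted above,
and pc1 holds one score per subject.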

Then, I tried multiple correspondence analysis (MCA). The only
continuous variable was age. I transformed the "age" variable into the
"age_Q" factor variable as follows.

> quantile(mydata.df$age)
   0%   25%   50%   75%  100%
53.00 66.75 72.00 76.25 85.00
> age_Q <- cut(mydata.df$age, right=TRUE, breaks=c(-Inf, 66, 72, 76, Inf),
labels=c("53-66", "67-72", "73-76", "77-85"))
> table(age_Q)
age_Q
53-66 67-72 73-76 77-85
   26    27    25    26
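
As an aside, the same kind of quartile grouping can be sketched with breaks
taken directly from quantile() instead of typed by hand. The ages below are
simulated (hypothetical), and the interval labels are the defaults produced
by cut(), so this is only an illustration of the approach.

```r
# Hypothetical ages standing in for the real data.
set.seed(3)
age <- round(rnorm(104, mean = 71, sd = 7))

# Quartile cut points computed from the data.
breaks <- quantile(age, probs = seq(0, 1, by = 0.25))

# include.lowest = TRUE keeps the minimum age inside the first interval.
age_Q <- cut(age, breaks = breaks, include.lowest = TRUE)

table(age_Q)
```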

Then, I used mjca from the ca package for MCA.

> mjca1 <-  mjca(mydata.df[, c("age_Q","sex","symptom", "HT", "DM",
"IHD","smoking","DL", "Statin")])

> summary(mjca1)

Principal inertias (eigenvalues):

 dim    value      %   cum%   scree plot
 1      0.009592  43.4  43.4  *************************
 2      0.003983  18.0  61.4  **********
 3      0.001047   4.7  66.1  **
 4      0.000367   1.7  67.8
        -------- -----
 Total: 0.022111

Dimension 1 explained 43% of the inertia (variance). I then wondered
which values from MCA I could use in the same way as the PC1 scores from
PCA. I explored mjca1 and found "rowcoord".

> mjca1$rowcoord
              [,1]          [,2]        [,3]         [,4]
  [1,]  0.07403748  0.8963482181  0.10828273  1.581381849
  [2,]  0.92433996 -1.1497911361  1.28872517  0.304065865
  [3,]  0.49833354  0.6482940556 -2.11114314  0.365023261
  [4,]  0.18998290 -1.4028117048 -1.70962159  0.451951744
  [5,] -0.13008173  0.2557656854  1.16561601 -1.012992485
.........................................................
.........................................................
[101,] -1.86940216  0.5918128751  0.87352987 -1.118865117
[102,] -2.19096615  1.2845448725  0.25227354 -0.938612155
[103,]  0.77981265 -1.1931087587  0.23934034  0.627601413
[104,] -2.37058237 -1.4014005013 -0.73578248 -1.455055095

Then, I used mjca1$rowcoord[, 1] as follows.

> mydata.df$NewScore <- mjca1$rowcoord[, 1]

I used this "NewScore" as one of the predictors in the model instead of
the original 9 variables.
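
To make the idea of a per-subject MCA score concrete, here is a base-R
sketch of correspondence analysis on the indicator (dummy) matrix, with the
first-dimension row standard coordinates used as a logistic regression
predictor. The data and the outcome y are simulated and hypothetical. This
is the plain indicator-matrix analysis, not the adjusted analysis that mjca
reports by default, so the numbers need not match mjca1$rowcoord.

```r
# Hypothetical categorical data standing in for the real variables.
set.seed(2)
n <- 104
dat <- data.frame(
  age_Q   = factor(sample(c("53-66", "67-72", "73-76", "77-85"), n, replace = TRUE)),
  sex     = factor(sample(c("F", "M"), n, replace = TRUE)),
  smoking = factor(sample(c("no", "yes"), n, replace = TRUE)),
  DM      = factor(sample(c("no", "yes"), n, replace = TRUE))
)
y <- rbinom(n, size = 1, prob = 0.5)   # hypothetical binary outcome

# Indicator matrix: one 0/1 column per category of each factor.
Z <- do.call(cbind, lapply(dat, function(f) model.matrix(~ f - 1)))

# Correspondence analysis of the indicator matrix via SVD of the
# standardized residuals.
P     <- Z / sum(Z)
rmass <- rowSums(P)   # row masses
cmass <- colSums(P)   # column masses
S     <- diag(1 / sqrt(rmass)) %*% (P - rmass %o% cmass) %*% diag(1 / sqrt(cmass))
sv    <- svd(S)

# Row standard coordinates on dimension 1: one score per subject.
score1 <- sv$u[, 1] / sqrt(rmass)

# Logistic regression with the MCA score as a single predictor.
fit <- glm(y ~ score1, family = binomial)
```

The vector score1 plays the same role here as NewScore above: one value per
row of the data, centered at zero with unit weighted variance.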

The final logistic model obtained with the MCA score was similar to the
one obtained with PC1 from PCA.

My questions are:

1. Is it O.K. to perform PCA on data consisting of 1 continuous
variable and 8 binary variables?

2. Is it O.K. to convert age from a continuous variable to a factor
variable for MCA?

3. Is "mjca1$rowcoord[, 1]" the correct set of values to use as a
predictor in a logistic regression model, like PC1 of PCA?

Thank you in advance for your help.

--
Kohkichi Hosoda


