[R] coxph won't converge when including categorical (factor) variables
E Joffe
ejoffe at hotmail.com
Sun Jul 7 17:01:52 CEST 2013
Hello all,
I have posted a reply to Mark and Jeff out of my own fault sent it to them personally and not the group.
I will re-post the question and Mark's answer so that the community can benefit from his comments.
It seems that my posts are not in accordance with the guidelines and I will try to remedy that as well.
I apologize for the mess I have made !!!
I am afraid I can't post the actual data (as it is Protected Health Information).
In any case here are the question, code, output and Mark's comment (which can be summarized that there is very little information about the pec package within the community and that the best solution would be to try and contact the author of the package - which I will and then repost the answer):
I am attempting to evaluate the prediction error of a coxph model that was built after feature selection with glmnet.
In the preprocessing stage I used na.omit (dataset) to remove NAs. I reconstructed all my factor variables into binary variables with dummies (using model.matrix) I then used glmnet lasso to fit a cox model and select the best performing features. Then I fit a coxph model using only these feature.
When I try to evaluate the model using pec and a bootstrap I get an error that the prediction matrix has wrong dimensions.
In the original dataset there are 313 variables. In the coxph model (glmnet.cox in my code) there are 313 df. However when I run pec the prediction matrix is noted as having 318 variables and therefore doesn't fit the testset which has 313 variables (+the status and time variables).
It seems that pec does something to the variables while retraining the coxph model on the bootstrap samples. I tried setting singular.ok=FALSE in the coxph model without avail.
Here is a summary of key variables (prior to restructuring for glmnet) Followed by the code I ran and the errors:
> summary (trainSet)
time status Age_at_Dx CREATININE
Min. : 0.1429 Min. :0.0000 Min. :14.81 Min. :0.400
1st Qu.: 22.0714 1st Qu.:1.0000 1st Qu.:52.51 1st Qu.:0.800
Median : 52.9286 Median :1.0000 Median :63.79 Median :0.900
Mean : 96.5415 Mean :0.7826 Mean :61.01 Mean :1.042
3rd Qu.:154.6071 3rd Qu.:1.0000 3rd Qu.:73.01 3rd Qu.:1.125
Max. :437.7143 Max. :1.0000 Max. :87.23 Max. :6.000
Performance_Status ALBUMIN Cyto WBC
Min. :0.0000 Min. :0.70 Min. : 1.000 Min. : 0.60
1st Qu.:1.0000 1st Qu.:3.00 1st Qu.: 4.000 1st Qu.: 3.20
Median :1.0000 Median :3.60 Median : 4.000 Median : 9.20
Mean :0.9728 Mean :3.48 Mean : 6.802 Mean : 28.35
3rd Qu.:1.0000 3rd Qu.:4.00 3rd Qu.:11.000 3rd Qu.: 33.10
Max. :4.0000 Max. :5.20 Max. :17.000 Max. :373.00
PRKCD_pT507 HGB Maxblast CD19
Min. :-2.429605 Min. : 5.400 Min. : 0.00 Min. : 0.000
1st Qu.:-0.627005 1st Qu.: 8.500 1st Qu.:24.00 1st Qu.: 1.000
Median :-0.013117 Median : 9.550 Median :40.00 Median : 2.000
Mean : 0.006432 Mean : 9.689 Mean :43.74 Mean : 6.312
3rd Qu.: 0.559782 3rd Qu.:10.800 3rd Qu.:65.25 3rd Qu.: 5.000
Max. : 2.963658 Max. :14.300 Max. :98.00 Max. :98.000
GAPDH CD74 TP53 Fli1
Min. :-2.391021 Min. :-1.7902 Min. :-1.64244 Min. :-3.723143
1st Qu.:-0.662164 1st Qu.:-0.5502 1st Qu.:-0.53685 1st Qu.:-0.559783
Median : 0.003236 Median :-0.1546 Median :-0.10928 Median :-0.002154
Mean : 0.015759 Mean : 0.0235 Mean :-0.02285 Mean :-0.016555
3rd Qu.: 0.632472 3rd Qu.: 0.4759 3rd Qu.: 0.29869 3rd Qu.: 0.554521
Max. : 3.512741 Max. : 3.7226 Max. : 5.39242 Max. : 4.415324
LDH SQSTM0 BM_BLAST CCND3
Min. : 8.4 Min. :-1.56639 Min. : 0.00 Min. :-2.44772
1st Qu.: 562.8 1st Qu.:-0.48762 1st Qu.:31.00 1st Qu.:-0.59042
Median : 952.5 Median :-0.05746 Median :46.50 Median :-0.08412
Mean : 1617.9 Mean :-0.02272 Mean :50.64 Mean :-0.02506
3rd Qu.: 1596.8 3rd Qu.: 0.33718 3rd Qu.:72.00 3rd Qu.: 0.48353
Max. :36708.0 Max. : 3.59456 Max. :98.00 Max. : 3.58761
GAB2 BAX ABS_BLST PRKAA1_2
Min. :-3.201564 Min. :-3.230737 Min. : 0.0 Min. :-1.90671
1st Qu.:-0.640279 1st Qu.:-0.593174 1st Qu.: 99.8 1st Qu.:-0.66896
Median : 0.007018 Median :-0.106097 Median : 1836.5 Median :-0.10586
Mean :-0.013284 Mean :-0.007051 Mean : 15024.1 Mean :-0.03531
3rd Qu.: 0.635609 3rd Qu.: 0.588098 3rd Qu.: 11178.0 3rd Qu.: 0.44906
Max. : 2.396419 Max. : 3.439942 Max. :358080.0 Max. : 3.60037
RPS6KB1 SPP1 MAPT ARC
Min. :-2.05490 Min. :-1.39728 Min. :-1.826101 Min. :-2.67081
1st Qu.:-0.65736 1st Qu.:-0.55249 1st Qu.:-0.573470 1st Qu.:-0.63908
Median :-0.12007 Median :-0.15156 Median :-0.046306 Median :-0.07845
Mean :-0.06115 Mean : 0.02759 Mean : 0.009604 Mean :-0.01082
3rd Qu.: 0.49055 3rd Qu.: 0.35195 3rd Qu.: 0.495406 3rd Qu.: 0.63638
Max. : 3.54709 Max. : 4.44919 Max. : 4.063568 Max. : 2.52810
FOXO1_pT24_FOXO3_pT32 ATG7 SMAD5 RPS6_pS235_236
Min. :-2.2506 Min. :-2.916018 Min. :-2.321250 Min. :-3.229642
1st Qu.:-0.5590 1st Qu.:-0.621408 1st Qu.:-0.480924 1st Qu.:-0.633143
Median :-0.1281 Median : 0.003795 Median : 0.003186 Median :-0.003345
Mean :-0.0121 Mean :-0.029199 Mean : 0.010663 Mean :-0.015028
3rd Qu.: 0.4415 3rd Qu.: 0.590721 3rd Qu.: 0.502824 3rd Qu.: 0.660224
Max. : 3.1239 Max. : 2.757525 Max. : 2.238310 Max. : 2.816105
LSD1 HDAC2 SFN
Min. :-4.5434 Min. :-2.40682 Min. :-2.082592
1st Qu.:-0.5654 1st Qu.:-0.49165 1st Qu.:-0.554862
Median : 0.0477 Median : 0.02828 Median :-0.116248
Mean :-0.0165 Mean : 0.04641 Mean : 0.006255
3rd Qu.: 0.6044 3rd Qu.: 0.47826 3rd Qu.: 0.506527
Max. : 1.8736 Max. : 4.22724 Max. : 3.733445
library (survival)
library (pec)
library (glmnet)
##----------------------------------------------------------------------------------- -------------------
## FEATURE SELECTION BY GLMNET-LASSO
##---------------------------------------------------------------------------------- ---------------------
##
## This function takes a predictor matrix and a Surv obj and uses the glmnet lasso regularization
## method to select the most predictive features and create a coxph object
## predict_matrix is the 'dataset' reformated to replace categorical variables to binary with dummy
#creat Y (survival matrix) for glmnet
surv_obj <- Surv(trainSet$time,trainSet$status)
## tranform categorical variables into binary variables with dummy for trainSet
predict_matrix <- model.matrix(~ ., data=trainSet,
contrasts.arg = lapply (trainSet[,sapply(trainSet, is.factor)], contrasts, contrasts=FALSE))
## remove the status/time variables from the predictor matrix (x) for glmnet
predict_matrix <- subset (predict_matrix, select=c(-time,-status))
## create a glmnet cox object using lasso regularization
glmnet.obj <- glmnet (predict_matrix, surv_obj, family="cox")
# find lambda for which dev.ratio is max
max.dev.index <- which.max(glmnet.obj$dev.ratio)
optimal.lambda <- glmnet.obj$lambda[max.dev.index]
# take beta for optimal lambda
optimal.beta <- glmnet.obj$beta[,max.dev.index]
# find non zero beta coef
nonzero.coef <- abs(optimal.beta)>0
selectedBeta <- optimal.beta[nonzero.coef]
# take only covariates for which beta is not zero
selectedVar <- predict_matrix[,nonzero.coef]
# create a dataframe for trainSet with time, status and selected variables in binary representation for evaluation in pec
reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar))
# create coxph object with pre-defined coefficients
glmnet.cox <- coxph(surv_obj ~ selectedVar,data = reformat_dataSet,model=TRUE, singular.ok=FALSE,init=selectedBeta,iter=0)
##------------------------------------------------------------------------------------ -------------------
## MODEL PERFORMANCE
##----------------------------------------------------------------------------------- --------------------
##
set.seed(17743)
pec.f <- as.formula(Hist(time,status) ~ 1)
## Brier score for the cox-glmnet model
brierGlmnet <- pec(list(glmnet.cox), data= reformat_dataSet , splitMethod="BootCV", B=50)
Here is the result:
Prediction matrix has wrong dimensions:
368 rows and 318 columns. --->(but there are only 313 predicting vars in the dataset)
But requested are predicted probabilities for
138 subjects (rows) in newdata and 315 time points (columns) --- (315=313 predicting vars+status+time)
This may happen when some covariate values are missing in newdata!?
Warning in FUN(1:2[[2L]], ...) :
Oddly, when I call print on the glmnet.cox object I get that it has 313 variables and not as pec outputs (318) print(glmnet.cox) Likelihood ratio test=0 on 313 df, p=1 n= 368, number of events= 288
Here is Mark's response:
I suspect that you are going to need to contact the author/maintainer of the pec package for detailed assistance. From a search of the R list archives, there are not that many posts pertaining to the package, which is generally an indication of the breadth of usage within the community. The lack of relevant posts would likely correlate to the low number of users who may be in a position to assist you. Thus, the package author/maintainer in this case, will likely be the best and most expedient resource.
I hope that you find the above useful in some manner.
Regards,
Marc
-----Original Message-----
From: Marc Schwartz [mailto:marc_schwartz at me.com]
Sent: Saturday, July 06, 2013 8:46 AM
To: E Joffe
Cc: r-help at r-project.org
Subject: Re: [R] coxph won't converge when including categorical (factor) variables
On Jul 6, 2013, at 7:04 AM, E Joffe <ejoffe at hotmail.com> wrote:
> Hello,
>
>
>
> [rephrasing and reposting of a previous question (that was not
> answered) with new information]
>
>
>
> I have a dataset of 371 observations.
>
> When I run coxph with numeric variables it works fine.
>
> However, when I try to add factor (categorical) variables it returns
> "Ran out of iterations and the model did not converge"
>
>
>
> Of note, when I restructure all factors to binary variables with dummy
> and use glmnet-lasso the model converges.
>
>
>
> Here are examples of the code and output (including summary
> description of the variables):
>
>> maxSTree.cox <- coxph (Surv(time,status)~Chemo_Simple, data=dataset)
>
>
>
> Warning message:
>
> In fitter(X, Y, strats, offset, init, control, weights = weights, :
>
> Ran out of iterations and did not converge
>
>
>
>> summary (dataset$Chemo_Simple)
>
> Anthra_HDAC Anthra_Plus ArsenicAtra
> ATRA ATRA_GO
>
> 0 163 2 12
> 0 2
>
> ATRA_IDA Demeth_HistoneDAC Flu_HDAC Flu_HDAC_plus
> HDAC_Clof HDAC_only
>
> 0 34 37 4
> 24 1
>
> HDAC_Plus LowArac LowDAC_Clof MYLO_IL11
> Phase1
>
> 4 8 30 5
> 5
>
> SCT StdARAC_Anthra StdAraC_Plus Targeted
> VNP40101M
>
> 0 0 0 13
> 23
>
>
>
>
>
> HELP !!!!
>
You have 371 observations, but did not indicate how many events you have in that dataset. A cross tabulation of the 'status' factor with Chemo_Simple is likely to be enlightening to get a sense of the distribution of events for each level of Chemo_Simple.
Chemo_Simple has a number of levels with rather low counts (ignoring the levels with 0's for the moment), which is likely to be a part of the problem. There is likely to be an issue fitting the model for some of these levels, bearing in mind that your "reference level" of Anthra_HDAC has a large proportion of the observations and with the default treatment contrasts, each of the other levels will be compared to it.
You should also run:
dataset$Chemo_Simple <- factor(dataset$Chemo_Simple)
to get rid of the unused factor levels. A quick check of the coxph() code versus, for example, lm()/glm(), reveals that while the latter will drop unused factor levels when creating the internal model dataframe, the former will not. A check of some test output here with coxph() suggests that the unused factor levels will simply appear as NA's in the coxph() output without affecting the levels that are present, but it will be cleaner to remove them a priori.
It is also likely that you don't have enough events to handle the effective covariate degrees of freedom that this single factor model will have. By my count (the formatting above is corrupted), you have 16 non-zero factor levels, which would be 15 covariate degrees of freedom consumed by this single factor. A general rule of thumb for Cox models (and logistic regression) is to have 20 events per covariate degree of freedom to avoid overfitting, which means that you would really need 300 "events". In this context, events are the smaller count of the two possible levels of "status". Since you only have 371 total observations, there is no way that you could have 300 events, since you would need to have at least 600 observations. Thus, it is likely that you are attempting to overfit the model to the data as well.
The issue of using glmnet on a matrix of separate dummy variables and it working is likely to be an outcome of the LASSO method penalizing/shrinking the factor levels that are irrelevant to have coefficients of 0. Thus, I would envision that the effective number of your factor levels is reduced as a consequence, allowing the model to be fit.
You likely need to consider collapsing some of the low count factor levels into an "Other" category, if it makes contextual sense to do so for your data.
Regards,
Marc Schwartz
More information about the R-help
mailing list