[R] coxph won't converge when including categorical (factor) variables

Thu Jul 11 15:16:34 CEST 2013

I have found the answer for my second question and am posting here for the benefit of the community:

pec accepts only R-objects for which a predictSurvProb method exists and glmnet is not such an object.

    Currently, predictSurvProb methods are available for the following R-objects:
    matrix
    aalen, cox.aalen from library(timereg)
    mfp from library(mfp)
    phnnet, survnnet from library(survnnet)
    rpart (from library(rpart))
    coxph, survfit from library(survival)
    cph, psm from library(rms)
    prodlim from library(prodlim)
    glm from library(stats)

For calculating the brier score for glmnet one needs to use the peperr package with the c060 library that wraps glmnet as an object suitable for peperr.

    peperr_glmnet_noerror <- peperr(response=Surv(time, status), x=x, 
                        fit.fun=fit.glmnet, args.fit=list(family="cox"),
                        complexity=complexity.glmnet,args.complexity=list(family="cox"),
                        indices=resample.indices(n=length(time), method="boot", sample.n=10))

To get the integrated brier score for the entire model it seems one needs to use the ipec function but I still need to research that.

Many thanks to Thomas Hielscher who authored the c060 package and was extremely kind to help me with this.

-----Original Message-----
From: E Joffe [mailto:ejoffe at hotmail.com] 
Sent: Sunday, July 07, 2013 10:02 AM
To: 'Marc Schwartz'
Cc: 'r-help at r-project.org'
Subject: RE: [R] coxph won't converge when including categorical (factor) variables

Hello all,

I have posted a reply to Mark and Jeff out of my own fault sent it to them personally and not the group.
I will re-post the question and Mark's answer so that the community can benefit from his comments.
It seems that my posts are not in accordance with the guidelines and I will try to remedy that as well.
I apologize for the mess I have made !!!
I am afraid I can't post the actual data (as it is Protected Health Information).

In any case here are the question, code, output and Mark's comment (which can be summarized that there is very little information about the pec package within the community and that the best solution would be to try and contact the author of the package - which I will and then repost the answer):

I am attempting to evaluate the prediction error of a coxph model that was built after feature selection with glmnet.

In the preprocessing stage I used na.omit (dataset) to remove NAs. I reconstructed all my factor variables into binary variables with dummies (using model.matrix) I then used glmnet lasso to fit a cox model and select the best performing features. Then I fit a coxph model using only these feature.

When I try to evaluate the model using pec and a bootstrap I get an error that the prediction matrix has wrong dimensions.

In the original dataset there are 313 variables. In the coxph model (glmnet.cox in my code) there are 313 df. However when I run pec the prediction matrix is noted as having 318 variables and therefore doesn't fit the testset which has 313 variables (+the status and time variables).

It seems that pec does something to the variables while retraining the coxph model on the bootstrap samples. I tried setting singular.ok=FALSE in the coxph model without avail.

Here is a summary of key variables (prior to restructuring for glmnet) Followed by the code I ran and the errors:

> summary (trainSet)
     time              status         Age_at_Dx       CREATININE   
 Min.   :  0.1429   Min.   :0.0000   Min.   :14.81   Min.   :0.400  
 1st Qu.: 22.0714   1st Qu.:1.0000   1st Qu.:52.51   1st Qu.:0.800  
 Median : 52.9286   Median :1.0000   Median :63.79   Median :0.900  
 Mean   : 96.5415   Mean   :0.7826   Mean   :61.01   Mean   :1.042  
 3rd Qu.:154.6071   3rd Qu.:1.0000   3rd Qu.:73.01   3rd Qu.:1.125  
 Max.   :437.7143   Max.   :1.0000   Max.   :87.23   Max.   :6.000  
 Performance_Status    ALBUMIN          Cyto             WBC        
 Min.   :0.0000     Min.   :0.70   Min.   : 1.000   Min.   :  0.60  
 1st Qu.:1.0000     1st Qu.:3.00   1st Qu.: 4.000   1st Qu.:  3.20  
     Median :1.0000     Median :3.60   Median : 4.000   Median :  9.20  
 Mean   :0.9728     Mean   :3.48   Mean   : 6.802   Mean   : 28.35  
 3rd Qu.:1.0000     3rd Qu.:4.00   3rd Qu.:11.000   3rd Qu.: 33.10  
 Max.   :4.0000     Max.   :5.20   Max.   :17.000   Max.   :373.00  
  PRKCD_pT507             HGB            Maxblast          CD19       
 Min.   :-2.429605   Min.   : 5.400   Min.   : 0.00   Min.   : 0.000  
  1st Qu.:-0.627005   1st Qu.: 8.500   1st Qu.:24.00   1st Qu.: 1.000  
 Median :-0.013117   Median : 9.550   Median :40.00   Median : 2.000  
 Mean   : 0.006432   Mean   : 9.689   Mean   :43.74   Mean   : 6.312  
 3rd Qu.: 0.559782   3rd Qu.:10.800   3rd Qu.:65.25   3rd Qu.: 5.000  
 Max.   : 2.963658   Max.   :14.300   Max.   :98.00   Max.   :98.000  
     GAPDH                CD74              TP53               Fli1          
 Min.   :-2.391021   Min.   :-1.7902   Min.   :-1.64244   Min.   :-3.723143  
 1st Qu.:-0.662164   1st Qu.:-0.5502   1st Qu.:-0.53685   1st Qu.:-0.559783  
 Median : 0.003236   Median :-0.1546   Median :-0.10928   Median :-0.002154  
 Mean   : 0.015759   Mean   : 0.0235   Mean   :-0.02285   Mean   :-0.016555  
 3rd Qu.: 0.632472   3rd Qu.: 0.4759   3rd Qu.: 0.29869   3rd Qu.: 0.554521  
 Max.   : 3.512741   Max.   : 3.7226   Max.   : 5.39242   Max.   : 4.415324  
      LDH              SQSTM0            BM_BLAST         CCND3         
 Min.   :    8.4   Min.   :-1.56639   Min.   : 0.00   Min.   :-2.44772  
 1st Qu.:  562.8   1st Qu.:-0.48762   1st Qu.:31.00   1st Qu.:-0.59042  
 Median :  952.5   Median :-0.05746   Median :46.50   Median :-0.08412  
 Mean   : 1617.9   Mean   :-0.02272   Mean   :50.64   Mean   :-0.02506  
 3rd Qu.: 1596.8   3rd Qu.: 0.33718   3rd Qu.:72.00   3rd Qu.: 0.48353  
 Max.   :36708.0   Max.   : 3.59456   Max.   :98.00   Max.   : 3.58761  
      GAB2                BAX               ABS_BLST           PRKAA1_2       
 Min.   :-3.201564   Min.   :-3.230737   Min.   :     0.0   Min.   :-1.90671  
 1st Qu.:-0.640279   1st Qu.:-0.593174   1st Qu.:    99.8   1st Qu.:-0.66896  
 Median : 0.007018   Median :-0.106097   Median :  1836.5   Median :-0.10586  
 Mean   :-0.013284   Mean   :-0.007051   Mean   : 15024.1   Mean   :-0.03531  
  3rd Qu.: 0.635609   3rd Qu.: 0.588098   3rd Qu.: 11178.0   3rd Qu.: 0.44906  
 Max.   : 2.396419   Max.   : 3.439942   Max.   :358080.0   Max.   : 3.60037  
    RPS6KB1              SPP1               MAPT                ARC          
 Min.   :-2.05490   Min.   :-1.39728   Min.   :-1.826101   Min.   :-2.67081  
 1st Qu.:-0.65736   1st Qu.:-0.55249   1st Qu.:-0.573470   1st Qu.:-0.63908  
 Median :-0.12007   Median :-0.15156   Median :-0.046306   Median :-0.07845  
 Mean   :-0.06115   Mean   : 0.02759   Mean   : 0.009604   Mean   :-0.01082  
 3rd Qu.: 0.49055   3rd Qu.: 0.35195   3rd Qu.: 0.495406   3rd Qu.: 0.63638  
  Max.   : 3.54709   Max.   : 4.44919   Max.   : 4.063568   Max.   : 2.52810  
  FOXO1_pT24_FOXO3_pT32      ATG7               SMAD5           RPS6_pS235_236     
  Min.   :-2.2506       Min.   :-2.916018   Min.   :-2.321250   Min.   :-3.229642  
  1st Qu.:-0.5590       1st Qu.:-0.621408   1st Qu.:-0.480924   1st Qu.:-0.633143  
  Median :-0.1281       Median : 0.003795   Median : 0.003186   Median :-0.003345  
  Mean   :-0.0121       Mean   :-0.029199   Mean   : 0.010663   Mean   :-0.015028  
  3rd Qu.: 0.4415       3rd Qu.: 0.590721   3rd Qu.: 0.502824   3rd Qu.: 0.660224  
  Max.   : 3.1239       Max.   : 2.757525   Max.   : 2.238310   Max.   : 2.816105  
      LSD1             HDAC2               SFN           
  Min.   :-4.5434   Min.   :-2.40682   Min.   :-2.082592  
  1st Qu.:-0.5654   1st Qu.:-0.49165   1st Qu.:-0.554862  
  Median : 0.0477   Median : 0.02828   Median :-0.116248  
  Mean   :-0.0165   Mean   : 0.04641   Mean   : 0.006255  
  3rd Qu.: 0.6044   3rd Qu.: 0.47826   3rd Qu.: 0.506527  
  Max.   : 1.8736   Max.   : 4.22724   Max.   : 3.733445

library (survival)
library (pec)
library (glmnet)

  ##-----------------------------------------------------------------------------------         -------------------
   ##                                    FEATURE SELECTION BY GLMNET-LASSO
   ##----------------------------------------------------------------------------------     ---------------------
   ## 
   ## This function takes a predictor matrix and a Surv obj and uses the glmnet lasso      regularization 
   ## method to select the most predictive features and create a coxph object
   ## predict_matrix is the 'dataset' reformated to replace categorical variables to     binary with dummy

#creat Y (survival matrix) for glmnet
  surv_obj <- Surv(trainSet$time,trainSet$status) 

 ## tranform categorical variables into binary variables with dummy for trainSet
  predict_matrix <- model.matrix(~ ., data=trainSet, 
                                 contrasts.arg = lapply (trainSet[,sapply(trainSet,     is.factor)], contrasts, contrasts=FALSE))

  ## remove the status/time variables from the predictor matrix (x) for glmnet
  predict_matrix <- subset (predict_matrix, select=c(-time,-status))

  ## create a glmnet cox object using lasso regularization
  glmnet.obj <- glmnet (predict_matrix, surv_obj, family="cox")

  # find lambda for which dev.ratio is max
  max.dev.index <- which.max(glmnet.obj$dev.ratio) 
      optimal.lambda <- glmnet.obj$lambda[max.dev.index] 

  # take beta for optimal lambda
  optimal.beta  <- glmnet.obj$beta[,max.dev.index] 

  # find non zero beta coef
  nonzero.coef <- abs(optimal.beta)>0
  selectedBeta <- optimal.beta[nonzero.coef] 

  # take only covariates for which beta is not zero 
  selectedVar   <- predict_matrix[,nonzero.coef] 

 # create a dataframe for trainSet with time, status and selected variables in binary      representation for evaluation in pec
 reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar))

  # create coxph object with pre-defined coefficients 
   glmnet.cox <- coxph(surv_obj ~ selectedVar,data = reformat_dataSet,model=TRUE,      singular.ok=FALSE,init=selectedBeta,iter=0)

 ##------------------------------------------------------------------------------------   -------------------
  ##                                    MODEL PERFORMANCE
  ##-----------------------------------------------------------------------------------    --------------------
  ##
  set.seed(17743)
  pec.f <- as.formula(Hist(time,status) ~ 1)

   ## Brier score for the cox-glmnet model
   brierGlmnet <- pec(list(glmnet.cox), data= reformat_dataSet , splitMethod="BootCV",  B=50)

Here is the result:

Prediction matrix has wrong dimensions:
368 rows and 318 columns.           --->(but there are only 313 predicting vars     in the dataset)
 But requested are predicted probabilities for
 138 subjects (rows) in newdata and 315 time points (columns) ---à (315=313 predicting     vars+status+time)
This may happen when some covariate values are missing in newdata!?
Warning in FUN(1:2[[2L]], ...) :

Oddly, when I call print on the glmnet.cox object I get that it has 313 variables and not as pec outputs (318) print(glmnet.cox) Likelihood ratio test=0 on 313 df, p=1 n= 368, number of events= 288

Here is Mark's response:

I suspect that you are going to need to contact the author/maintainer of the pec package for detailed assistance. From a search of the R list archives, there are not that many posts pertaining to the package, which is generally an indication of the breadth of usage within the community. The lack of relevant posts would likely correlate to the low number of users who may be in a position to assist you. Thus, the package author/maintainer in this case, will likely be the best and most expedient resource.

I hope that you find the above useful in some manner.

Regards,

Marc

-----Original Message-----
From: Marc Schwartz [mailto:marc_schwartz at me.com]
Sent: Saturday, July 06, 2013 8:46 AM
To: E Joffe
Cc: r-help at r-project.org
Subject: Re: [R] coxph won't converge when including categorical (factor) variables

On Jul 6, 2013, at 7:04 AM, E Joffe <ejoffe at hotmail.com> wrote:

> Hello,
> 
> 
> 
> [rephrasing and reposting  of a previous question (that was not
> answered) with new information]
> 
> 
> 
> I have a dataset of 371 observations.
> 
> When I run coxph with numeric variables it works fine.
> 
> However, when I try to add factor (categorical) variables it returns 
> "Ran out of iterations and the model did not converge"
> 
> 
> 
> Of note, when I restructure all factors to binary variables with dummy 
> and use glmnet-lasso the model converges.
> 
> 
> 
> Here are examples of the code and output (including summary 
> description of the variables):
> 
>> maxSTree.cox <- coxph (Surv(time,status)~Chemo_Simple, data=dataset)
> 
> 
> 
> Warning message:
> 
> In fitter(X, Y, strats, offset, init, control, weights = weights,  :
> 
>  Ran out of iterations and did not converge
> 
> 
> 
>> summary (dataset$Chemo_Simple)
> 
>                        Anthra_HDAC       Anthra_Plus       ArsenicAtra
> ATRA           ATRA_GO 
> 
>                0               163                 2                12
> 0                 2 
> 
>         ATRA_IDA Demeth_HistoneDAC          Flu_HDAC     Flu_HDAC_plus
> HDAC_Clof         HDAC_only 
> 
>                0                34                37                 4
> 24                 1 
> 
>        HDAC_Plus        LowArac       LowDAC_Clof         MYLO_IL11
> Phase1
> 
>                4                 8                30                 5
> 5
> 
>              SCT    StdARAC_Anthra      StdAraC_Plus          Targeted
> VNP40101M
> 
>                0                 0                 0                13
> 23
> 
> 
> 
> 
> 
> HELP !!!!
> 

You have 371 observations, but did not indicate how many events you have in that dataset. A cross tabulation of the 'status' factor with Chemo_Simple is likely to be enlightening to get a sense of the distribution of events for each level of Chemo_Simple.

Chemo_Simple has a number of levels with rather low counts (ignoring the levels with 0's for the moment), which is likely to be a part of the problem. There is likely to be an issue fitting the model for some of these levels, bearing in mind that your "reference level" of Anthra_HDAC has a large proportion of the observations and with the default treatment contrasts, each of the other levels will be compared to it.

You should also run:

  dataset$Chemo_Simple <- factor(dataset$Chemo_Simple)

to get rid of the unused factor levels. A quick check of the coxph() code versus, for example, lm()/glm(), reveals that while the latter will drop unused factor levels when creating the internal model dataframe, the former will not. A check of some test output here with coxph() suggests that the unused factor levels will simply appear as NA's in the coxph() output without affecting the levels that are present, but it will be cleaner to remove them a priori.

It is also likely that you don't have enough events to handle the effective covariate degrees of freedom that this single factor model will have. By my count (the formatting above is corrupted), you have 16 non-zero factor levels, which would be 15 covariate degrees of freedom consumed by this single factor. A general rule of thumb for Cox models (and logistic regression) is to have 20 events per covariate degree of freedom to avoid overfitting, which means that you would really need 300 "events". In this context, events are the smaller count of the two possible levels of "status". Since you only have 371 total observations, there is no way that you could have 300 events, since you would need to have at least 600 observations. Thus, it is likely that you are attempting to overfit the model to the data as well.

The issue of using glmnet on a matrix of separate dummy variables and it working is likely to be an outcome of the LASSO method penalizing/shrinking the factor levels that are irrelevant to have coefficients of 0. Thus, I would envision that the effective number of your factor levels is reduced as a consequence, allowing the model to be fit.

You likely need to consider collapsing some of the low count factor levels into an "Other" category, if it makes contextual sense to do so for your data. 

Regards,

Marc Schwartz