[R] coxph won't converge when including categorical (factor) variables
Marc Schwartz
marc_schwartz at me.com
Sat Jul 6 15:46:06 CEST 2013
On Jul 6, 2013, at 7:04 AM, E Joffe <ejoffe at hotmail.com> wrote:
> Hello,
>
>
>
> [rephrasing and reposting of a previous question (that was not answered)
> with new information]
>
>
>
> I have a dataset of 371 observations.
>
> When I run coxph with numeric variables it works fine.
>
> However, when I try to add factor (categorical) variables it returns "Ran
> out of iterations and the model did not converge"
>
>
>
> Of note, when I restructure all factors to binary variables with dummy and
> use glmnet-lasso the model converges.
>
>
>
> Here are examples of the code and output (including summary description of
> the variables):
>
>> maxSTree.cox <- coxph (Surv(time,status)~Chemo_Simple, data=dataset)
>
>
>
> Warning message:
>
> In fitter(X, Y, strats, offset, init, control, weights = weights, :
>
> Ran out of iterations and did not converge
>
>
>
>> summary (dataset$Chemo_Simple)
>
> Anthra_HDAC Anthra_Plus ArsenicAtra
> ATRA ATRA_GO
>
> 0 163 2 12
> 0 2
>
> ATRA_IDA Demeth_HistoneDAC Flu_HDAC Flu_HDAC_plus
> HDAC_Clof HDAC_only
>
> 0 34 37 4
> 24 1
>
> HDAC_Plus LowArac LowDAC_Clof MYLO_IL11
> Phase1
>
> 4 8 30 5
> 5
>
> SCT StdARAC_Anthra StdAraC_Plus Targeted
> VNP40101M
>
> 0 0 0 13
> 23
>
>
>
>
>
> HELP !!!!
>
You have 371 observations, but did not indicate how many events you have in that dataset. A cross tabulation of the 'status' factor with Chemo_Simple is likely to be enlightening to get a sense of the distribution of events for each level of Chemo_Simple.
Chemo_Simple has a number of levels with rather low counts (ignoring the levels with 0's for the moment), which is likely to be a part of the problem. There is likely to be an issue fitting the model for some of these levels, bearing in mind that your "reference level" of Anthra_HDAC has a large proportion of the observations and with the default treatment contrasts, each of the other levels will be compared to it.
You should also run:
dataset$Chemo_Simple <- factor(dataset$Chemo_Simple)
to get rid of the unused factor levels. A quick check of the coxph() code versus, for example, lm()/glm(), reveals that while the latter will drop unused factor levels when creating the internal model dataframe, the former will not. A check of some test output here with coxph() suggests that the unused factor levels will simply appear as NA's in the coxph() output without affecting the levels that are present, but it will be cleaner to remove them a priori.
It is also likely that you don't have enough events to handle the effective covariate degrees of freedom that this single factor model will have. By my count (the formatting above is corrupted), you have 16 non-zero factor levels, which would be 15 covariate degrees of freedom consumed by this single factor. A general rule of thumb for Cox models (and logistic regression) is to have 20 events per covariate degree of freedom to avoid overfitting, which means that you would really need 300 "events". In this context, events are the smaller count of the two possible levels of "status". Since you only have 371 total observations, there is no way that you could have 300 events, since you would need to have at least 600 observations. Thus, it is likely that you are attempting to overfit the model to the data as well.
The issue of using glmnet on a matrix of separate dummy variables and it working is likely to be an outcome of the LASSO method penalizing/shrinking the factor levels that are irrelevant to have coefficients of 0. Thus, I would envision that the effective number of your factor levels is reduced as a consequence, allowing the model to be fit.
You likely need to consider collapsing some of the low count factor levels into an "Other" category, if it makes contextual sense to do so for your data.
Regards,
Marc Schwartz
More information about the R-help
mailing list