[R] covariate selection in cox model (counting process)

Tue Jul 27 16:34:09 CEST 2004

On Tue, 27 Jul 2004, Mayeul KAUFFMANN wrote:

> Thank you a lot for your time and your answer, Thomas. Like all good
> answers, it raised new questions for me ;-)
>
> >In the case of recurrent events coxph() is not
> > using maximum likelihood or even maximum partial likelihood. It is
> > maximising the quantity that (roughly speaking) would be the partial
> > likelihood if the covariates explained all the cluster differences.
>
> I could have non repeating events by removing countries once they have
> experienced a war. But I'm not sure it will change the estimation
> procedure since this will change the dataset only, not the formula
> coxph(Surv(start,stop,status)~x1+x2+...+cluster(id),robust=T)
>
> I am not sure I understood you well: do you really mean "recurrent events"
> alone or "any counting process notation (including allowing for recurrent
> events)".

No, I mean recurrent events.  With counting process notation but no
recurrent revents the partial likelihood is still valid, and the approach
of treating it as a real likelihood for AIC (and presumably BIC) makes
sense.

Roughly speaking, you can't tell there is dependence until you see
multiple events.

>
> I thought the counting process notation did not differ really from the Cox
> model in R, since Terry M. Therneau (A Package for Survival Analysis in S,
> April 22, 1996) concludes his mathematical section "3.3 Cox Model" by "The
> above notation is derived from the counting process representation [...]
> It allows very naturally for several extensions to the original Cox model
> formulation: multiple events per subject, discontinuous intervals of risk
> [...],left truncation." (I used it to introduce 1. time-dependent
> covariates, some covariates changing yearly, other irregularly, and 2.
> left truncation: not all countries existed at the beginning of the study)
>
>
> >In the case of recurrent events coxph() is not
> > using maximum likelihood or even maximum partial likelihood.
>
> Then, what does fit$loglik give in this case? Still a likelihood or a
> valid criterion to maximise ?
> If not, how to get ("manually") the criterion that was maximsed?

fit$loglik gives the criterion that was maximised.  This is the function
of the data that *would be* the partial likelihood if there was no
within-country dependence.

This is a convenient criterion function because it is easy to maximise,
and it is known to give valid (and reasonably efficient) estimates for
what you might call a "proportional rates" model in the case of recurrent
events.  However, it no longer has the same claim to be a "real
likelihood" that the Cox partial likelihood does, because it is not
modelling the dependence.

>
> That's of interest for me since
> > I created artificial covariates measuring the proximity since some
> events: exp(-days.since.event/a.chosen.parameter).
>
> ...and I used fit$loglik to chose a.chosen.parameter from 8 values, for 3
> types of events:

That's fine -- within a single model maximising the criterion function is
valid.  The problem is that you can not assume either that
differences between nested models have a chisquared distribution nor that
the expected change in loglik is the same as the number of parameters.

This means that you don't have any absolute scale for choosing
penalties, which is a problem in model selection -- it is hard to balance the
increase in fit$loglik with the increase in model complexity.

In principle you could use cross-validation to estimate the
cost-complexity tradeoff in these models, but this requires the ability to
compute the criterion function on a subset not included in the model,
which is not entirely straightforward.

	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle