# [R] covariate selection in cox model (counting process)

Mayeul KAUFFMANN mayeul.kauffmann at tiscali.fr
Tue Jul 27 00:28:45 CEST 2004

```
Thank you very much for your time and your answer, Thomas. Like all good
answers, it raised new questions for me ;-)

> In the case of recurrent events coxph() is not
> using maximum likelihood or even maximum partial likelihood. It is
> maximising the quantity that (roughly speaking) would be the partial
> likelihood if the covariates explained all the cluster differences.

I could have non-repeating events by removing countries once they have
experienced a war. But I'm not sure it would change the estimation
procedure, since this changes only the dataset, not the formula
coxph(Surv(start,stop,status)~x1+x2+...+cluster(id),robust=T)

I am not sure I understood you well: do you really mean "recurrent events"
alone, or "any counting process notation (including allowing for recurrent
events)"?

I thought the counting process notation did not really differ from the Cox
model in R, since Terry M. Therneau (A Package for Survival Analysis in S,
April 22, 1996) concludes his mathematical section "3.3 Cox Model" by "The
above notation is derived from the counting process representation [...]
It allows very naturally for several extensions to the original Cox model
formulation: multiple events per subject, discontinuous intervals of risk
[...], left truncation." (I used it to introduce 1. time-dependent
covariates, some changing yearly, others irregularly, and 2. left
truncation: not all countries existed at the beginning of the study.)

> In the case of recurrent events coxph() is not
> using maximum likelihood or even maximum partial likelihood.

Then, what does fit$loglik give in this case? Is it still a likelihood, or
at least a valid criterion to maximise?
If not, how can I get ("manually") the criterion that was maximised?

That's of interest for me since
> I created artificial covariates measuring the proximity since some
> events: exp(-days.since.event/a.chosen.parameter).

...and I used fit$loglik to choose a.chosen.parameter from 8 values, for 3
types of events:

library(survival)  # provides coxph(), Surv() and cluster()
# list of values to choose from
la <- c(263.5, 526.9, 1053.9, 2107.8, 4215.6, 8431.1)
z <- NULL
for (a1 in la) for (a2 in la) for (a3 in la) {
  coxtmp <- coxph(Surv(start, stop, status) ~
      I(exp(-days.since.event.of.type.one/a1))
    + I(exp(-days.since.event.of.type.two/a2))
    + I(exp(-days.since.event.of.type.three/a3))
    + other.time.dependent.covariates
    + cluster(id),
    data=x, robust=TRUE)
  # coxtmp$loglik has two elements: null model first, then fitted model
  z <- rbind(z, c(a1, a2, a3, coxtmp$wald.test, coxtmp$rscore,
                  coxtmp$loglik, coxtmp$score))
}
z <- data.frame(z)
names(z) <- c("a1", "a2", "a3", "wald.test", "rscore",
              "NULLloglik", "loglik", "score")
z[which.max(z$rscore),]  # set c(a1,a2,a3) with the largest robust score
z[which.max(z$loglik),]  # set c(a1,a2,a3) with the largest fitted loglik

The last two commands gave me almost always the same set for c(a1,a2,a3).
But they sometimes differed significantly on some models.

Which criterion (if any?!) should I use to select the best set
c(a1,a2,a3)?

(If you wish to see what the proximity variables look like, run the
following code. The dashed lines show the "half life" of the proximity
variable, here 6 months, which is determined by a.chosen.parameter, e.g.
a1=la[1]:
#start of code
curve(exp(-x/263.5), 0, 8*365.25,
      xlab="number of days since last political regime change (dsrc)",
      ylab="Proximity of political regime change = exp(-dsrc/263.5)",
      las=1)
axis(1, at=365.25/2, labels="(6 months)"); axis(2, at=seq(0,1,.1), las=1)
lines(c(365.25/2,365.25/2,-110), c(-.05,0.5,0.5), lty="dashed")
#end of code)

Thanks a lot again.

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France

```
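
A side note on the grid `la` used in the message: the half-life h of a proximity variable exp(-t/a) solves exp(-h/a) = 0.5, so h = a * log(2). The sketch below assumes (this is inferred from the numbers, it is not stated in the post) that the six values in `la` were chosen to give half-lives of 0.5, 1, 2, 4, 8 and 16 years of 365.25 days. The names `half_life_years`, `a_grid` and `proximity` are hypothetical and only serve the illustration.

```r
# Half-life h of exp(-t/a): exp(-h/a) = 0.5  <=>  h = a * log(2)
days_per_year <- 365.25

# Assumed half-lives in years (an inference from the posted grid)
half_life_years <- c(0.5, 1, 2, 4, 8, 16)

# Decay parameter a (in days) that yields each half-life
a_grid <- half_life_years * days_per_year / log(2)

# Grid as posted in the message, for comparison
la <- c(263.5, 526.9, 1053.9, 2107.8, 4215.6, 8431.1)
print(cbind(half_life_years, a_grid = round(a_grid, 1), la))

# Proximity covariate as defined in the message
proximity <- function(days_since_event, a) exp(-days_since_event / a)

# Evaluated at its own half-life, each proximity variable equals 0.5
print(proximity(half_life_years * days_per_year, a_grid))
```

Working in half-lives rather than in raw values of a may make the grid easier to read and to extend, since each step simply doubles the time it takes for the effect of an event to halve.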