[R] How to use the whole dataset (including between events) in Cox model (time-varying covariates) ?

Mayeul KAUFFMANN mayeul.kauffmann at tiscali.fr
Fri Aug 13 03:03:05 CEST 2004


coxph does not use any information in the dataset between event
times (or "death times"), since computation only occurs at event times.

For instance, removing observations when there is no event at that time in
the whole dataset does not change the results:
> set.seed(1)
> data <- data.frame(start  = c(1:5, 1:5, 1:4),
+                    stop   = c(2:6, 2:6, 2:5),
+                    status = c(rep(0, 7), 1, rep(0, 5), 1),
+                    id     = rep(1:3, times = c(5, 5, 4)),
+                    x1     = rnorm(14))
> data
   start stop status id         x1
1      1    2      0  1 -0.6264538
2      2    3      0  1  0.1836433
3      3    4      0  1 -0.8356286
4      4    5      0  1  1.5952808
5      5    6      0  1  0.3295078
6      1    2      0  2 -0.8204684
7      2    3      0  2  0.4874291
8      3    4      1  2  0.7383247
9      4    5      0  2  0.5757814
10     5    6      0  2 -0.3053884
11     1    2      0  3  1.5117812
12     2    3      0  3  0.3898432
13     3    4      0  3 -0.6212406
14     4    5      1  3 -2.2146999
> library(survival)
> coxfit <- coxph(Surv(start, stop, status) ~ cluster(id) + x1,
+                 data = data, robust = TRUE)
> coxfit
> coxph(Surv(start, stop, status) ~ cluster(id) + x1,
+       data = subset(data, stop %in% 4:5), robust = TRUE) # the same !!! (except n)

First, some data are lost.
Second, this loss can be a serious problem when there is a
time-varying covariate that changes faster than events occur.
Specifically, I have a covariate which has low values most of the time. It
sometimes jumps to high values, and the hypothesis is that these jumps
greatly increase the risk of an event.
With rare events, the effect of this covariate will only be measured at
event times. Chances are that the only time such a covariate is recorded
at a high level, the individual for which it is measured as high is
having an event.
This may bias the estimated coefficient.

Here is my question:
How to fully use the dataset?

(that is: how to have truly _time-varying_ covariates (even if they
change stepwise, not continuously), not covariates whose changes are
measured only at event times)

Ideally, the full dataset would be used to estimate the parameters, or at
least to estimate the standard errors of the estimated parameters.
Any ideas?

A second-best solution (which might require less work) would be to use the
whole dataset to assess the predictive power of the model.

Maybe by using the expected number of events for an individual over the
time interval during which they were observed to be at risk:
> predict(coxfit, type = "expected")
and comparing it with the observed number of events.
(Does this use all the data and take into account the whole baseline
hazard, even between events?)
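For instance, the per-interval expected counts could be aggregated and set against the observed counts (a sketch, assuming the fitted model has been stored as coxfit, as used above):

> expected <- predict(coxfit, type = "expected")  # expected events per row
> sum(expected); sum(data$status)                 # overall expected vs observed
> tapply(expected, data$id, sum)                  # expected events per individual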

Or, if not, following Brian D. Ripley's suggestion about the baseline
hazard: "As an approximation you can smooth the fitted
baseline cumulative hazard (e.g. by package pspline) and ask for its
derivative" (https://stat.ethz.ch/pipermail/r-help/2004-July/052376.html)

the following code could be used to estimate (and plot) a smooth baseline
hazard, for instance with stats::smooth.spline in place of pspline (on a
realistic dataset; the toy data above has too few distinct event times for
smoothing):
> bh <- basehaz(coxfit, centered = FALSE)  # fitted cumulative baseline hazard
> sm <- smooth.spline(bh$time, bh$hazard)  # smooth the cumulative hazard
> t <- seq(min(data$start), max(data$stop), length = 100)
> lines(t, predict(sm, t, deriv = 1)$y)    # its derivative estimates the hazard
# There is a problem with this code: one should add the constraint that the
# baseline hazard cannot be negative.

The following computes the parametric part of the Cox model:
> risk <- predict(coxfit, type = "risk")  # gives exp(X'b)

something like
> baseline.hazard * risk
would give the true risk at any time (but it would probably be much harder
to compute),
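As a rough, self-contained sketch of that product (hypothetical code along the lines of the smoothing suggestion above, truncating the estimated baseline hazard at zero to avoid negative values; again this needs more distinct event times than the toy data has):

> bh   <- basehaz(coxfit, centered = FALSE)             # cumulative baseline hazard
> sm   <- smooth.spline(bh$time, bh$hazard)             # smooth it
> lam0 <- pmax(predict(sm, data$stop, deriv = 1)$y, 0)  # baseline hazard, kept >= 0
> risk <- predict(coxfit, type = "risk")                # exp(X'b) per row
> row.hazard <- lam0 * risk  # estimated hazard for each row at its stop time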

which could help assess the predictive power of the model.
(still a lot of work)

Thanks in advance for any help or comment.

Univ. Pierre Mendes France
Grenoble - France
