[R] results of a survival analysis change when converting the data to counting process format

Fri Aug 23 16:39:49 CEST 2019

Thanks for all those comments, and thanks Terry for the repair!

Tamas

2019. augusztus 23., 14:40:05, írtad:

I've spent some time chasing this down, and another error of similar type.   It's repaired in version 3.0-9  (soon to go to CRAN, or at least that is the plan  --- I'm working through one last fail in the checks.) 

For those who want to know more---

When there is start, stop data such as this:  (0, 101]  (101, 123]   (123, 400], ....  the algorithm used by coxph is to work from largest to smallest time.  Every time it hits the start of an interval (400, 123, 101) that observation is added into the risk set, and whenever it hits the start of an interval the observation is removed (123, 101, 0).    "Add" and "remove" also means "add to the current estimate of the running mean and variance" of the covariates.   At each event time that current mean and variance are are used to update the loglik and its derivatives.

When a subject is diced into little peices of (1,2] (2,3] (3,4], ....  like the last fit below, the sheer number of updates to the running mean/variance can lead to accumulated round off error.   The code works hard to avoid this, and some of that is subtle --- round off error is a tricky business --- and has gone through several refinements.    Nevertheless, the result for this data set should not have been that far off.  Another user example from about the same time was worse so I've been focusing on that one: one of the two splits led to an exp() overflow and one didn't, giving results that were completely different.  This led to a more careful review and some changes that addressed the example below as well.

Terry T.

On 8/23/19 5:00 AM, r-help-request using r-project.org wrote:

On 2019-08-18 19:10, Ferenci Tamas wrote:

Dear All,

Consider the following simple example:

library( survival )

data( veteran )

coef( coxph(Surv(time, status) ~ trt + prior + karno, data = veteran) )

          trt        prior        karno

  0.180197194 -0.005550919 -0.033771018

Note that we have neither time-dependent covariates, nor time-varying

coefficients, so the results should be the same if we change to

counting process format, no matter where we cut the times.

That's true if we cut at event times:

veteran2 <- survSplit( Surv(time, status) ~ trt + prior + karno,

                        data = veteran, cut = unique( veteran$time ) )

coef( coxph(Surv(tstart,time, status) ~ trt + prior + karno, data = veteran2 ) )

          trt        prior        karno

  0.180197194 -0.005550919 -0.033771018

But quite interestingly not true, if we cut at every day:

veteran3 <- survSplit( Surv(time, status) ~ trt + prior + karno,

                        data = veteran, cut = 1:max(veteran$time) )

coef( coxph(Surv(tstart,time, status) ~ trt + prior + karno, data = veteran3 ) )

          trt        prior        karno

  0.180197215 -0.005550913 -0.033771016

The difference is not large, but definitely more than just a rounding

error, or something like that.

What's going on? How can the results get wrong, especially by

including more cutpoints?