[R] results of a survival analysis change when converting the data to counting process format

Fri Aug 23 13:38:19 CEST 2019

I think this is a case of (a - x) + x != a in floating arithmetic. When updating the risk set, you subtract sums of covariates at the end of a time interval, then add them back at the beginning of the next interval. Something like that, anyway. As in

> x <- rnorm(1000)
> sum(c(x,-x))
[1] -1.387779e-17

Or (worse, since sum() is smart and doesn't operate sequentially)

> tt <- 0
> for (i in x) tt <- tt+i
> for (i in x) tt <- tt-i
> tt
[1] 2.553513e-14

-pd

> On 23 Aug 2019, at 11:12 , Göran Broström <goran.brostrom using umu.se> wrote:
> 
> 
> 
> Den 2019-08-22 kl. 21:48, skrev Göran Broström:
>> On 2019-08-18 19:10, Ferenci Tamas wrote:
>>> Dear All,
>>> 
>>> Consider the following simple example:
>>> 
>>> library( survival )
>>> data( veteran )
>>> 
>>> coef( coxph(Surv(time, status) ~ trt + prior + karno, data = veteran) )
>>>           trt        prior        karno
>>>   0.180197194 -0.005550919 -0.033771018
>>> 
>>> Note that we have neither time-dependent covariates, nor time-varying
>>> coefficients, so the results should be the same if we change to
>>> counting process format, no matter where we cut the times.
>>> 
>>> That's true if we cut at event times:
>>> 
>>> veteran2 <- survSplit( Surv(time, status) ~ trt + prior + karno,
>>>                         data = veteran, cut = unique( veteran$time ) )
>>> 
>>> coef( coxph(Surv(tstart,time, status) ~ trt + prior + karno, data = veteran2 ) )
>>>           trt        prior        karno
>>>   0.180197194 -0.005550919 -0.033771018
>>> 
>>> But quite interestingly not true, if we cut at every day:
>>> 
>>> veteran3 <- survSplit( Surv(time, status) ~ trt + prior + karno,
>>>                         data = veteran, cut = 1:max(veteran$time) )
>>> 
>>> coef( coxph(Surv(tstart,time, status) ~ trt + prior + karno, data = veteran3 ) )
>>>           trt        prior        karno
>>>   0.180197215 -0.005550913 -0.033771016
>>> 
>>> The difference is not large, but definitely more than just a rounding
>>> error, or something like that.
>>> 
>>> What's going on? How can the results get wrong, especially by
>>> including more cutpoints?
>> All results are wrong, but they are useful (paraphrasing George EP Box).
> 
> That said, it is a little surprising: The generated risk sets are (should be) identical in all cases, and one would expect rounding errors to be the same. But data get stored differently, and ... who knows?
> 
> I tried your examples on my computer and got exactly the same results as you. Which surprised me.
> 
> G,
> 
>> Göran
>>> 
>>> Thank you in advance,
>>> Tamas
>>> 
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>> 
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com