[R] How to simulate informative censoring in a Cox PH model?

Greg Snow 538280 at gmail.com
Fri Jul 31 18:06:12 CEST 2015


Daniel,

Basically just responding to your last paragraph (the others are
interesting, but I think that you are learning as much as anyone and I
don't currently have any other suggestions).

I am not an expert on copulas, so this is a basic understanding, you
should learn more about them if you choose to use them.  The main idea
of a copula is that it is a bivariate or multivariate distribution
where all the variables have uniform marginal distributions but the
variables are not independent from each other.  How I would suggest
using them is to choose a copula and generate random points from a
bivariate copula, then put those (uniform) values into the inverse pdf
function for the Weibull (or other distribution), one of which is the
event time, the other the censoring time.  This will give you times
that (marginally) come from the distributions of interest, but are not
independent (so would be considered informative censoring).  Repeat
this with different levels of relationship in the copula to see how
much difference it makes in your simulations.

On Thu, Jul 30, 2015 at 2:02 PM, Daniel Meddings <dpmeddings at gmail.com> wrote:
> Thanks Greg once more for taking the time to reply. I certainly agree that
> this is not a simple set-up, although it is realistic I think. In short you
> are correct about model mis-specification being the key to producing more
> biased estimates under informative than under non-informative censoring.
> After looking again at my code and trying various things I realize that the
> key factor that leads to the informative and non-informative censoring data
> giving rise to the same biased estimates is how I generate my Z_i variable,
> and also the magnitude of the Z_i coefficient in both of the event and
> informative censoring models.
>
> In the example I gave I generated Z_i (I think of this as a "poor prognosis"
> variable) from a beta distribution so that it ranged from 0-1. The biased
> estimates for "beta_t_1" (I think of this as the effect of a treatment on
> survival) were approximately 1.56 when the true value was -1. What I forgot
> to mention was that estimating a cox model with 1,000,000 subjects to the
> full data (i.e. no censoring at all) arguably gives the best treatment
> effect estimate possible given that the effects of Z_i and Z_i*Treat_i are
> not in the model. This "best possible" estimate turns out to be 1.55 - i.e.
> the example I gave just so happens to be such that even with 25-27%
> censoring, the estimates obtained are almost the best that can be attained.
>
> My guess is that the informative censoring does not bias the estimate more
> than non-informative censoring because the only variable not accounted for
> in the model is Z_i which does not have a large enough effect "beta_t_2",
> and/or "beta_c_2", or perhaps because Z_i only has a narrow range which does
> not permit the current "beta_t_2" value to do any damage?
>
> To investigate the "beta_t_2", and/or "beta_c_2" issue I changed "beta_c_2"
> from 2 to 7 and "beta_c_0" from 0.2 to -1.2, and "beta_d_0" from -0.7 to
> -0.4 to keep the censoring %'s equal at about 30%. This time the "best
> possible" estimate of "beta_t_1" was -1.53 which was similar to that
> obtained previously. The informative censoring gave an estimate for
> "beta_t_1" of -1.49 whereas the non-informative censoring gave -1.53 - this
> time the non-informative censoring attains the "best possible" but the
> non-informative censoring does not.
>
>
>
> I then instead changed "beta_t_2" from 1 to 7 and "beta_c_0" from 0.2 to 2,
> and "beta_d_0" from -0.7 to -1.9 again to keep the censoring %'s equal at
> about 30%. This time the "best possible" estimate of "beta_t_1" was -0.999
> which is pretty much equal to the true value of -1. The informative
> censoring gave an estimate for "beta_t_1" of -1.09 whereas the
> non-informative censoring gave -0.87 – surprisingly this time the
> informative censoring is slightly closer to the “best possible” than the
> non-informative censoring.
>
>
>
> To investigate the Z_i issue I generated it from a normal distribution with
> mean 1 and variance 1. I changed "beta_c_0 " from 0.2 to -0.5 to keep the
> censoring levels equal (this time about 30% for both). This time the "best
> possible" estimate was -1.98 which was further from -1 than previous
> examples. The informative censoring gave an estimate for "beta_t_1" of -1.81
> whereas the non-informative censoring gave -1.84. So again the informative
> censoring gives an estimate closer to the "best possible" when compared with
> the informative censoring, but this time it does not attain the "best
> possible".
>
> In conclusion it is clear to me that a stronger Z_i effect in the censoring
> model causes the informative censoring to be worse than the non-informative
> one (as expected), but a stronger Z_i effect in the event model does not
> have this effect and even causes the independent censoring to be worse –
> this in general may not hold but I nonetheless see it here. I am wondering
> if this is because altering the treatment effect in the event model also
> affects the independent censoring process and so it “muddies the waters”
> whereas altering the treatment effect in the informative censoring model
> obviously confines the changes to just the informative censoring process.
> For a fixed treatment effect size in both the event and informative
> censoring models the effect of Z_i having a wider range than is possible
> under the beta distribution also appears to produce informative censoring
> that is worse than the non-informative one. This makes sense I think because
> the Z_i-response relationship must be more informative?
>
>
>
> Thanks for your suggestion of copulas – I have not come across these. Is
> this similar to assuming a event model for censored subjects (this is
> unobserved) – i.e. if the event model is different conditional on censoring
> then if we could observe the events beyond censoring then clearly the
> parameter estimates would be different compared to those obtained when
> modelling only non-censored times?
>
>
>
>
> On Wed, Jul 29, 2015 at 5:37 PM, Greg Snow <538280 at gmail.com> wrote:
>>
>> As models become more complex it becomes harder to distinguish
>> different parts and their effects.  Even for a straight forward linear
>> regression model if X1 and X2 are correlated with each other then it
>> becomes difficult to distinguish between the effects of X1^2, X2^2,
>> and X1*X2.  In your case the informative censoring and model
>> misspecification are becoming hard to distinguish (and it could be
>> argued that having informative censoring is really just a form of
>> model misspecification).  So I don't think so much that you are doing
>> things wrong, just that you are learning that the models are complex.
>>
>> Another approach to simulation that you could try is to simulate the
>> event time and censoring time using copulas (and therefore they can be
>> correlated to give informative censoring, but without relying on a
>> term that you could have included in the model) then consider the
>> event censored if the censoring time is shorter.
>>
>> On Fri, Jul 24, 2015 at 10:14 AM, Daniel Meddings <dpmeddings at gmail.com>
>> wrote:
>> > Hi Greg
>> >
>> > Many thanks for taking the time to respond to my query. You are right
>> > about
>> > pointing out the distinction between what variables are and are not
>> > included
>> > in the event times process and in the censoring process. I clearly
>> > forgot
>> > this important aspect. I amended my code to do as you advise and now I
>> > am
>> > indeed getting biased estimates when using the informatively censored
>> > responses. The problem is now that the estimates from the independently
>> > censored responses are the same - i.e. they are just as biased. Thus the
>> > bias seems to be due entirely to model mis-specification and not the
>> > informative censoring.
>> >
>> >
>> > To give a concrete example I simulate event times T_i and censoring
>> > times
>> > C_i from the following models;
>> >
>> >
>> > T_i~ Weibull(lambda_t(x),v_t),    lambda_t(x)=lambda_t*exp( beta_t_0 +
>> > (beta_t_1*Treat_i) + (beta_t_2*Z_i) + (beta_t_3*Treat_i*Z_i)  )
>> >
>> > C_i~ Weibull(lambda_c(x),v_c),    lambda_c(x)=lambda_c*exp( beta_c_0 +
>> > (beta_c_1*Treat_i) + (beta_c_2*Z_i) + (beta_c_3*Treat_i*Z_i)  )
>> >
>> > D_i~Weibull(lambda_d(x),v_D), lambda_d(x)=lamda_d*exp( beta_d_0)
>> >
>> > where ;
>> >
>> > beta_t_0 = 1,  beta_t_1 = -1,   beta_t_2 = 1,  beta_t_3 = -2,
>> > lambda_t=0.5
>> >
>> > beta_c_0 = 0.2,  beta_c_1 = -2,   beta_c_2 = 2,  beta_c_3 = -2,
>> > lambda_c=0.5
>> >
>> > beta_d_0 = -0.7,  lambda_d=0.1
>> >
>> > When I fit the cox model to both the informatively censored responses
>> > and
>> > the independent censored responses I include only the Treatment
>> > covariate in
>> > the model.
>> >
>> > I simulate Treatment from a Bernoulli distribution with p=0.5 and Z_i
>> > from a
>> > beta distribution so that Z ranges from 0 to 1 (I like to think of Z as
>> > a
>> > "poor" prognosis probability where Z_i=1 means a subject is 100% certain
>> > to
>> > have a poor prognosis and Z_i=0 means zero chance). These parameter
>> > choices
>> > give approximately 27% and 25% censoring for the informatively censored
>> > responses (using C_i) and the independent censored responses (using D_i)
>> > respectively. I use N=2000 subjects and 2000 simulation replications.
>> >
>> > The above simulation I get estimates of beta_t_2 of -1.526 and -1.537
>> > for
>> > the informatively censored responses and the independent censored
>> > responses
>> > respectively.
>> >
>> > Furthermore when I fit a cox model to the full responses (no censoring
>> > at
>> > all) I get an estimate of beta_t_2 of -1.542. This represents the best
>> > that
>> > can possibly be done given that Z and Treat*Z are not in the model.
>> > Clearly
>> > censoring is not making much of a difference here - model
>> > mis-specification
>> > dominates.
>> >
>> > I still must be doing something wrong but I cannot figure this one out.
>> >
>> > Thanks
>> >
>> > Dan
>> >
>> >
>> >
>> > On Thu, Jul 23, 2015 at 12:33 AM, Greg Snow <538280 at gmail.com> wrote:
>> >>
>> >> I think that the Cox model still works well when the only information
>> >> in the censoring is conditional on variables in the model.  What you
>> >> describe could be called non-informative conditional on x.
>> >>
>> >> To really see the difference you need informative censoring that
>> >> depends on something not included in the model.  One option would be
>> >> to use copulas to generate dependent data and then transform the
>> >> values using your Weibul.  Or you could generate your event times and
>> >> censoring times based on x1 and x2, but then only include x1 in the
>> >> model.
>> >>
>> >> On Wed, Jul 22, 2015 at 2:20 AM, Daniel Meddings <dpmeddings at gmail.com>
>> >> wrote:
>> >> > I wish to simulate event times where the censoring is informative,
>> >> > and
>> >> > to
>> >> > compare parameter estimator quality from a Cox PH model with
>> >> > estimates
>> >> > obtained from event times generated with non-informative censoring.
>> >> > However
>> >> > I am struggling to do this, and I conclude rather than a technical
>> >> > flaw
>> >> > in
>> >> > my code I instead do not understand what is meant by informative and
>> >> > un-informative censoring.
>> >> >
>> >> > My approach is to simulate an event time T dependent on a vector of
>> >> > covariates x having hazard function
>> >> > h(t|x)=lambda*exp(beta'*x)v*t^{v-1}.
>> >> > This corresponds to T~ Weibull(lambda(x),v), where the scale
>> >> > parameter
>> >> > lambda(x)=lambda*exp(beta'*x) depends on x and the shape parameter v
>> >> > is
>> >> > fixed. I have N subjects where T_{i}~ Weibull(lambda(x_{i}),v_{T}),
>> >> > lambda(x_{i})=lambda_{T}*exp(beta_{T}'*x_{i}), for i=1,...,N. Here I
>> >> > assume
>> >> > the regression coefficients are p-dimensional.
>> >> >
>> >> > I generate informative censoring times C_i~ Weibull(lambda(x_i),v_C),
>> >> > lambda(x_i)=lambda_C*exp(beta_C'*x_i) and compute
>> >> > Y_inf_i=min(T_i,C_i)
>> >> > and
>> >> > a censored flag delta_inf_i=1 if Y_inf_i <= C_i (an observed event),
>> >> > and
>> >> > delta_inf_i=0 if Y_inf_i > C_i (informatively censored: event not
>> >> > observed). I am convinced this is informative censoring because as
>> >> > long
>> >> > as
>> >> > beta_T~=0 and beta_C~=0 then for each subject the data generating
>> >> > process
>> >> > for T and C both depend on x.
>> >> >
>> >> > In contrast I generate non-informative censoring times
>> >> > D_i~Weibull(lambda_D*exp(beta_D),v_D), and compute
>> >> > Y_ninf_i=min(T_i,D_i)
>> >> > and a censored flag delta_ninf_i=1 if Y_ninf_i <= D_i (an observed
>> >> > event),
>> >> > and delta_ninf_i=0 if Y_ninf_i > D_i (non-informatively censored:
>> >> > event
>> >> > not
>> >> > observed). Here beta_D is a scalar. I "scale" the simulation by
>> >> > choosing
>> >> > the lambda_T, lambda_C and lambda_D parameters such that on average
>> >> > T_i<C_i
>> >> > and T_i<D_i to achieve X% of censored subjects for both Y_inf_i and
>> >> > Y_ninf_i.
>> >> >
>> >> > The problem is that even for say 30% censoring (which I think is
>> >> > high),
>> >> > the
>> >> > Cox PH parameter estimates using both Y_inf and Y_ninf are unbiased
>> >> > when
>> >> > I
>> >> > expected the estimates using Y_inf to be biased, and I think I see
>> >> > why:
>> >> > however different beta_C is from beta_T, a censored subject can
>> >> > presumably
>> >> > influence the estimation of beta_T only by affecting the set of
>> >> > subjects
>> >> > at
>> >> > risk at any time t, but this does not change the fact that every
>> >> > single
>> >> > Y_inf_i with delta_inf_i=1 will have been generated using beta_T
>> >> > only.
>> >> > Thus
>> >> > I do not see how my simulation can possibly produce biased estimates
>> >> > for
>> >> > beta_T using Y_inf.
>> >> >
>> >> > But then what is informative censoring if not based on this approach?
>> >> >
>> >> > Any help would be greatly appreciated.
>> >> >
>> >> >         [[alternative HTML version deleted]]
>> >> >
>> >> > ______________________________________________
>> >> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >> > https://stat.ethz.ch/mailman/listinfo/r-help
>> >> > PLEASE do read the posting guide
>> >> > http://www.R-project.org/posting-guide.html
>> >> > and provide commented, minimal, self-contained, reproducible code.
>> >>
>> >>
>> >>
>> >> --
>> >> Gregory (Greg) L. Snow Ph.D.
>> >> 538280 at gmail.com
>> >
>> >
>>
>>
>>
>> --
>> Gregory (Greg) L. Snow Ph.D.
>> 538280 at gmail.com
>
>



-- 
Gregory (Greg) L. Snow Ph.D.
538280 at gmail.com



More information about the R-help mailing list