# [R] covariate selection in cox model (counting process)

Mayeul KAUFFMANN mayeul.kauffmann at tiscali.fr
Wed Jul 28 19:08:35 CEST 2004

```
> If you can get the conditional independence (martingaleness) then, yes,
> BIC is fine.
>
> One way to check might be to see how similar the standard errors are with
> and without the cluster(id) term.

(Thank you "again !", Thomas.)

At first look, the values seemed very similar (see below, case 2).
However, to check this without being too subjective, and without a
specific test, I needed other values to assess the size of the
differences: what is similar, what is not?

===============================================================================
CASE 1
I first estimated the model without modeling dependence:

Call:
coxph(formula = Surv(start, stop, status) ~ cluster(ccode) +
pop + pib + pib2 + crois + instab.x1  + instab.autres, data = xstep)

                 coef exp(coef) se(coef) robust se     z       p
pop            0.3606     1.434   0.0978    0.1182  3.05 2.3e-03
pib           -0.5947     0.552   0.1952    0.1828 -3.25 1.1e-03
pib2          -0.4104     0.663   0.1452    0.1270 -3.23 1.2e-03
crois         -0.0592     0.943   0.0245    0.0240 -2.46 1.4e-02
instab.x1      2.2059     9.079   0.4692    0.4097  5.38 7.3e-08
instab.autres  0.9550     2.599   0.4700    0.4936  1.93 5.3e-02

Likelihood ratio test=74  on 6 df, p=6.2e-14  n= 7286

There seems to be a strong linear relationship between standard errors
(se, or naive se) and robust se.

> summary(lm(sqrt(diag(cox1$var)) ~ sqrt(diag(cox1$naive.var)) - 1))
Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
sqrt(diag(cox1$naive.var))  0.96103    0.04064   23.65 2.52e-06 ***
Multiple R-Squared: 0.9911, Adjusted R-squared: 0.9894

===============================================================================
CASE 2

Then I added a variable (pxcw) measuring the proximity of the previous
event (0 < pxcw < 1).

n= 7286
                 coef exp(coef) se(coef) robust se     z       p
pxcw           0.9063     2.475   0.4267    0.4349  2.08 3.7e-02
pop            0.3001     1.350   0.1041    0.1295  2.32 2.0e-02
pib           -0.5485     0.578   0.2014    0.1799 -3.05 2.3e-03
pib2          -0.4033     0.668   0.1450    0.1152 -3.50 4.6e-04
crois         -0.0541     0.947   0.0236    0.0227 -2.38 1.7e-02
instab.x1      1.9649     7.134   0.4839    0.4753  4.13 3.6e-05
instab.autres  0.8498     2.339   0.4693    0.4594  1.85 6.4e-02

Likelihood ratio test=78.3  on 7 df, p=3.04e-14  n= 7286

                           Estimate Std. Error t value Pr(>|t|)
sqrt(diag(cox1$naive.var))  0.98397    0.02199   44.74 8.35e-09 ***
Multiple R-Squared: 0.997, Adjusted R-squared: 0.9965

The naive standard errors (se) seem closer to the robust se than they were
in case 1, where the dependence was not modeled:
the slope 0.98397 is very close to one, R^2 increased, etc.
The dependence is strong (the risk is multiplied by 2.475 the day after an
event), but conditional independence (given the covariates) seems hard to
reject.

===============================================================================
CASE 3
Finally, I compared these results with those without repeated events
(which gives a smaller dataset). A country is removed as soon as we
observe its first event.
(The robust se is still computed, even though the naive se should in fact
be used here to compute the p-value.)

coxph(formula = Surv(start, stop, status) ~ cluster(ccode) +
pop + pib + pib2 + crois + instab.x1  + instab.autres, data =
xstep[no.previous.event, ])

                 coef exp(coef) se(coef) robust se     z       p
pop            0.4236     1.528   0.1030    0.1157  3.66 2.5e-04
pib           -0.7821     0.457   0.2072    0.1931 -4.05 5.1e-05
pib2          -0.3069     0.736   0.1477    0.1254 -2.45 1.4e-02
crois         -0.0432     0.958   0.0281    0.0258 -1.67 9.5e-02
instab.x1      1.9925     7.334   0.5321    0.3578  5.57 2.6e-08
instab.autres  1.3571     3.885   0.5428    0.5623  2.41 1.6e-02

Likelihood ratio test=66.7  on 6 df, p=1.99e-12  n=5971 (2466 observations
deleted due to missing)

> summary(lm(sqrt(diag(cox1$var)) ~ sqrt(diag(cox1$naive.var)) - 1))
                           Estimate Std. Error t value Pr(>|t|)
sqrt(diag(cox1$naive.var))  0.86682    0.07826   11.08 0.000104 ***
Residual standard error: 0.06328 on 5 degrees of freedom
Multiple R-Squared: 0.9608, Adjusted R-squared: 0.953

There is no evidence that the robust se differs more from the naive se in
case 2 than in case 3 (or case 1).
If anything, the two are closer in case 2.

I conclude that conditional independence (martingaleness) cannot be
rejected in CASE 2, when modeling the dependence between events with a
covariate.

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France

> > Then, there is still another option. In fact, I already modelled
> > explicitly the influence of past events with a "proximity of last event"
> > covariate, assuming the dependence on the last event decreases at a
> > constant rate (for instance, the proximity covariate varies from 1 to 0.5
> > in the first 10 years after an event, then from 0.5 to 0.25 in the next
> > ten years, etc).
> >
> > With a well-chosen model of the dependence effect, the events
> > become conditionally independent, I do not need a +cluster(id) term, and I
> > can use fit$loglik to make a covariate selection based on BIC, right?

```
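
The check used in cases 1 to 3 (regressing the robust standard errors on the naive ones through the origin) can be sketched on public data. The original `xstep` data are not available, so this sketch assumes the `bladder2` dataset shipped with the `survival` package as a stand-in; the covariates `rx`, `number`, `size` and the object names `fit`, `naive_se`, `robust_se`, `ratio_fit` are illustrative, not taken from the post.

```r
library(survival)

# Stand-in data (assumption): bladder2 holds recurrent events in
# counting-process form (start, stop, event) with a subject identifier id.
fit <- coxph(Surv(start, stop, event) ~ rx + number + size + cluster(id),
             data = bladder2)

# With a cluster() term, coxph stores the robust variance in fit$var
# and the model-based (naive) variance in fit$naive.var.
robust_se <- sqrt(diag(fit$var))
naive_se  <- sqrt(diag(fit$naive.var))

# Regression through the origin, as in the post: a slope near 1 means the
# two sets of standard errors are of similar size.
ratio_fit <- lm(robust_se ~ naive_se - 1)
print(coef(ratio_fit))

# Per-coefficient view of the same comparison.
print(cbind(naive_se, robust_se, ratio = robust_se / naive_se))
```

With only a handful of coefficients the regression has very few degrees of freedom, so the per-coefficient ratios are probably the more readable summary.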
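
The "proximity of last event" covariate described in the quoted message (1 to 0.5 over the first 10 years, 0.5 to 0.25 over the next ten) is consistent with an exponential decay with a 10-year half-life. The sketch below assumes exactly that form; the function name `proximity` and its `half_life` argument are hypothetical, and the post does not say how `pxcw` was actually computed or how spells with no previous event were coded.

```r
# Hypothetical helper: proximity of the previous event, assuming exponential
# decay with a fixed half-life (10 years, matching the example in the post).
proximity <- function(years_since_event, half_life = 10) {
  0.5^(years_since_event / half_life)
}

# Value right after an event, then 10 and 20 years later.
print(proximity(c(0, 10, 20)))
```

In a counting-process data set, such a value would be recomputed for each (start, stop] interval from the time elapsed since the subject's previous event.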
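
Once conditional independence is judged plausible, the BIC-based selection mentioned in the quoted message can be computed from `fit$loglik`. This is a minimal sketch, again assuming the `bladder2` data from `survival` as a stand-in; `cox_bic` is a hypothetical helper, and using the number of events (rather than the number of rows) as the sample size in the penalty is an assumption, not something stated in the thread.

```r
library(survival)

# Hypothetical helper: BIC from the partial log-likelihood of a coxph fit.
# fit$loglik holds c(null model, fitted model); the second element is used.
# Assumption: the penalty uses the number of events, fit$nevent.
cox_bic <- function(fit) {
  k <- length(coef(fit))
  -2 * fit$loglik[2] + k * log(fit$nevent)
}

# Two candidate models without a cluster() term (stand-in covariates).
full    <- coxph(Surv(start, stop, event) ~ rx + number + size, data = bladder2)
reduced <- coxph(Surv(start, stop, event) ~ rx + number, data = bladder2)

# The model with the smaller BIC is preferred.
print(c(full = cox_bic(full), reduced = cox_bic(reduced)))
```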