[R] Cox regression model for matched data with replacement

Therneau, Terry M., Ph.D. therneau at mayo.edu
Wed Aug 13 15:24:07 CEST 2014

OK, I will try to give a short tutorial answer.

1. The score statistic for a Cox model is a sum of (x - xbar), where "x" is the covariate 
vector of the subject who had an event, and xbar is the mean covariate vector for the 
population, at that event time.
   - the usual Cox model uses the mean of {everyone still at risk} as xbar
   - matched Cox models use a mean of {some subset of those at risk}, and work fine as
long as that subset is an honest estimate of xbar.  You do, of course, have to sample from
those still at risk at the time point, since that is the xbar you are trying to estimate.
Someone who dies or is censored at time 10 can't be a control at time 20.
   - in an ordinary Cox model the program figures out who belongs in each xbar average all
on its own, using the time variable.  In a matched model you need to supply the "who
dances with who" information.  The usual way is to assign each of the sets {subject who
died + their controls} to a separate stratum.  (If there is only one death in each stratum
then the time variable is not needed and you can plug in a dummy value, the same constant
for everyone; this is what clogit does.)  You can have more than one control per case, by
the way.
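
A minimal sketch of the stratum approach in R, on simulated data.  Everything about the
data is made up for illustration: the data frames "full" and "matched", the variables
"set", "case" and "dummy", the true coefficient of 0.5, the censoring at time 10, and the
choice of two controls per case.  Only coxph, clogit, Surv and strata come from the
survival package.  Controls are drawn from everyone still at risk at the case's event
time, so the same subject can turn up in more than one set.

```r
library(survival)

set.seed(1)
n <- 500
full <- data.frame(id = seq_len(n), x = rnorm(n))
t_event <- rexp(n, rate = 0.1 * exp(0.5 * full$x))  # assumed true log hazard ratio 0.5
full$status <- as.integer(t_event < 10)             # administrative censoring at time 10
full$time <- pmin(t_event, 10)

m <- 2                                              # controls per case (assumed)
cases <- full[full$status == 1, ]
sets <- lapply(seq_len(nrow(cases)), function(i) {
  # risk set at this death time: everyone still under observation, minus the case
  pool <- full$id[full$time >= cases$time[i] & full$id != cases$id[i]]
  ctrl <- pool[sample.int(length(pool), min(m, length(pool)))]
  ids <- c(cases$id[i], ctrl)
  data.frame(set = i, id = ids, x = full$x[ids],    # id equals the row number in full
             case = as.integer(ids == cases$id[i]))
})
matched <- do.call(rbind, sets)
matched$dummy <- 1                                  # constant stand-in for time

# One stratum per matched set; the two calls fit the same model
fit_cox <- coxph(Surv(dummy, case) ~ x + strata(set), data = matched)
fit_clogit <- clogit(case ~ x + strata(set), data = matched)
print(cbind(coxph = coef(fit_cox), clogit = coef(fit_clogit)))
```

With one death per stratum the two fits should agree, which is the sense in which clogit
is a Cox model with a dummy time.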

2. Variance.  In the matched model you run the risk, a quite small risk, that the same
person would be picked again and again as the control.  If this unfortunate thing were to
happen then the usual model-based variance would be too optimistic: because of its
overdependence on one single subject the fit is more unstable than it looks.  Three
solutions: a) don't worry about it (my usual approach),  b) when selecting controls,
ensure that this doesn't happen (classic matched case control),  c) use a robust variance.
For (c), make sure that each subject in the data set has a unique value for some
variable "id" (a subject who appears in several matched sets keeps the same id in every
row) and add "+ cluster(id)" to the model statement.
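
Option (c) might look like the sketch below, again on simulated data.  The data frame and
variable names, the effect size and the two-controls-per-case design are assumptions for
illustration; "id" identifies the person, not the matched set, so a control reused in
several sets carries the same id each time.

```r
library(survival)

set.seed(2)
n <- 300
full <- data.frame(id = seq_len(n), x = rnorm(n))
t_event <- rexp(n, rate = 0.1 * exp(0.5 * full$x))  # assumed true coefficient 0.5
full$status <- as.integer(t_event < 10)             # censor everyone at time 10
full$time <- pmin(t_event, 10)

case_id <- full$id[full$status == 1]
matched <- do.call(rbind, lapply(seq_along(case_id), function(i) {
  # sample from those still at risk, so subjects can be reused across sets
  pool <- full$id[full$time >= full$time[case_id[i]] & full$id != case_id[i]]
  ids <- c(case_id[i], pool[sample.int(length(pool), min(2, length(pool)))])
  data.frame(set = i, id = ids, x = full$x[ids],
             case = as.integer(ids == case_id[i]))
}))
matched$dummy <- 1

print(max(table(matched$id)))   # how often the most-used subject appears

fit_naive <- coxph(Surv(dummy, case) ~ x + strata(set), data = matched)
fit_robust <- coxph(Surv(dummy, case) ~ x + strata(set) + cluster(id),
                    data = matched)

# same coefficient either way; only the standard error changes
print(summary(fit_robust)$coefficients[, c("coef", "se(coef)", "robust se")])
```

Comparing the "se(coef)" and "robust se" columns gives a rough sense of whether reuse of
subjects matters in a given data set.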

3. The most common mistake in matching is to exclude, at a given death time t, any subject 
with a future event from the list of potential controls at time t.  This does not lead to 
an unbiased estimate of xbar, and the resulting numerical bias in the coefficients is 
shockingly large.
   There are more clever ways to pick the subset at each event time, e.g., if you had some 
prior information on all the subjects that can classify them into high/medium/low risk. 
Survey sampling principles come into play for selection and the xbar at each time is 
replaced with an appropriate weighted survey estimate.  See various papers by Brian Langholz.
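
The mistake described at the start of point 3 can be checked with a small simulation.
This is a sketch under made-up assumptions (true coefficient 0.5, censoring at time 10,
two controls per case, and a hypothetical helper "draw_sets"); it is not taken from any
package.  It draws controls once from the full risk set and once from the risk set with
future cases removed, then compares both fits with the full-cohort Cox model.

```r
library(survival)

set.seed(3)
n <- 3000
full <- data.frame(id = seq_len(n), x = rnorm(n))
t_event <- rexp(n, rate = 0.1 * exp(0.5 * full$x))  # assumed true coefficient 0.5
full$status <- as.integer(t_event < 10)             # censor everyone at time 10
full$time <- pmin(t_event, 10)

# Hypothetical helper: one matched set per death, m controls from the risk set
draw_sets <- function(full, m, exclude_future_cases) {
  case_id <- full$id[full$status == 1]
  out <- lapply(seq_along(case_id), function(i) {
    ok <- full$time >= full$time[case_id[i]] & full$id != case_id[i]
    # the mistake: drop anyone who will have an event later
    if (exclude_future_cases) ok <- ok & full$status == 0
    pool <- full$id[ok]
    ids <- c(case_id[i], pool[sample.int(length(pool), min(m, length(pool)))])
    data.frame(set = i, id = ids, x = full$x[ids],
               case = as.integer(ids == case_id[i]))
  })
  do.call(rbind, out)
}

right <- draw_sets(full, 2, exclude_future_cases = FALSE)
wrong <- draw_sets(full, 2, exclude_future_cases = TRUE)

fit_full <- coxph(Surv(time, status) ~ x, data = full)
fit_right <- clogit(case ~ x + strata(set), data = right)
fit_wrong <- clogit(case ~ x + strata(set), data = wrong)
print(c(full_cohort = unname(coef(fit_full)),
        right = unname(coef(fit_right)),
        wrong = unname(coef(fit_wrong))))
```

In a setup like this, with events common, the "wrong" estimate would be expected to sit
noticeably further from the full-cohort value than the "right" one does.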

Terry T

On 08/13/2014 07:26 AM, John Pura wrote:
> Hi Dr. Therneau,
> The original question on the forum was:
> My problem was how to build a Cox model for the matched data (1:n) with
> replacement. Usually, we can use stratified Cox regression model when the
> data were matched without replacement. However, if the data were matched
> with replacement, due to the re-use of subjects, we should give a weight
> for each pair, then how to incorporate this weight into a Cox model. I also
> checked the "clogit" function in survival package, it seems suitable to the
> logistic model for the matched data with replacement, rather than Cox
> model. Because it sets the time to a constant. Anyone can give me some
> suggestions?
> I’m facing a very similar situation, in which I have multiple controls to multiple cases.
> How would I go about taking that dependency into account in a Cox model? Is this weighting
> appropriate and to get robust sandwich estimates, can I take my id variable to be the id
> for the unique cases?
> Thanks,
> John

> On 08/13/2014 05:00 AM, John Purda wrote:
>> I am curious about this problem as well. How do you go about creating the weights for each pair, and are you suggesting that we can just incorporate a weight statement in the model as opposed to the strata statement? And Dr. Therneau, let's say I have 140 cases matched with replacement to 2 controls. Is my id variable the number of cases?
>   The above has an incorrect assumption that I notice ALL survival questions on the list
> -- which was false in this case.  Could you clue me in as to the original question and
> discussion -- assuming that you want "Dr Therneau" to respond intelligently :-)
> Terry T.
