[R] Processing time on clogit
Terry Therneau
therneau at mayo.edu
Thu Dec 22 18:24:15 CET 2011
--- begin included message ---
I'm trying to run a conditional logistic regression in R (2.14.0) using
clogit from the survival package. The dataset I have is relatively small
(300 observations) with 25 matched strata- there are roughly 2 controls
for
each case, and some strata have multiple case/control groups. When I try
to
fit a very simple model with a binary outcome and a single continuous
exposure R seems to freeze for a while, and 30 minutes later I receive
the
results. However, when I run the exact same conditional logistic
regression
model in STATA 10, the exact same answers are produced in <1 second.
(Same
coefficients, LR test resutls, standard errors, etc.). I just tried
running
an expanded model with covariates that I'd like to control for, but R
has
been unresponsive and I doubt it will resolve itself anytime soon. The
syntax I'm using is:
library(survival)
clogit(binary_out~contin_exp + strata(id), data=data)
-------------- end inclusion --------------
Quick answer: use method='approximate'
Long answer: A stratified Cox model with time=constant,
status:1=case,0=control, and using the "exact partial likelihood" has
exactly the same likelihood formula as a conditional logistic
regression. I've never settled my mind on whether this equivalence is a
deep statistical connection or just lucky algebra. Nevertheless, the
clogit routine does it's work by creating a dummy variable of times (all
1, but any constant would do) and then calling coxph.
In coxph, the computation of the exact partial likelihood can be very,
very, very slow. If a particular strata had say 10 events and 20
subjects it has to add up a denominator that involves all possible
choices of 10 out of 20, which is 20!/(10! 10!) = 184756 terms. Most of
the time conditional logisitic is run on case/control pairs or triplets
(case + 2 controls) which involves choosing 1 out of n and this issue
does not arise.
There is a very clever but complex algorithm that involves nested
subsets, I forget the authors, which some packages have implemented and
which speeds this up considerably. From your comments above I'm guessing
Stata is on that list. The survival package in R is not.
a. The case comes up rarely, and when it does a coxph call using the
'efron' approximation is usually very close. A model with the 'breslow'
approx is usually ok. (The clogit routine invokes exact or Breslow, I
think for historical reasons, but it is one of few parts of survival
that I didn't write so I can't be definitive.)
b. Yup, it should be added someday, and it's on my list. But there
are 10-15 other things ahead of it, and it's been that way for 15 years.
Volunteers welcomed.
Terry Therneau
More information about the R-help
mailing list