[R] survfit is too slow! Looking for an alternative

Wed Feb 8 14:38:07 CET 2012

A couple of thoughts.
1. More than half the work in survfit.coxph is computing standard
errors.  If you don't need them, adding se.fit=FALSE will speed things up.
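
As a rough sketch of what that looks like (the lung data set shipped with
the survival package and the covariates age and sex are stand-ins chosen
here for illustration, not the poster's loan data):

```r
library(survival)

# Illustrative model on a data set that ships with the survival package.
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Two hypothetical subjects to predict for (values are made up).
newdata <- data.frame(age = c(50, 70), sex = c(1, 2))

# se.fit = FALSE skips the standard error computation entirely.
curves <- survfit(fit, newdata = newdata, se.fit = FALSE)

# Predicted survival at 6 and 12 months, one column per subject.
pred <- summary(curves, times = c(180, 365))$surv
print(pred)
```

How much time this saves will depend on the size of the data and the
number of rows in newdata, so it is worth timing both versions with
system.time() on a subset first.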

2. Survival curves with time dependent covariates are a complex topic.
To get the "probability of default in each month during next 2 years"
you need to create a scenario that specifies exactly what those time
dependent covariates will do over the next two years.  My book has a
long discussion on this; could I suggest you borrow a copy and read it?
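
A minimal sketch of such a scenario, assuming the heart data set from the
survival package as a stand-in for the loan data.  The variable tx, the
change point at day 60 and the two-year horizon are all invented for
illustration.  The newdata frame holds one row per time interval of a
single hypothetical subject, and the id argument tells survfit which rows
belong together:

```r
library(survival)

# Stand-in data in counting process (start, stop] format.
# tx is a numeric copy of the time dependent transplant indicator.
d <- transform(heart, tx = as.numeric(transplant == 1))
fit <- coxph(Surv(start, stop, event) ~ age + surgery + tx, data = d)

# Scenario for ONE hypothetical subject: tx is 0 up to day 60 and 1 after.
# The event column is required so the response can be built, but its
# values are not used for prediction.
scenario <- data.frame(id      = c(1, 1),
                       start   = c(0, 60),
                       stop    = c(60, 730),
                       event   = c(0, 0),
                       age     = c(0, 0),
                       surgery = c(0, 0),
                       tx      = c(0, 1))

curve <- survfit(fit, newdata = scenario, id = id, se.fit = FALSE)

# Predicted survival along this covariate path at selected days.
path_surv <- summary(curve, times = c(30, 365, 730))$surv
print(path_surv)
```

The resulting curve is only meaningful for the covariate path that was
written down; a different assumption about the future path (for the loan
problem, a different unemployment forecast) gives a different curve.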

3. For time fixed covariates there is a simple formula to get S(t;x)
from S(t; x0), i.e., if you have the predicted curve for some covariate
choice x0 you can easily derive it for any other chosen x.  That formula
doesn't work in the time dependent variable case (you can't factor
exp(x) out from under an integral of [exp(x) g(t) dt] when x is a
function of t).  Unless you want to learn a lot more math and do custom
programming, I think you are stuck with survfit.
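
For the time fixed case, the formula in point 3 can be written as
S(t; x) = S(t; x0)^exp((x - x0) * beta).  The sketch below checks it
against survfit on the lung data set from the survival package; the data
set, the covariates and the two covariate choices x0 and x are
illustrative assumptions, not part of the original question:

```r
library(survival)

# Model with time fixed covariates only.
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

x0 <- data.frame(age = 60, sex = 1)   # reference covariate choice
x  <- data.frame(age = 72, sex = 2)   # covariate choice we want a curve for

# One (slow) survfit call for the reference curve S(t; x0).
s0 <- survfit(fit, newdata = x0, se.fit = FALSE)

# Difference in linear predictors, (x - x0) * beta.
vars  <- names(coef(fit))
delta <- sum((unlist(x[vars]) - unlist(x0[vars])) * coef(fit))

# Fast rescaling: S(t; x) = S(t; x0)^exp(delta).
s_fast <- s0$surv ^ exp(delta)

# Reference answer computed directly by survfit for comparison.
s_ref <- survfit(fit, newdata = x, se.fit = FALSE)$surv
print(all.equal(s_fast, s_ref))
```

The rescaling is a single vector operation per new subject, which is why
it is cheap.  It relies on exp(x * beta) being constant in t, so it does
not carry over to the counting process model with time dependent
covariates.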

Terry Therneau

---------- begin included message ----------------
I found survfit function was very slow for a large dataset and I am
looking for an alternative way to quickly get the predicted survival
probabilities.

My historical data set is a pool of loans with monthly observed default
status for 24 months. I would like to fit the proportional hazard model
with time varying covariate such as unemployment rates and time constant
variables at loan application in a counting process format, and then use
the model to predict the probability of default in each month during
next 2 years for a pool of new loans.

I have read some posts from other R users. It sounds like using
(average survival probability)^exp((X - mean(X)) * Beta) can quickly get
the predicted survival probabilities. My predictors for the model
include both continuous variables and categorical variables, and my
dataset is in counting process format with both time varying and time
constant predictors. So how should I take the mean? I guess it's the
mean of the training data? And the denominator for the mean is the
number of observations (i.e., the number of rows of training data in the
counting process format)? What if the predictor is a categorical
variable?