[R] coxph linear.predictors

Tue Nov 2 21:27:34 CET 2010

Re: 1. X*beta  != linear.predictor.  

The equality is stated in three different help docs, which is misleading, especially in light of the way glm is set up. I felt like was wrestling with SAS :-)
The relative risk was the original idea behind cox regression, but it can be used for many non-relative purposes. If we want to calculate death probability in each period, then lp is no longer shift invariant.

Re: 2. Survfit is too slow.
It seems that the implementation follows the procedure in the original Cox paper, which calls iterative optimization for each death time.
My subjects are mortgages and both the estimation and the prediction samples are several hundred thousand. The call appears to recalculate/optimize everything even though only the $surv changes. Since each subject belongs to a single strata, most of the calculations are redundant.
I am not much of a programmer and could never figure out how to use the R profiler, so cannot be exact here, but the simple exponentiation takes no time and survfit takes several secs for each subject.
So I did:

survlong <- survfit(modlong) # a single call suffices
bl1 <- c(1,cumsum(survlong$strata)+1)
bl2 <- cumsum(survlong$strata) # get the start and end of each strata
for (jj in 1:nrow(newapp)){

  strat=as.integer(newapp[jj,"termfac"])
  surv <- survlong$surv[(b1[strat]):(b2[strat])] # extract the strata
  risk <- predict(modlong,new=newapp[jj,],type="risk")# it seems there is no
                                                      # optimization here
  newsurv <- surv^risk # we done
... rest of code
}

As a package maintainer, you have to decide whether including any of the above and below is useful or users can figure out things on their own. Or maybe survfit can be made smart and subsequent calls on the same model will use the first call to survfit?? It's your call :-)

Kind regards

Stephen B
-----Original Message-----
From: Terry Therneau [mailto:therneau at mayo.edu] 
Sent: Thursday, October 28, 2010 6:39 PM
To: Bond, Stephen; David Winsemius
Cc: r-help at r-project.org
Subject: Re: [R] coxph linear.predictors

Gentlemen,
  I read R-news in batch mode so I'm often a day behind.  Let me try to
answer some of the questions.

 1. X*beta  != linear.predictor.  
I'm sorry if the documentation isn't all it could be.  Between the book,
tech report, and help I've written about 400 pages, but this particular
topic isn't yet in it.  The final snipe about being "opaque like SAS"
was really unfair.
The Cox model is a relative risk model, if lp is a linear predictor then
so is lp +c for any constant; they are equally good and equally valid.
The linear.predictor component in a coxph fit is (X-means) * beta.  The
computation exp(lp) occurs multiple times downstream and this keep the
exp function from overflowing when there is something like a Date object
as a predictor.  Adding this constant changes not a single downstream
calcuation.

2. Survfit is too slow.
 I'd like to hear more about this.  My work mostly involves modest data
sets so perhaps I haven't seen it.  Accuracy and maintainability have
been my first worries.

3. Baseline survival.
 Let xbase be a particular set of values for the x covariates (one for
each).  The survival curve for a given xbase is obtained from survfit
   fit <- coxph(....
   sfit <- survfit(fit, newdata=xbase)
   chaz <- -log(sfit$surv)  #cumulative hazard
(The xbase vector will need to have variable names for the function to
know which value goes to which of course).

The cumulative hazard for any other subject will be 
   newhaz <- chaz * exp(fit$coef%*% (x-xbase))
There is not a simple transformation of the standard error from one fit
to another, however.  You will need to call survfit with a data frame
for newdata, which will return one curve per row with the proper values.

In my view there is no such thing as "A" baseline survival curve.  Any
xbase you chose is a baseline.  However, it is wise to choose something
near the center of the data space in order to avoid numeric problems
with the exp function above.  I would never ever chose a vector of
zeros, although some text books do -- it saves them about 8 characters
of typing in the newhaz formula above.

Terry Therneau