[R] prediction based on conditional logistic regression clogit

Tue Jun 17 02:48:28 CEST 2014

peter dalgaard <pdalgd <at> gmail.com> writes:

> 
> 
> On 16 Jun 2014, at 05:22 , array chip <arrayprofile <at> yahoo.com> wrote:
> 

> > Hi, I am using clogit() from survival package to do conditional
> > logistic regression. I also need to make prediction on an
> > independent dataset to calculate predicted probability. Here is an
> > example:

[snip]

> > Can anyone suggest what's wrong here?
> > 

>  The direct cause is that clogit() works by using the fact that the
> likelihood is equivalent to a coxph() likelihood with stratification
> and all observation lengths set to 1. Therefore the analysis is
> formally on Surv(rep(1, 150L), status) and that goes belly-up if you
> apply the same formula to a data set of different length.

>  However, notice that there is no such thing as predict.clogit(), so
> you are attempting predict.coxph() on a purely formal Cox model. It
> is unclear to what extent predicted values, in the sense of coxph()
> are compatible with predictions in conditional logit models.

> 
> I'm rusty on this, but I think what you want is something like
> 
> m <- model.matrix(~ x1 + x2 - 1, data=dat.test)
> pp <- exp(m %*% coef(fit))
> pps <- ave(pp, dat.test$set, FUN=sum)
> pp/pps
> 

> i.e. the conditional probability that an observation is a case given
> covariates and that there is on case in each set (in the data given,
> you have sets of three with one being a case, so all predicted
> probabilities are close to 0.33). For more general matched sets, I'm
> not really sure what one wants. Real experts are welcome to chime
> in.

For the general situation of n cases in a stratum of size N, you want the
probability that the unit in question is one of n units drawn from a
stratum of size N without replacement with unequal probabilities of
selection over the units.

I am *not* an expert on that, but there is plenty written on it.

 Horvitz, Daniel G., and Donovan J. Thompson. "A generalization of
 sampling without replacement from a finite universe." Journal of the
 American Statistical Association 47.260 (1952): 663-685.

is a place to start.

The probability in question is a sum over the

    factorial(n)*choose(N-1,n-1)) 

elements corresponding to the number of samples (and orders) that
include a chosen element.

Of course, for n=1 there is just the one element, pp/pps.

HTH,

Chuck