[R] survival analysis using rpart

Walter345 walter345 at yahoo.com
Wed Feb 28 16:23:21 CET 2007



Thanks a lot for the reply, Terry!

Concerning my question #3, I was thinking about the following scenario:

Suppose that you have a data set of survival times. We run rpart on two
different subsets of its columns, where the columns contain the covariates.
Say, for instance, we build the model first using columns #1 to #10 (call it
model A), and second using columns #11 to #20 (call it model B). For both
models, we perform 10-fold cross-validation.

Let’s consider one case only. Let this case be censored after 30 time units.
Model A puts this case into a leaf with an estimated event rate of 3.56,
while model B puts it into a leaf with an estimated rate of 0.12. I wish to
use this rate to predict the outcome of the case (outcome = event or
outcome = non-event). A rate > 1 is associated with a higher chance of an
event, whereas a rate < 1 is associated with a lower chance. Hence, model A
predicts an event for this case, while model B does not. Since the case was
censored, i.e. no event was observed, model B makes a better prediction than
A (for this case).

Does this make sense, or am I misinterpreting the event rate? And is it
reasonable to choose a threshold of 1?
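
A minimal sketch of the comparison described above, in case it helps to make
the question concrete. It assumes the stagec data that ships with the rpart
package; the split of the covariates into two groups, the seed, the case
index i, and the use of the last column of the xpred.rpart output are
arbitrary choices made for illustration only. The threshold of 1 is simply
the one asked about here, not an established rule.

```r
library(rpart)
library(survival)   # provides Surv()

# stagec ships with rpart; the covariate split below is purely illustrative
data(stagec, package = "rpart")

set.seed(1)  # cross-validation folds are assigned at random

# "Model A" and "model B": same response, disjoint covariate subsets
fit_a <- rpart(Surv(pgtime, pgstat) ~ age + eet + g2, data = stagec)
fit_b <- rpart(Surv(pgtime, pgstat) ~ grade + gleason + ploidy, data = stagec)

# Cross-validated predictions: one row per patient,
# one column per complexity-parameter value
xp_a <- xpred.rpart(fit_a, xval = 10)
xp_b <- xpred.rpart(fit_b, xval = 10)

# Out-of-fold estimated rate for case i, taken from the last column
# (an arbitrary choice for this sketch)
i <- 1
rate_a <- unname(xp_a[i, ncol(xp_a)])
rate_b <- unname(xp_b[i, ncol(xp_b)])

# Tentative rule from this message: predict "event" if rate > 1
event_a <- rate_a > 1
event_b <- rate_b > 1

# Compare against the observed status (1 = event, 0 = censored)
data.frame(model = c("A", "B"),
           rate = c(rate_a, rate_b),
           predicted_event = c(event_a, event_b),
           observed_event = stagec$pgstat[i] == 1)
```

Because the folds are random, the two rates will change with the seed, which
is part of why it is unclear to me how to compare them.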

Thanks a lot for all comments!
Walter






Walter345 wrote:
> 
> Hello,
> 
> I use rpart to predict survival time and have a problem in interpreting
> the output of “estimated rate”. Here is an example of what I do:
> 
>> stagec <-
>> read.table("http://www.stanford.edu/class/stats202/DATA/stagec.data", 
>> col.names=c("pgtime", "pgstat", "age","eet", "g2", "grade", "gleason",
>> "ploidy"))
> 
>> fit <- rpart(Surv(pgtime, pgstat) ~ age + eet + g2 + grade + gleason +
>> ploidy, data=stagec)
> 
> 
> Result:
> 
> 1) root 146 195.411600 1.0000000  
>    2) grade< 2.5 61  45.021520 0.3624701  
>      4) g2< 11.36 33   9.120116 0.1225562 *
>      5) g2>=11.36 28  27.804100 0.7335298  
>       10) gleason< 5.5 20  14.376900 0.5292190 *
>       11) gleason>=5.5 8  11.201470 1.3083680 *
>    3) grade>=2.5 85 125.327400 1.6190620  
>      6) age>=56.5 75 104.154700 1.4287310  
>       12) gleason< 7.5 50  66.701410 1.1431320 *
>       13) gleason>=7.5 25  33.993130 2.0355220  
>         26) g2>=15.29 13  16.555970 1.3494740 *
>         27) g2< 15.29 12  14.220260 2.9210480 *
>      7) age< 56.5 10  15.522810 3.1977430 *
> 
> Let’s look at the terminal node 4:
> 
> #	PGTIME	PGSTAT	AGE	EET	G2	GRADE	GLEASON	PLOIDY
> 1	8.657084	0	70	1	4.43	1	3	1
> 2	16.70088	0	56	2	5.29	1	3	1
> 3	3.162217	1	62	2	3.57	2	4	1
> 4	10.20123	0	63	2	5.14	2	5	1
> 5	4.479124	0	63	2	5.75	2	5	1
> 6	6.516084	0	66	2	5.92	2	5	1
> 7	4.936345	0	67	2	6.41	2	5	1
> 8	10.79808	0	72	1	6.68	2	NA	1
> 9	9.174537	0	62	1	6.74	2	5	1
> 10	10.87474	0	72	2	6.8	2	5	1
> 11	7.028062	0	52	2	7.15	2	7	1
> 12	11.36481	0	59	2	7.61	2	5	1
> 13	10.17659	0	64	1	7.61	2	NA	1
> 14	6.96783	0	67	2	7.78	2	6	1
> 15	10.61738	0	55	2	7.81	2	5	1
> 16	6.510609	0	70	1	7.88	2	6	1
> 17	10.36276	0	55	2	8.1	2	5	1
> 18	6.694045	0	54	2	8.11	2	4	1
> 19	11.718	0	61	2	8.4	2	5	1
> 20	7.301847	0	69	2	8.46	2	5	1
> 21	6.067077	0	69	2	8.58	2	6	1
> 22	8.353182	0	59	2	8.76	2	6	1
> 23	5.541409	0	59	1	9.01	2	5	1
> 24	5.492128	0	61	2	9.42	2	5	1
> 25	7.208761	0	63	1	9.76	2	5	1
> 26	6.004106	0	52	2	9.9	2	4	1
> 27	5.664613	0	71	1	10.16	2	6	1
> 28	6.130047	0	64	2	10.26	2	4	1
> 29	9.812457	0	64	1	10.51	2	5	1
> 30	6.275154	0	62	2	10.82	2	6	1
> 31	9.253935	0	61	2	11.23	2	5	1
> 32	5.201916	0	54	2	11.35	2	6	1
> 33	6.22861	0	65	2	11.35	2	5	1
> 
> Here we have 33 observations and 1 event. The “estimated rate” is
> 0.1225562. My questions are:
> 
> (1) Is the “estimated rate” the estimated hazard rate ratio? 
> (2) How does rpart calculate this rate?
> (3) Suppose I use xpred.rpart(fit, xval=10) to perform 10-fold
> cross-validation using (a) the complete stagec data set and (b) only a
> subset of it, say, using the columns Age, EET, and G2 only. For the i-th
> patient, I am likely to obtain a different estimated rate. How can I
> meaningfully compare both rates? How can I say which one is “better”? 
> 
> Thanks a lot for all comments!
> Walter
> 
> 
> 
> 
> 
> 
