[Rd] match and unique

Therneau, Terry M., Ph.D. therneau at mayo.edu
Wed Mar 16 16:03:58 CET 2016


Is the phrase  "index <- match(x, sort(unique(x)))" reliable, in the sense that it will 
never return NA?

Context: Calculation of survival curves involves the concept of unique death times.  I've 
had reported cases in the past where survfit failed, and it was due to the fact that two 
"differ by machine precision" values would sometimes match and sometimes not, depending on 
how I compared them.  I've dealt with those piecemeal in the past, but am going to do a 
code review and make sure that I do things consistently throughout the survival package.  
The basic plan will be to change time to an integer, do all the work, then restore labels 
at the end.  The above line is one candidate for the first step.
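
As a minimal sketch of that plan (the times below are made up for illustration, and the names utime, index and restored are hypothetical, not taken from the survival package):

```r
# Hypothetical death times; 0.1 + 0.2 and 0.3 differ only at machine precision
x <- c(2.5, 1.0, 2.5, 0.1 + 0.2, 0.3)

utime <- sort(unique(x))    # distinct times, kept as labels for the final step
index <- match(x, utime)    # integer code for each observation

# Restore the labels at the end
restored <- utime[index]

stopifnot(!anyNA(index), identical(restored, x))
```

Here unique() and match() both compare the doubles exactly, so the two near-equal values stay distinct and each finds its own entry in utime. One caveat I am sure of: sort() drops missing values by default, so an NA in x would itself come back as NA in index.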

An alternative is index <- as.numeric(factor(x)), with as.numeric(levels(factor(x))) as 
the final labeling step.  This is a more severe rounding, is it not?  But perhaps it is 
preferable? The KM branch of the current survfit routine does this, and I've had one user 
report a bug in that
     library(survival)
     x <- runif(20)
     fit <- survfit(Surv(x) ~ 1)
     summary(fit, times = x)
will produce lines with 0, 1 or 2 events when "they all should be 1".
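
A small sketch of why the factor route rounds more severely, again with made-up values: factor() builds its levels from as.character(x), which keeps about 15 significant digits, so values that differ only in the last bits can collapse into one level.

```r
# Hypothetical values; the first two differ only at machine precision
x <- c(0.1 + 0.2, 0.3, 1)

x[1] == x[2]                   # FALSE: the doubles are not identical
n_unique <- length(unique(x))  # 3 distinct values under exact comparison

f <- factor(x)                 # levels come from as.character(x)
n_levels <- nlevels(f)         # 2: the two near-equal values share "0.3"
index <- as.numeric(f)         # 1 1 2
labels <- as.numeric(levels(f))  # 0.3 1.0, the rounded labels
```

Under this coding the first two observations become tied, whereas match(x, sort(unique(x))) keeps them apart.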

The same issue just came up in an rpart example that was sent to me.  For coxph models it may 
only be a matter of time.

Suggestions and opinions are welcome.

Terry Therneau


