[Rd] match and unique
Therneau, Terry M., Ph.D.
therneau at mayo.edu
Wed Mar 16 16:03:58 CET 2016
Is the phrase "index <- match(x, sort(unique(x)))" reliable, in the sense that it will
never return NA?
Context: Calculation of survival curves involves the concept of unique death times. I've
had reported cases in the past where survfit failed, and it was due to the fact that two
"differ by machine precision" values would sometimes match and sometimes not, depending on
how I compared them. I've dealt with those piecemeal in the past, but am going to do a
code review and make sure that I do things consistently throughout the survival package.
The basic plan will be to change time to an integer, do all the work, then restore labels
at the end. The above line is one candidate for the first step.
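As a minimal sketch of that plan, the first step and the final relabeling might look like the following. The example values and the names ux, index, counts and result are made up purely for illustration, and the sketch assumes x has no missing values.

```r
# Illustrative values only: 0.1 + 0.2 and 0.3 differ by machine precision,
# as do 1 and 1 + .Machine$double.eps.
x <- c(0.3, 0.1 + 0.2, 1, 1 + .Machine$double.eps, 0.3)

ux <- sort(unique(x))    # distinct values, compared exactly
index <- match(x, ux)    # integer code for each element of x

# Do the work on the integer codes, e.g. count events per distinct time.
counts <- tabulate(index, nbins = length(ux))

# Restore the labels at the end.
result <- data.frame(time = ux, n.event = counts)
```

In this sketch every element of x finds itself in ux, so index contains no NA and ux[index] reproduces x exactly. One case to watch: sort() drops NA by default, so a missing value in x would come back as NA in index.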
An alternative is index <- as.numeric(factor(x)), with as.numeric(levels(factor(x))) as
the final labeling step. This is a more severe rounding, is it not? But perhaps it is
preferable? The KM branch of the current survfit routine does this, and I've had one user
report a bug in that
library(survival)
x <- runif(20)
fit <- survfit(Surv(x) ~ 1)
will produce lines with 0, 1 or 2 events when "they all should be 1".
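To make the difference between the two candidates concrete, here is a small sketch with values chosen only for illustration (the names idx_match, idx_factor and labels are hypothetical). factor() goes through as.character(), which keeps 15 significant digits, so values that differ only in the last bits collapse into a single level, while match/unique keep them apart.

```r
# Illustrative values: the first two differ only by machine precision.
x <- c(0.3, 0.1 + 0.2, 0.5)

# Candidate 1: exact matching keeps the two near-equal values distinct.
idx_match <- match(x, sort(unique(x)))

# Candidate 2: factor() rounds via as.character(), merging them.
f <- factor(x)
idx_factor <- as.numeric(f)
labels <- as.numeric(levels(f))   # final labeling step

# The two codings disagree on how many distinct times there are.
n_match <- length(unique(idx_match))
n_factor <- nlevels(f)
```

If one part of a routine uses the first coding and another part uses the second, the same data can end up with different counts of distinct times, which is the kind of inconsistency the code review is meant to remove.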
The same issue just came up in an rpart example that was sent to me. For coxph models it may only
be a matter of time.
Suggestions and opinions are welcome.