[Rd] match and unique

Thu Mar 17 21:20:57 CET 2016

Hi Terry,

On 03/16/2016 08:03 AM, Therneau, Terry M., Ph.D. wrote:
> Is the phrase  "index <- match(x, sort(unique(x)))" reliable, in the
> sense that it will never return NA?

This is assuming that match() and unique() will never disagree on
equality between 2 floating point values. I believe they share some
code internally (same hashing routine?), so maybe it's reliable.

Anyway, it's always preferable to not rely on this kind of assumption.

A safer thing to do is to use rank():

   r <- rank(x, ties.method="min")  # could use "max"

Think of 'r' as a unique ID assigned to each value in 'x'. This ID takes
its values in the (1,length(x)) range but we want it to take its values
in the (1,length(unique(x))) range:

   ID_remapping <- cumsum(tabulate(r, nbins=length(r)) != 0L)
   index <- ID_remapping[r]

'index' will be the same as 'match(x, sort(unique(x))' but doesn't rely
on the assumption that match() and unique() agree on equality between
2 floating point values.

Unfortunately rank() is very slow, much slower than sort(). Here is a
faster solution based on sort.list(x, na.last=NA, method="quick"):

   assignID <- function(x)
   {
     oo <- sort.list(x, na.last=NA, method="quick")
     sorted <- x[oo]
     is_unique <- c(TRUE, sorted[-length(sorted)] != sorted[-1L])
     sorted_ID <- cumsum(is_unique)
     ID <- integer(length(x))
     ID[oo] <- sorted_ID
     ID
   }

'assignID(x)' is also slightly faster than 'match(x, sort(unique(x)))':

   x <- runif(5000000)
   system.time(index1 <- match(x, sort(unique(x))))
   #   user  system elapsed
   #  2.170   0.552   2.725

   system.time(index2 <- assignID(x))
   #   user  system elapsed
   #  0.885   0.032   0.917

   identical(index1, index2)
   # [1] TRUE

Cheers,
H.

>
> Context: Calculation of survival curves involves the concept of unique
> death times.  I've had reported cases in the past where survfit failed,
> and it was due to the fact that two "differ by machine precision" values
> would sometimes match and sometimes not, depending on how I compared
> them.  I've dealt with those piecemeal in the past, but am going to do a
> code review and make sure that I do things consistently throughout the
> survival package. The basic plan will be to change time to an integer,
> do all the work, then restore labels at the end.  The above line is one
> candidate for the first step.
>
> An alternative is index <- as.numeric(factor(x)), with
> as.numeric(levels(factor(x))) as the final labeling step.  This is a
> more severe rounding, is it not?  But perhaps it is preferable? The KM
> branch of the current survfit routine does this, and I've had one user
> report a bug in that
>      x <- runif(20)
>      fit <- survfit(Surv(x) ~1)
>      summary(fit, times=x)
> will produce lines with 0, 1 or 2 events when "they all should be 1".
>
> The same issue just came up in an rpart example, sent to me.  For coxph
> models is may only be a matter of time.
>
> Suggestions and opinions are welcome.
>
> Terry Therneau
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319