[Rd] duplicates() function

Sat Apr 9 20:09:34 CEST 2011

On Fri, Apr 08, 2011 at 10:59:10AM -0400, Duncan Murdoch wrote:
> I need a function which is similar to duplicated(), but instead of 
> returning TRUE/FALSE, returns indices of which element was duplicated.  
> That is,
> 
> > x <- c(9,7,9,3,7)
> > duplicated(x)
> [1] FALSE FALSE  TRUE FALSE TRUE
> 
> > duplicates(x)
> [1] NA NA  1 NA  2
> 
> (so that I know that element 3 is a duplicate of element 1, and element 
> 5 is a duplicate of element 2, whereas the others were not duplicated 
> according to our definition.)
> 
> Is there a simple way to write this function?

A possible strategy is to use sorting. In a sorted matrix
or data frame, rows that are duplicates of one another form
consecutive blocks. These blocks can be identified using
!duplicated(), which marks the first row of each block.
Since order() leaves ties in their original order (the
sort is stable), mapping the blocks back to the original
order sends the first row of each block to the first
occurrence of that row in the original data.
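
As an illustration of the blocks described above, here is a small sketch on the data from the question. The names `sorted` and `block.start` are introduced only for this illustration, and `drop = FALSE` is added as a precaution so that the subset stays two-dimensional.

```r
# Example data from the question, stored as a two-column matrix.
dat <- cbind(1, c(9, 7, 9, 3, 7))

# Permutation that sorts the rows; order() leaves ties in original order.
s <- do.call("order", as.data.frame(dat))
s                      # 4 2 5 1 3

# Rows in sorted order: equal rows are now adjacent.
sorted <- dat[s, , drop = FALSE]
sorted[, 2]            # 3 7 7 9 9

# TRUE marks the first row of each block of equal rows.
block.start <- !duplicated(sorted)
block.start            # TRUE TRUE FALSE TRUE FALSE
```

The two rows with value 7 appear in sorted order as original rows 2 and 5, in that order, which is what the stability of the ordering guarantees.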

One possible implementation is as follows.

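  duplicates <- function(dat)
  {
      # permutation that sorts the rows; ties keep their original order
      s <- do.call("order", as.data.frame(dat))
      # TRUE at the first row of each block of equal rows in sorted order;
      # drop = FALSE keeps single-column or single-row input two-dimensional
      non.dup <- !duplicated(dat[s, , drop = FALSE])
      # original index of the first occurrence of each distinct row
      orig.ind <- s[non.dup]
      # cumsum(non.dup) numbers the blocks, so each index is repeated per block
      first.occ <- orig.ind[cumsum(non.dup)]
      # first occurrences are not duplicates
      first.occ[non.dup] <- NA
      # order(s) is the inverse permutation of s: back to the original order
      first.occ[order(s)]
  }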

  x <- cbind(1, c(9,7,9,3,7))
  duplicates(x)
  [1] NA NA  1 NA  2
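
One way to sanity-check such a result is to compare it against an independent computation. The following is only a sketch of a different technique, not the sorting approach of this post: the hypothetical helper `duplicates.ref` pastes each row into one string key and uses match() to find the first row with the same key. It assumes that converting the values to character does not merge rows that should be distinct, which may fail for numbers differing only beyond about 15 significant digits.

```r
# Hypothetical reference helper (not part of the original post).
duplicates.ref <- function(dat)
{
    # one string key per row; "\r" is assumed not to occur in the data
    cols <- unname(as.list(as.data.frame(dat)))
    key <- do.call("paste", c(cols, sep = "\r"))
    # index of the first row carrying the same key
    first <- match(key, key)
    # a row that is its own first occurrence is not a duplicate
    first[first == seq_along(first)] <- NA
    first
}

x <- cbind(1, c(9, 7, 9, 3, 7))
duplicates.ref(x)      # NA NA 1 NA 2
```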

The line

      orig.ind <- s[non.dup]

creates a vector whose length is the number of distinct rows
of "dat". Its components are the indices, in the original
order, of the first occurrences of these rows. This relies on
the stability of the ordering.
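
To make this concrete, the sketch below computes orig.ind for the data from the question and compares it with the smallest original index of each distinct value. The name `smallest` and the use of tapply() are assumptions made for this check only; the check looks at column 2 alone because column 1 is constant in this example.

```r
dat <- cbind(1, c(9, 7, 9, 3, 7))
s <- do.call("order", as.data.frame(dat))
non.dup <- !duplicated(dat[s, , drop = FALSE])

# One entry per distinct row, listed in the sorted order of the rows.
orig.ind <- s[non.dup]
orig.ind               # 4 2 1

# Independent check: smallest original index of each distinct value of
# column 2, listed by increasing value.
smallest <- as.vector(tapply(seq_len(nrow(dat)), dat[, 2], min))
smallest               # 4 2 1
```

If the ordering were not stable, the entry for the value 7 could be 5 instead of 2, and the two vectors would differ.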

The lines

      first.occ <- orig.ind[cumsum(non.dup)]
      first.occ[non.dup] <- NA

expand orig.ind to a vector with the following property: if the
i-th row of the sorted "dat" is a duplicate, then first.occ[i] is
the index of the first row of the original "dat" that is equal to
it; otherwise first.occ[i] is NA. So the values in first.occ are
those required for the output of duplicates(), but they are in
the order of the sorted "dat". The last line

      first.occ[order(s)]

reorders the vector to the original order of the rows, since
order(s) is the inverse of the permutation s.
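
The expansion and the final reordering can be traced step by step, as in the sketch below. It uses different data from the question, chosen here for illustration: in the question's example s is 4 2 5 1 3, which happens to be its own inverse, so order(s) equals s and the inversion would not be visible. The names `inv` and `res` are introduced only for this trace.

```r
# Data chosen for illustration so that s is not its own inverse.
dat <- cbind(1, c(7, 9, 3, 9, 7))
s <- do.call("order", as.data.frame(dat))       # 3 1 5 2 4
non.dup <- !duplicated(dat[s, , drop = FALSE])  # TRUE TRUE FALSE TRUE FALSE
orig.ind <- s[non.dup]                          # 3 1 2

# cumsum(non.dup) gives the block number of every sorted row.
cumsum(non.dup)                                 # 1 2 2 3 3
first.occ <- orig.ind[cumsum(non.dup)]          # 3 1 1 2 2
first.occ[non.dup] <- NA                        # NA NA 1 NA 2

# order(s) inverts s: indexing s by it gives back 1, 2, ..., n.
inv <- order(s)                                 # 2 4 1 5 3
all(s[inv] == seq_along(s))                     # TRUE
res <- first.occ[inv]
res                                             # NA NA NA 2 1
```

In the original order, row 4 (value 9) duplicates row 2 and row 5 (value 7) duplicates row 1, which agrees with the last line of the trace.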

Petr Savicky.