[Rd] Proposal: Generalizing unique() and duplicated()

Tue, 6 Feb 2001 11:25:39 +0100

Prof. Ripley wrote on r-help:

> Completely distinct row vectors?  Take a look at the code of
> merge.data.frame.  Something like
>
>     bx <- matrix(as.character(a), nrow(a))
>     bx <- drop(apply(bx, 1, function(x) paste(x, collapse = "\r")))
>     length(unique(bx))
>
> This turns each row into a single character string, and counts the unique
> ones.
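
To make the quoted idea concrete, here is a small worked sketch. The
matrix `a` is made up for illustration and is not from the original
r-help thread; rows 1 and 3 are deliberately identical.

```r
# Hypothetical example matrix: rows 1 and 3 are identical.
a <- matrix(c(1, 2,
              3, 4,
              1, 2), ncol = 2, byrow = TRUE)

# Convert to character while keeping the row structure.
bx <- matrix(as.character(a), nrow(a))

# Collapse each row into a single string, using "\r" as the separator.
bx <- drop(apply(bx, 1, function(x) paste(x, collapse = "\r")))

# Count the distinct row strings.
n_distinct_rows <- length(unique(bx))
```

Note that the approach assumes no element itself contains "\r";
otherwise two different rows could collapse to the same string.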

Hmmm... couldn't one build on this in order to generalize the 
unique() function?

I'm asking because I once tried to use unique() on a matrix (to collapse
duplicate rows) and found that both it and duplicated() work only on
vectors. I think a generalization, at least for matrices and simple
data.frames, would be useful.

I tried my hand at it and came up with this:

----------------------------------------------------

"unique.default" <- get("unique", pos="package:base")    # old version becomes
                                                         # default behaviour
"unique" <- function(object, ...)
{
   if (data.class(object)=="matrix")
       return(unique.matrix(object, ...))
   else
       UseMethod("unique")      # doesn't seem to work for matrices, hence 
}                               # the condition

"duplicated.default" <- get("duplicated", pos="package:base")	

"duplicated" <- function(object, ...)
{
   if (data.class(object)=="matrix")
       return(duplicated.matrix(object, ...))
   else
       UseMethod("duplicated")  
}

"duplicated.matrix" <-
  function(mat, MARGIN=1)    # defaulting to work on rows
{
  strvect <- drop(apply(mat, MARGIN, function(x) paste(x, collapse = "\r")))
  return(duplicated(strvect))
}

"unique.matrix" <-
  function(mat, MARGIN=1)    # defaulting to work on rows
{
  dup <- duplicated(mat, MARGIN)
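  # drop=FALSE keeps the result a matrix even if only one row/column is left
  return(if (MARGIN==1) mat[!dup, , drop=FALSE] else mat[, !dup, drop=FALSE])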
}

"duplicated.data.frame" <-
  function(df, MARGIN=1)
{
  strvect <- drop(apply(as.matrix(df), MARGIN,
                        function(x) paste(x, collapse = "\r")))
  duplicated(strvect)
}

"unique.data.frame" <-
  function(df, MARGIN=1)
{
  dup <- duplicated(df, MARGIN)
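  # drop=FALSE keeps the result a data.frame even if only one column is left
  return(if (MARGIN==1) df[!dup, , drop=FALSE] else df[, !dup, drop=FALSE])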
}

----------------------------------------------------
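
As a quick illustration of the intended behaviour, here is a
self-contained sketch. The helpers `dup_margin` and `uniq_margin` are
hypothetical names, chosen so that nothing in base is masked; they
condense the row/column logic of the proposal and are not meant as a
replacement for it. The test matrix `m` is made up.

```r
# Hypothetical helper: flag duplicated rows (MARGIN = 1) or columns (MARGIN = 2).
dup_margin <- function(mat, MARGIN = 1) {
  keys <- apply(mat, MARGIN, function(x) paste(x, collapse = "\r"))
  duplicated(keys)
}

# Hypothetical helper: drop the duplicated rows or columns.
uniq_margin <- function(mat, MARGIN = 1) {
  dup <- dup_margin(mat, MARGIN)
  # drop = FALSE keeps the result a matrix even if one row/column remains
  if (MARGIN == 1) mat[!dup, , drop = FALSE] else mat[, !dup, drop = FALSE]
}

# Row 3 repeats row 1, and column 3 repeats column 1.
m <- matrix(c(1, 2, 1,
              3, 4, 3,
              1, 2, 1), nrow = 3, byrow = TRUE)

dup_rows <- dup_margin(m, 1)
dup_cols <- dup_margin(m, 2)
u_rows   <- uniq_margin(m, 1)
u_cols   <- uniq_margin(m, 2)
```

With this input, only the third row and the third column should be
flagged, so `u_rows` keeps two rows and `u_cols` keeps two columns.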

I couldn't figure out how to generalize to more than two dimensions (more 
accurately, how to subset in the dimension given by the variable MARGIN). 
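
One approach that might work for the subsetting problem, offered only
as a sketch: build the index list programmatically and pass it to "["
via do.call(). The names `subset_along` and `uniq_along` are
hypothetical, and the example array is made up.

```r
# Hypothetical helper: keep only the slices of `arr` along dimension MARGIN
# for which `keep` is TRUE.
subset_along <- function(arr, MARGIN, keep) {
  # Start with TRUE for every dimension, i.e. "select everything".
  idx <- rep(list(TRUE), length(dim(arr)))
  idx[[MARGIN]] <- keep
  # Equivalent to arr[TRUE, ..., keep, ..., TRUE, drop = FALSE]
  do.call("[", c(list(arr), idx, list(drop = FALSE)))
}

# Hypothetical helper: remove duplicated slices along one dimension.
uniq_along <- function(arr, MARGIN = 1) {
  keys <- apply(arr, MARGIN, function(x) paste(x, collapse = "\r"))
  subset_along(arr, MARGIN, !duplicated(keys))
}

# 2 x 2 x 3 array whose third slice repeats the first.
arr <- array(c(1:4, 5:8, 1:4), dim = c(2, 2, 3))
res <- uniq_along(arr, 3)
```

Here `res` should be the 2 x 2 x 2 array made of the first two slices.
The sketch handles a single MARGIN only.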

Does anybody else consider this useful?

Cheers

Kaspar Pflugshaupt