[R] Tagging identical rows of a matrix

Liaw, Andy andy_liaw at merck.com
Sat May 15 03:20:21 CEST 2004


The problem with interaction() is that it doesn't scale as the number of
columns increases:

> set.seed(1)
> mat2 <- matrix(sample(20,5e4,rep=T), 1e4)
> invisible(gc()); system.time(z0 <- f0(mat2))
[1] 1.58 0.01 1.85   NA   NA
> invisible(gc()); system.time(z1 <- f1(mat2))
[1] 1.57 0.00 1.66   NA   NA
> invisible(gc()); system.time(z2 <- f2g(mat2))
[1] 34.14  0.60 57.45    NA    NA

[f2g is a slightly modified version of f2 that accepts any number of
columns:
f2g <- function(mat) as.numeric(interaction(as.data.frame(mat), drop=T))]

With 10 columns in the matrix, f0 and f1 ran fine in under 10 seconds, but
f2g started thrashing and eventually ran out of memory.  If you look at
how interaction() is written you'll quickly see why...
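
A rough sketch of the arithmetic behind this, assuming (it is not spelled out
above) that interaction() builds the full cross-product of the factor levels
before drop=T discards the unused ones; newer versions of R may behave
differently. The variable names are illustrative only:

```r
# Assumption: every column contributes 20 levels (values sampled from 1:20)
# and the level set is the full cross-product across columns.
n_levels_per_col <- 20
n_cols <- c(2, 5, 10)
n_rows <- 1e4

# Number of level combinations that would have to be enumerated
potential <- n_levels_per_col ^ n_cols

# Distinct rows actually present can never exceed the number of rows
summary_tab <- data.frame(columns          = n_cols,
                          potential_levels = potential,
                          max_distinct     = pmin(potential, n_rows))
print(summary_tab)
```

Under that assumption the 5-column matrix already implies 3.2 million
potential levels for at most 10000 distinct rows, and 10 columns imply about
1e13, which would be consistent with the thrashing seen here.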

Andy

> From: Gabor Grothendieck
> 
> Waichler, Scott R <Scott.Waichler <at> pnl.gov> writes:
> 
> > 
> > Thanks to all of you who responded to my help request.
> > Here is the very efficient upshot of your advice:
> > 
> > > mat2 <- apply(mat, 1, paste, collapse=":")
> > > vec <- match(mat2, unique(mat2))
> > > vec
> > [1] 1 2 1 1 2 3
> > 
> > 
> > P.S.  I found that Andy Liaw's method didn't preserve the
> > index order that I wanted; it yields
> > 
> > 2 3 2 2 3 1
> > 
> > To get the order of integers I was looking for required an
> > invocation of unique:
> > 
> > as.numeric(factor(apply(mat, 1, paste, collapse=":"),
> >                   levels=unique(apply(mat, 1, paste, collapse=":"))))
> > 
> > But the first method above is obviously cleaner and is twice
> > as fast, only 9 seconds for a 100000 row matrix on an ordinary PC.  
> 
> The interaction solution gives an identical result, is shorter and
> is one or two orders of magnitude faster.  Here is a 
> comparison of the three:
> 
> R> set.seed(1)
> R> mat <- matrix(sample(20,100000,rep=T),50000)
> R> 
> R> f0 <- function(mat) {
> + mat2 <- apply(mat, 1, paste, collapse=":");
> + match(mat2, unique(mat2))
> + }
> R> 
> R> 
> R> f1 <- function(mat) { z <- apply(mat, 1, paste, collapse=":")
> + as.numeric(factor(z,levels=unique(z)))
> + }
> R> 
> R> f2 <- function(mat) as.numeric(interaction(mat[,1],mat[,2],drop=T))
> R> 
> R> dummy <- gc(); system.time(z0 <- f0(mat))
> [1] 5.24 0.02 5.52   NA   NA
> R> dummy <- gc(); system.time(z1 <- f1(mat))
> [1] 5.18 0.00 5.52   NA   NA
> R> dummy <- gc(); system.time(z2 <- f2(mat))
> [1] 0.1 0.0 0.1  NA  NA
> R> all.equal(z0,z1)
> [1] TRUE
> R> all.equal(z0,z2)
> [1] TRUE
> R> all.equal(z2,z1)
> [1] TRUE
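
The index-order point raised in the quoted thread can be reproduced on a small
example. The matrix below is made up for illustration (the original mat is not
shown in this message), and the plain factor() call is only assumed to stand in
for any method that numbers groups in sorted order:

```r
# Hypothetical 6-row matrix: rows 1, 3, 4 are identical, as are rows 2 and 5
mat <- matrix(c(2, 1,
                3, 1,
                2, 1,
                2, 1,
                3, 1,
                1, 5), ncol = 2, byrow = TRUE)

# One string key per row
key <- apply(mat, 1, paste, collapse = ":")

# Tags numbered by first appearance
by_first <- match(key, unique(key))

# Tags numbered by sorted key (assumed stand-in for the sorted-order method)
by_sorted <- as.integer(factor(key))

# Passing levels = unique(key) restores first-appearance numbering
by_levels <- as.integer(factor(key, levels = unique(key)))

print(by_first)   # 1 2 1 1 2 3
print(by_sorted)  # 2 3 2 2 3 1
print(by_levels)  # 1 2 1 1 2 3
```

Both numberings group the same rows together; only the labels differ, so which
one to use depends on whether the tags must follow the row order of the data.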
> 